bluesearch.mining.entity module
Classes and functions for entity extraction (aka named entity recognition).
- class PatternCreator(storage=None)
Bases: object
Utility class for easy handling of patterns.
- Parameters
storage (None or pd.DataFrame) – If provided, we automatically populate _storage with it. If None, then we start from scratch - no patterns.
- _storage
A representation of all patterns that allows for comfortable sorting, filtering, etc. Each row represents a single pattern.
- Type
pd.DataFrame
Examples
>>> from bluesearch.mining import PatternCreator
>>>
>>> pc = PatternCreator()
>>> pc.add("FOOD", [{"LOWER": "oreo"}])
>>> pc.add("DRINK", [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}])
>>> doc = pc("It is necessary to dip the oreo in warm milk!")
>>> [(str(e), e.label_) for e in doc.ents]
[('oreo', 'FOOD'), ('warm milk', 'DRINK')]
- add(label, pattern, check_exists=True)
Add a single row to the patterns.
- Parameters
label (str) – Entity type to associate with a given pattern.
pattern (str or dict or list) –
The pattern we want to match. The behavior depends on the type.
str: can be used for exact matching (case sensitive). We internally convert it to a single-token pattern {"TEXT": pattern}.
dict: a single-token pattern. This dictionary can contain at most 2 entries. The first one represents the attribute: value pair ("LEMMA": "world"). The second has a key "OP" and is optional. It represents the operator/quantifier to be used. An example of a valid pattern dict is {"LEMMA": "world", "OP": "+"}. Note that it would detect entities like "world" and "world world world".
list: a multi-token pattern. A list of dictionaries that are of the same form as described above.
check_exists (bool) – If True, only patterns that do not exist yet can be added.
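The three accepted pattern types can be illustrated with a minimal, standard-library-only sketch. The helper name normalize_pattern is hypothetical and is not part of bluesearch; it only mirrors the conversion rules described above (a str becomes a single {"TEXT": ...} token, a dict becomes a one-element list, a list is kept), and the check on the number of entries is an assumption about how "at most 2 entries" might be enforced.

```python
def normalize_pattern(pattern):
    """Return a multi-token pattern (list of dicts) for any accepted input.

    Hypothetical helper for illustration only, not the bluesearch code.
    """
    if isinstance(pattern, str):
        # Exact, case-sensitive match on a single token.
        tokens = [{"TEXT": pattern}]
    elif isinstance(pattern, dict):
        # A single-token pattern becomes a one-element list.
        tokens = [dict(pattern)]
    elif isinstance(pattern, list):
        tokens = [dict(token) for token in pattern]
    else:
        raise TypeError(f"Unsupported pattern type: {type(pattern).__name__}")

    for token in tokens:
        # Assumed rule: exactly one attribute plus an optional "OP" entry.
        if len(set(token) - {"OP"}) != 1:
            raise ValueError(f"Invalid single-token pattern: {token}")
    return tokens


print(normalize_pattern("oreo"))
print(normalize_pattern({"LEMMA": "world", "OP": "+"}))
```

Normalizing every input to the list form means the rest of the code only has to handle one shape.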
- drop(labels)
Drop one or multiple patterns.
- Parameters
labels (int or list) – If int, it represents a row index to be dropped. If list, it is a collection of row indices to be dropped.
- classmethod from_jsonl(path)
Load from a JSONL file.
- Parameters
path (pathlib.Path) – Path to a JSONL file with patterns.
- Returns
pattern_creator – Instance of a PatternCreator.
- Return type
PatternCreator
- static raw2row(raw)
Convert an element of patterns list to a pd.Series.
The goal of this function is to create a pd.Series with all entries being strings. This will allow us to check for duplicates between different rows really quickly.
- Parameters
raw (dict) – Dictionary with two keys: “label” and “pattern”. The pattern needs to be a list of dictionaries each representing a pattern for a given token. The label is a string representing the entity type.
- Returns
row – The index contains the following elements: “label”, “attribute_0”, “value_0”, “value_type_0”, “op_0”, “attribute_1”, “value_1”, “value_type_1”, “op_1”, …
- Return type
pd.Series
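The flattening described above can be sketched without pandas by using a plain dict in place of the pd.Series. The function name raw_to_flat_row is hypothetical, and two details are assumptions rather than documented behavior: a missing "OP" is stored as an empty string, and non-string values are stringified with str().

```python
def raw_to_flat_row(raw):
    """Flatten {"label": ..., "pattern": [...]} into a dict of strings.

    Hypothetical stand-in for the pd.Series produced by raw2row.
    """
    row = {"label": raw["label"]}
    for i, token in enumerate(raw["pattern"]):
        token = dict(token)           # copy, so the input is not mutated
        op = token.pop("OP", "")      # assumption: "" encodes "no operator"
        if len(token) != 1:
            raise ValueError(f"Expected one attribute per token, got {token}")
        (attribute, value), = token.items()
        row[f"attribute_{i}"] = attribute
        row[f"value_{i}"] = str(value)
        row[f"value_type_{i}"] = type(value).__name__
        row[f"op_{i}"] = op
    return row


raw = {
    "label": "DRINK",
    "pattern": [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk", "OP": "+"}],
}
row = raw_to_flat_row(raw)
print(row)
```

Because every entry is a string, two rows can be compared for equality directly, which is what makes duplicate detection cheap.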
- static row2raw(row)
Convert pd.Series to a valid pattern dictionary.
Note that value_{i} is always a string; however, we cast it to the type named by value_type_{i}. In most cases the type will be int, str or dict. Since this casting is done dynamically, we use eval.
- Parameters
row (pd.Series) – The index contains the following elements: “label”, “attribute_0”, “value_0”, “value_type_0”, “op_0”, “attribute_1”, “value_1”, “value_type_1”, “op_1”, …
- Returns
raw – Dictionary with two keys: “label” and “pattern”. The pattern needs to be a list of dictionaries each representing a pattern for a given token. The label is a string representing the entity type.
- Return type
dict
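A minimal sketch of the reverse conversion, again with a plain dict standing in for the pd.Series. The name flat_row_to_raw is hypothetical. Note that this sketch swaps in ast.literal_eval for the eval mentioned above, since literal_eval only accepts Python literals; it also assumes that an empty op_{i} means no operator was set.

```python
import ast


def flat_row_to_raw(row):
    """Rebuild {"label": ..., "pattern": [...]} from a flat dict of strings.

    Hypothetical helper; uses ast.literal_eval instead of eval.
    """
    pattern = []
    i = 0
    while f"attribute_{i}" in row:
        value = row[f"value_{i}"]
        if row[f"value_type_{i}"] != "str":
            # Parse literals such as "3" or "{'REGEX': '^w'}" back to objects.
            value = ast.literal_eval(value)
        token = {row[f"attribute_{i}"]: value}
        if row[f"op_{i}"]:  # assumption: "" means no operator
            token["OP"] = row[f"op_{i}"]
        pattern.append(token)
        i += 1
    return {"label": row["label"], "pattern": pattern}


flat = {
    "label": "DRINK",
    "attribute_0": "LOWER", "value_0": "{'REGEX': '^w'}",
    "value_type_0": "dict", "op_0": "",
    "attribute_1": "LOWER", "value_1": "milk",
    "value_type_1": "str", "op_1": "+",
}
restored_raw = flat_row_to_raw(flat)
print(restored_raw)
```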
- to_df()
Convert to a pd.DataFrame.
- Returns
Copy of the _storage. Each row represents a single entity type pattern. All elements are strings.
- Return type
pd.DataFrame
- to_jsonl(path, sort_by=None)
Save to JSONL.
- Parameters
path (pathlib.Path) – File where to save it.
sort_by (None or list) – If None, no sorting takes place. If list, it contains the names of the columns to sort by.
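For readers unfamiliar with the format, JSONL stores one JSON object per line. The following sketch writes a list of pattern dictionaries to a JSONL file and reads it back using only the standard library. The helpers write_jsonl and read_jsonl are hypothetical and only illustrate the file layout; they are not the bluesearch implementation and they ignore sort_by.

```python
import json
import pathlib
import tempfile


def write_jsonl(patterns, path):
    """Write one JSON object per line (hypothetical helper)."""
    with pathlib.Path(path).open("w", encoding="utf-8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts (hypothetical helper)."""
    with pathlib.Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


patterns = [
    {"label": "FOOD", "pattern": [{"LOWER": "oreo"}]},
    {"label": "DRINK", "pattern": [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "patterns.jsonl"
    write_jsonl(patterns, path)
    n_lines = len(path.read_text(encoding="utf-8").splitlines())
    restored = read_jsonl(path)

print(n_lines, restored == patterns)
```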
- to_list(sort_by=None)
Convert to a list.
- Parameters
sort_by (None or list) – If None, no sorting takes place. If list, it contains the names of the columns to sort by.
- Returns
A list where each element represents one entity type pattern. Note that this list can be directly passed into the EntityRuler.
- Return type
list
- check_patterns_agree(model, patterns)
Validate whether patterns of an existing model agree with given patterns.
- Parameters
model (spacy.Language) – A model that contains an EntityRuler.
patterns (list) – List of patterns.
- Returns
res – If True, the patterns agree.
- Return type
bool
- Raises
ValueError – The model does not contain an entity ruler or it contains more than 1.
- global2model_patterns(patterns, entity_type)
Remap entity types in the patterns to a specific model.
For each entity type in the patterns, check whether the model supports it; if it does not, relabel the entity type to NaE.
- Parameters
patterns (list) – List of patterns.
entity_type (str) – Entity type detected by a spacy model.
- Returns
adjusted_patterns – Patterns that are supposed to be for a specific spacy model.
- Return type
list
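The relabeling rule can be sketched as follows. This is a sketch under a stated assumption, not the library code: it assumes that "the model supports it" means the pattern's label equals entity_type. The function name relabel_for_model and the fallback parameter are hypothetical.

```python
import copy


def relabel_for_model(patterns, entity_type, fallback="NaE"):
    """Relabel every pattern whose label differs from entity_type.

    Hypothetical helper; assumes a model supports exactly one entity type.
    """
    adjusted = copy.deepcopy(patterns)  # leave the caller's list untouched
    for pattern in adjusted:
        if pattern["label"] != entity_type:
            pattern["label"] = fallback
    return adjusted


global_patterns = [
    {"label": "FOOD", "pattern": [{"LOWER": "oreo"}]},
    {"label": "DRINK", "pattern": [{"LOWER": "milk"}]},
]
food_patterns = relabel_for_model(global_patterns, "FOOD")
print([p["label"] for p in food_patterns])
```

Working on a deep copy keeps the global pattern list reusable for other models.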