bluesearch.mining.entity module

Classes and functions for entity extraction (aka named entity recognition).

class PatternCreator(storage=None)[source]

Bases: object

Utility class for easy handling of patterns.

Parameters

storage (None or pd.DataFrame) – If provided, _storage is populated with it automatically. If None, we start from scratch with no patterns.

_storage

A representation of all patterns that allows for comfortable sorting, filtering, etc. Each row represents a single pattern.

Type

pd.DataFrame

Examples

>>> from bluesearch.mining import PatternCreator
>>>
>>> pc = PatternCreator()
>>> pc.add("FOOD", [{"LOWER": "oreo"}])
>>> pc.add("DRINK", [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}])
>>> doc = pc("It is necessary to dip the oreo in warm milk!")
>>> [(str(e), e.label_) for e in doc.ents]
[('oreo', 'FOOD'), ('warm milk', 'DRINK')]

add(label, pattern, check_exists=True)[source]

Add a single row to the patterns.

Parameters
  • label (str) – Entity type to associate with a given pattern.

  • pattern (str or dict or list) –

    The pattern we want to match. The behavior depends on the type.

    • str: can be used for exact matching (case sensitive). We internally convert it to a single-token pattern {"TEXT": pattern}.

    • dict: a single-token pattern. This dictionary can contain at most two entries. The first one represents the attribute: value pair ("LEMMA": "world"). The second has the key "OP" and is optional. It represents the operator/quantifier to be used. An example of a valid pattern dict is {"LEMMA": "world", "OP": "+"}. Note that it would detect entities like “world” and “world world world”.

    • list: a multi-token pattern. A list of dictionaries that are of the same form as described above.

  • check_exists (bool) – If True, only patterns that do not exist yet can be added.
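
The type-dependent behavior of pattern described above can be sketched in plain Python. This is only an illustration: normalize_pattern is a hypothetical helper, not part of bluesearch, and the actual implementation may differ in validation and error handling.

```python
def normalize_pattern(pattern):
    """Hypothetical helper: turn a str, dict or list into a list of token dicts."""
    if isinstance(pattern, str):
        # A string means an exact, case-sensitive match on a single token.
        return [{"TEXT": pattern}]
    if isinstance(pattern, dict):
        # A single-token pattern becomes a one-element list.
        return [pattern]
    if isinstance(pattern, list):
        # A multi-token pattern is already a list of token dicts.
        return list(pattern)
    raise TypeError(f"Unsupported pattern type: {type(pattern).__name__}")


from_str = normalize_pattern("oreo")
from_dict = normalize_pattern({"LEMMA": "world", "OP": "+"})
from_list = normalize_pattern([{"LOWER": "warm"}, {"LOWER": "milk"}])
```

All three inputs end up in the same list-of-dictionaries form, which is the shape a multi-token pattern has.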

drop(labels)[source]

Drop one or multiple patterns.

Parameters

labels (int or list) – If int, the index of the row to be dropped. If list, a collection of row indices to be dropped.

classmethod from_jsonl(path)[source]

Load from a JSONL file.

Parameters

path (pathlib.Path) – Path to a JSONL file with patterns.

Returns

pattern_creator – Instance of a PatternCreator.

Return type

bluesearch.mining.PatternCreator

static raw2row(raw)[source]

Convert an element of the patterns list to a pd.Series.

The goal of this function is to create a pd.Series in which all entries are strings. This allows us to check for duplicates between different rows quickly.

Parameters

raw (dict) – Dictionary with two keys: “label” and “pattern”. The pattern needs to be a list of dictionaries each representing a pattern for a given token. The label is a string representing the entity type.

Returns

row – The index contains the following elements: “label”, “attribute_0”, “value_0”, “value_type_0”, “op_0”, “attribute_1”, “value_1”, “value_type_1”, “op_1”, …

Return type

pd.Series
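
To make the row layout concrete, here is a minimal sketch that flattens a raw pattern into a plain dict of strings instead of a pd.Series. The function name raw_to_flat_row is hypothetical, and two details are assumptions rather than documented behavior: that each token dictionary holds exactly one attribute besides the optional "OP", and that a missing operator is stored as an empty string.

```python
def raw_to_flat_row(raw):
    """Hypothetical stand-in for raw2row that returns a plain dict of strings."""
    row = {"label": str(raw["label"])}
    for i, token in enumerate(raw["pattern"]):
        # Assumption: one attribute per token, plus an optional "OP" entry.
        attribute, value = next((k, v) for k, v in token.items() if k != "OP")
        row[f"attribute_{i}"] = attribute
        row[f"value_{i}"] = str(value)
        row[f"value_type_{i}"] = type(value).__name__
        # Assumption: a missing operator is represented by an empty string.
        row[f"op_{i}"] = token.get("OP", "")
    return row


raw = {
    "label": "DRINK",
    "pattern": [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}],
}
row = raw_to_flat_row(raw)
```

Because every value is a string, two rows can be compared for equality directly, which is what makes duplicate detection cheap.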

static row2raw(row)[source]

Convert pd.Series to a valid pattern dictionary.

Note that value_{i} is always a string; however, we cast it to the type given by value_type_{i}. In most cases the type will be int, str or dict. Since this casting is done dynamically, we use eval.

Parameters

row (pd.Series) – The index contains the following elements: “label”, “attribute_0”, “value_0”, “value_type_0”, “op_0”, “attribute_1”, “value_1”, “value_type_1”, “op_1”, …

Returns

raw – Dictionary with two keys: “label” and “pattern”. The pattern is a list of dictionaries, each representing a pattern for a given token. The label is a string representing the entity type.

Return type

dict
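
The reverse conversion can be sketched on a plain dict of strings as follows. The name flat_row_to_raw is hypothetical, and the sketch deliberately swaps in ast.literal_eval for the eval mentioned above, since it only parses Python literals. It also assumes that an empty "op_{i}" string means no operator; the real storage format may differ.

```python
import ast


def flat_row_to_raw(row):
    """Hypothetical stand-in for row2raw operating on a plain dict of strings."""
    pattern = []
    i = 0
    while f"attribute_{i}" in row:
        value = row[f"value_{i}"]
        if row[f"value_type_{i}"] == "str":
            # Strings are stored as-is, so no parsing is needed.
            cast = value
        else:
            # literal_eval parses literals such as ints and dicts;
            # it is used here in place of eval.
            cast = ast.literal_eval(value)
        token = {row[f"attribute_{i}"]: cast}
        # Assumption: an empty string means that no operator was given.
        if row[f"op_{i}"]:
            token["OP"] = row[f"op_{i}"]
        pattern.append(token)
        i += 1
    return {"label": row["label"], "pattern": pattern}


example_row = {
    "label": "DRINK",
    "attribute_0": "LOWER",
    "value_0": "{'REGEX': '^w'}",
    "value_type_0": "dict",
    "op_0": "",
    "attribute_1": "LOWER",
    "value_1": "milk",
    "value_type_1": "str",
    "op_1": "+",
}
example_raw = flat_row_to_raw(example_row)
```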

to_df()[source]

Convert to a pd.DataFrame.

Returns

Copy of the _storage. Each row represents a single entity type pattern. All elements are strings.

Return type

pd.DataFrame

to_jsonl(path, sort_by=None)[source]

Save to JSONL.

Parameters
  • path (pathlib.Path) – File where to save it.

  • sort_by (None or list) – If None, no sorting takes place. If list, the names of the columns along which to sort.
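
For readers unfamiliar with JSONL, the format stores one JSON object per line. The sketch below writes and reads such a file using only the standard library; write_jsonl and read_jsonl are hypothetical helpers, and sorting by label merely stands in for sort_by, which in PatternCreator refers to column names.

```python
import json
import pathlib
import tempfile

patterns = [
    {"label": "FOOD", "pattern": [{"LOWER": "oreo"}]},
    {"label": "DRINK", "pattern": [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}]},
]


def write_jsonl(path, records):
    """Hypothetical helper: write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Hypothetical helper: parse every non-empty line as JSON."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "patterns.jsonl"
    # Sorting by label is an illustrative substitute for sort_by.
    write_jsonl(path, sorted(patterns, key=lambda p: p["label"]))
    n_lines = len(path.read_text(encoding="utf-8").splitlines())
    loaded = read_jsonl(path)
```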

to_list(sort_by=None)[source]

Convert to a list.

Parameters

sort_by (None or list) – If None, no sorting takes place. If list, the names of the columns along which to sort.

Returns

A list where each element represents one entity type pattern. Note that this list can be directly passed into the EntityRuler.

Return type

list

check_patterns_agree(model, patterns)[source]

Validate whether patterns of an existing model agree with given patterns.

Parameters
  • model (spacy.Language) – A model that contains an EntityRuler.

  • patterns (list) – List of patterns.

Returns

res – If True, the patterns agree.

Return type

bool

Raises

ValueError – The model does not contain an entity ruler or it contains more than one.
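
The contract above can be illustrated without spaCy. In the sketch below, patterns_agree is a hypothetical function that receives the pattern lists of all entity rulers found in a model, one list per ruler. Treating agreement as order-independent equality is an assumption; the library may compare the lists differently.

```python
import json


def patterns_agree(ruler_pattern_lists, patterns):
    """Hypothetical sketch: compare the single ruler's patterns to `patterns`."""
    n_rulers = len(ruler_pattern_lists)
    if n_rulers != 1:
        raise ValueError(f"Expected exactly one entity ruler, found {n_rulers}.")

    def canonical(items):
        # Serialize with sorted keys so that equal patterns compare equal,
        # then sort to ignore ordering (an assumption of this sketch).
        return sorted(json.dumps(item, sort_keys=True) for item in items)

    return canonical(ruler_pattern_lists[0]) == canonical(patterns)


in_model = [
    {"label": "FOOD", "pattern": [{"LOWER": "oreo"}]},
    {"label": "DRINK", "pattern": [{"LOWER": "milk"}]},
]
same = patterns_agree([in_model], list(reversed(in_model)))
different = patterns_agree([in_model], in_model[:1])
```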

global2model_patterns(patterns, entity_type)[source]

Remap entity types in the patterns to a specific model.

For each entity type in the patterns, check whether the model supports it and, if not, relabel the entity type to NaE.

Parameters
  • patterns (list) – List of patterns.

  • entity_type (str) – Entity type detected by a spacy model.

Returns

adjusted_patterns – Patterns that are supposed to be for a specific spacy model.

Return type

list
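
A minimal sketch of the relabeling idea follows. It assumes that "supported by the model" simply means the label equals entity_type, which is a reading of the description above rather than confirmed behavior; relabel_for_model is a hypothetical name.

```python
def relabel_for_model(patterns, entity_type):
    """Hypothetical sketch: keep matching labels, relabel the rest to "NaE"."""
    return [
        # Copy each pattern so that the input list is left untouched.
        {**p, "label": p["label"] if p["label"] == entity_type else "NaE"}
        for p in patterns
    ]


global_patterns = [
    {"label": "FOOD", "pattern": [{"LOWER": "oreo"}]},
    {"label": "DRINK", "pattern": [{"LOWER": "milk"}]},
]
food_patterns = relabel_for_model(global_patterns, "FOOD")
```

Relabeling instead of dropping keeps the patterns for other entity types in place under a single placeholder label.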