bluesearch.mining package

Submodules

Module contents

Subpackage for text mining.

class AttributeAnnotationTab(**kwargs)[source]

Bases: ipywidgets.widgets.widget_selectioncontainer.Tab

A tab widget for displaying attribute extractions.

It is a subclass of the ipywidgets.Tab class and contains the following four tabs:
  • Raw Text
  • Named Entities
  • Attributes
  • Table

set_text(text)[source]

Set the text for the widget.

Parameters

text (str) – The text to assign to this widget.

class AttributeExtractor(core_nlp_url, grobid_quantities_url, ee_model)[source]

Bases: object

Extract and analyze attributes in a given text.
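
A minimal construction sketch. The server URLs below are placeholders for locally running CoreNLP and Grobid Quantities services, and the spaCy pipeline standing in for ee_model is an arbitrary choice, not a package default.

>>> import spacy
>>> from bluesearch.mining import AttributeExtractor
>>>
>>> core_nlp_url = "http://localhost:9000"           # hypothetical CoreNLP server
>>> grobid_quantities_url = "http://localhost:8060"  # hypothetical Grobid Quantities server
>>> ee_model = spacy.load("en_core_web_sm")          # any spaCy pipeline with a NER component
>>> extractor = AttributeExtractor(core_nlp_url, grobid_quantities_url, ee_model)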

static annotate_quantities(text, measurements)[source]

Annotate measurements in text using HTML/CSS styles.

Parameters
  • text (str) – The text to annotate.

  • measurements (list) – The Grobid measurements for the text. It is assumed that these measurements were obtained by calling get_grobid_measurements(text).

Returns

output – The annotated text.

Return type

IPython.core.display.HTML
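
A hedged usage sketch, assuming the extractor constructed above and a reachable Grobid Quantities server; the sentence is arbitrary.

>>> text = "The sample was incubated at 37 °C for 12 hours."
>>> measurements = extractor.get_grobid_measurements(text)
>>> html = AttributeExtractor.annotate_quantities(text, measurements)
>>> # Displaying `html` in a Jupyter notebook renders the highlighted text.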

are_linked(measurement, entity, core_nlp_sentence)[source]

Determine if a measurement and an entity are linked.

Parameters
  • measurement (dict) – A Grobid measurement.

  • entity (spacy.tokens.Span) – A spacy named entity.

  • core_nlp_sentence (dict) – A CoreNLP sentence. The CoreNLP sentences can be obtained from core_nlp_response[“sentences”].

Returns

have_common_parents – Whether or not the entity is linked to the measurement.

Return type

bool

count_measurement_types(measurements)[source]

Count types of all given measurements.

Parameters

measurements (list) – A list of Grobid measurements.

Returns

all_type_counts – The counts of all measurement types.

Return type

collections.Counter

extract_attributes(text, linked_attributes_only=True, raw_attributes=False)[source]

Extract attributes from text.

Parameters
  • text (str) – The text for attribute extraction.

  • linked_attributes_only (bool) – If True, then only attributes for which there is an associated named entity are recorded.

  • raw_attributes (bool) – If True, then the resulting data frame contains all attribute information in a single column with raw Grobid measurements. If False, then the raw data frame is processed using process_raw_annotation_df.

Returns

df – A pandas data frame with extracted attributes.

Return type

pd.DataFrame
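
A usage sketch under the same assumptions (running CoreNLP and Grobid servers; the sentence is arbitrary):

>>> text = "Patients received 200 mg of drug X daily."
>>> df = extractor.extract_attributes(text)
>>> # Keep all measurements, not only those linked to a named entity,
>>> # and skip the post-processing step:
>>> df_raw = extractor.extract_attributes(
...     text, linked_attributes_only=False, raw_attributes=True
... )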

find_all_parents(dependencies, tokens_d, tokens, parent_fn=None)[source]

Find all parents of a given CoreNLP token.

Parameters
  • dependencies (list) – CoreNLP dependencies found in response[‘sentences’][idx][‘basicDependencies’].

  • tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.

  • tokens (list) – List of token indices for which parents need to be found.

  • parent_fn (function) – An implementation of a parent finding strategy. Currently the available strategies are find_compound_parents and find_nn_parents. The latter seems to perform better.

Returns

parent_ids – A list of all parents found under the given strategy for the tokens provided.

Return type

list

find_nn_parents(dependencies, tokens_d, token_idx)[source]

Parse CoreNLP dependencies to find parents of token.

To link named entities to attributes, parents for both entity tokens and attribute tokens need to be extracted. See extract_attributes for more information.

This is one possible strategy for finding parents of a given token. Ascend the dependency tree until a parent of type “NN” is found. Do this for all parents. If, as it seems, each node has at most one parent, then the result will be either one index or no indices.

Parameters
  • dependencies (list) – CoreNLP dependencies found in response[‘sentences’][idx][‘basicDependencies’].

  • tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.

  • token_idx (int) – The index of the token for which parents need to be found.

Returns

parents – A list of parents.

Return type

list

get_core_nlp_analysis(text)[source]

Send a CoreNLP query and return the result.

Parameters

text (str) – The text to analyze with CoreNLP.

Returns

response_json – The CoreNLP response.

Return type

dict

get_entity_tokens(entity, tokens)[source]

Associate a spacy entity to CoreNLP tokens.

Parameters
  • entity (spacy.tokens.Span) – A spacy entity extracted from the text. See extract_attributes for more details.

  • tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given entity.

Return type

list

get_grobid_measurements(text)[source]

Get measurements for text from the Grobid server.

Parameters

text (str) – The text for the query.

Returns

measurements – All Grobid measurements extracted from the given text.

Return type

list_like

get_measurement_tokens(measurement, tokens)[source]

Associate a Grobid measurement to CoreNLP tokens.

See get_quantity_tokens for more details.

Parameters
  • measurement (dict) – A Grobid measurement.

  • tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given measurement.

Return type

list

get_measurement_type(measurement)[source]

Get the type of a Grobid measurement.

For measurements with multiple quantities the most common type is returned. In case of ties the empty type always loses.

Parameters

measurement (dict) – A Grobid measurement.

Returns

measurement_type – The type of the Grobid measurement.

Return type

str

static get_overlapping_token_ids(start, end, tokens)[source]

Find tokens intersecting the interval [start, end).

CoreNLP breaks a given text down into sentences, and each sentence is broken down into tokens. These can be accessed by response[‘sentences’][sentence_id][‘tokens’].

Each token corresponds to a position in the original text. This method determines which tokens intersect a given slice of this text.

Parameters
  • start (int) – The left boundary of the interval.

  • end (int) – The right boundary of the interval.

  • tokens (list) – The CoreNLP sentence tokens.

Returns

ids – A list of token indices that overlap with the given interval.

Return type

list
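
An illustrative sketch. The toy tokens below mimic the CoreNLP fields this method is assumed to rely on (characterOffsetBegin / characterOffsetEnd); in practice the tokens come from get_core_nlp_analysis.

>>> tokens = [
...     {"index": 1, "word": "37", "characterOffsetBegin": 28, "characterOffsetEnd": 30},
...     {"index": 2, "word": "°C", "characterOffsetBegin": 31, "characterOffsetEnd": 33},
... ]
>>> ids = AttributeExtractor.get_overlapping_token_ids(28, 33, tokens)
>>> # Both tokens intersect the interval [28, 33), so both should be reported.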

get_quantity_tokens(quantity, tokens)[source]

Associate a Grobid quantity to CoreNLP tokens.

Both the quantity and the tokens should originate from exactly the same text.

A quantity may be composed of multiple parts, e.g. a number and a unit, and therefore correspond to multiple CoreNLP tokens.

Parameters
  • quantity (dict) – A Grobid quantity.

  • tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given quantity.

Return type

list

static get_quantity_type(quantity)[source]

Get the type of a Grobid quantity.

The top-level Grobid object is a measurement. A measurement can contain one or more quantities.

Some Grobid quantities have a type attached to them, e.g. “mass”, “concentration”, etc. This is the type that is returned. For quantities without a type an empty string is returned.

Parameters

quantity (dict) – A Grobid quantity.

Returns

quantity_type – The type of the quantity.

Return type

str

static iter_parents(dependencies, token_idx)[source]

Iterate over all parents of a token.

It seems that each node has at most one parent, and that parent == 0 means no parent.

Parameters
  • dependencies (list) – CoreNLP dependencies found in response[‘sentences’][idx][‘basicDependencies’].

  • token_idx (int) – The index of the token for which parents need to be iterated.

Yields

parent_idx (int) – The index of a parent token.

static iter_quantities(measurement)[source]

Iterate over quantities in a Grobid measurement.

Parameters

measurement (dict) – A Grobid measurement.

Yields

quantity (dict) – A Grobid quantity in the given measurement.

measurement_to_str(measurement)[source]

Convert a Grobid measurement to string.

Parameters

measurement (dict) – A Grobid measurement.

Returns

quantities – String representations of quantities in a measurement. If the measurement contains only one quantity then its string representation is returned as is. Otherwise a list of string representations of quantities is returned.

Return type

list or str

process_raw_annotation_df(df, copy=True)[source]

Add standard columns to attribute data frame.

Parameters
  • df (pd.DataFrame) – A data frame with measurements in a raw format. This can be obtained by calling extract_attributes with the parameter raw_attributes=True.

  • copy (bool) – If true then it is guaranteed that the original data frame won’t be modified.

Returns

df – A modified data frame with the raw attribute column replaced by a number of more explicit columns using the standard nomenclature.

Return type

pd.DataFrame

static quantity_to_str(quantity)[source]

Convert a Grobid quantity to string.

Parameters

quantity (dict) – A Grobid quantity.

Returns

result – A string representation of the quantity.

Return type

str

class ChemProt(model_path)[source]

Bases: bluesearch.mining.relation.REModel

Pretrained model extracting 13 relations between chemicals and proteins.

This model supports the following entity types:
  • “GGP”

  • “CHEBI”

model_

The actual model in the backend.

Type

allennlp.predictors.text_classifier.TextClassifierPredictor

Notes

This model depends on a package named scibert which is not specified in the setup.py since it introduces dependency conflicts. One can install it manually with the following command.

pip install git+https://github.com/allenai/scibert

Note that import scibert has a side effect of registering the “text_classifier” model with allennlp. This is done by applying a decorator to a class. For more details see

https://github.com/allenai/scibert/blob/06793f77d7278898159ed50da30d173cdc8fdea9/scibert/models/text_classifier.py#L14

property classes

Names of supported relation classes.

predict_probs(annotated_sentence)[source]

Predict probabilities for the relation.

property symbols

Symbols for annotation.

class PatternCreator(storage=None)[source]

Bases: object

Utility class for easy handling of patterns.

Parameters

storage (None or pd.DataFrame) – If provided, we automatically populate _storage with it. If None, then we start from scratch - no patterns.

_storage

A representation of all patterns that allows for comfortable sorting, filtering, etc. Note that each row represents a single pattern.

Type

pd.DataFrame

Examples

>>> from bluesearch.mining import PatternCreator
>>>
>>> pc = PatternCreator()
>>> pc.add("FOOD", [{"LOWER": "oreo"}])
>>> pc.add("DRINK", [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}])
>>> doc = pc("It is necessary to dip the oreo in warm milk!")
>>> [(str(e), e.label_) for e in doc.ents]
[('oreo', 'FOOD'), ('warm milk', 'DRINK')]
add(label, pattern, check_exists=True)[source]

Add a single pattern row to the storage.

Parameters
  • label (str) – Entity type to associate with a given pattern.

  • pattern (str or dict or list) –

    The pattern we want to match. The behavior depends on the type; see the sketch after this list.

    • str: can be used for exact matching (case sensitive). We internally convert it to a single-token pattern {“TEXT”: pattern}.

    • dict: a single-token pattern. This dictionary can contain at most 2 entries. The first one represents the attribute: value pair (“LEMMA”: “world”). The second has a key “OP” and is optional. It represents the operator/quantifier to be used. An example of a valid pattern dict is {“LEMMA”: “world”, “OP”: “+”}. Note that it would detect entities like “world” and “world world world”.

    • list: a multi-token pattern. A list of dictionaries that are of the same form as described above.

  • check_exists (bool) – If True, we only allow adding patterns that do not exist yet.
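
A sketch of the three accepted pattern forms (the labels and patterns are arbitrary illustrations):

>>> pc = PatternCreator()
>>> pc.add("FOOD", "oreo")                            # str: exact, case-sensitive single token
>>> pc.add("PLANET", {"LEMMA": "world", "OP": "+"})   # dict: single-token pattern with quantifier
>>> pc.add("DRINK", [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}])  # list: multi-token pattern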

drop(labels)[source]

Drop one or multiple patterns.

Parameters

labels (int or list) – If int, then it represents a single row index to be dropped. If list, then a collection of row indices to be dropped.

classmethod from_jsonl(path)[source]

Load from a JSONL file.

Parameters

path (pathlib.Path) – Path to a JSONL file with patterns.

Returns

pattern_creator – Instance of a PatternCreator.

Return type

bluesearch.mining.PatternCreator

static raw2row(raw)[source]

Convert an element of patterns list to a pd.Series.

The goal of this function is to create a pd.Series with all entries being strings. This will allow us to check for duplicates between different rows really quickly.

Parameters

raw (dict) – Dictionary with two keys: “label” and “pattern”. The pattern needs to be a list of dictionaries each representing a pattern for a given token. The label is a string representing the entity type.

Returns

row – The index contains the following elements: “label”, “attribute_0”, “value_0”, “value_type_0”, “op_0”, “attribute_1”, “value_1”, “value_type_1”, “op_1”, …

Return type

pd.Series

static row2raw(row)[source]

Convert pd.Series to a valid pattern dictionary.

Note that the value_{i} is always a string, however, we cast it to value_type_{i} type. In most cases the type will be int, str or dict. Since this casting is done dynamically we use eval.

Parameters

row (pd.Series) – The index contains the following elements: “label”, “attribute_0”, “value_0”, “value_type_0”, “op_0”, “attribute_1”, “value_1”, “value_type_1”, “op_1”, …

Returns

raw – Dictionary with two keys: “label” and “pattern”. The pattern needs to be a list of dictionaries each representing a pattern for a given token. The label is a string representing the entity type.

Return type

dict

to_df()[source]

Convert to a pd.DataFrame.

Returns

Copy of the _storage. Each row represents a single entity type pattern. All elements are strings.

Return type

pd.DataFrame

to_jsonl(path, sort_by=None)[source]

Save to JSONL.

Parameters
  • path (pathlib.Path) – The file to save to.

  • sort_by (None or list) – If None, then no sorting takes place. If list, then the names of columns along which to sort.
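
A round-trip sketch; the file path is arbitrary, and sorting by “label” assumes the column naming described in raw2row.

>>> from pathlib import Path
>>> pc.to_jsonl(Path("patterns.jsonl"), sort_by=["label"])
>>> pc_restored = PatternCreator.from_jsonl(Path("patterns.jsonl"))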

to_list(sort_by=None)[source]

Convert to a list.

Parameters

sort_by (None or list) – If None, then no sorting takes place. If list, then the names of columns along which to sort.

Returns

A list where each element represents one entity type pattern. Note that this list can be directly passed into the EntityRuler.

Return type

list

class REModel[source]

Bases: abc.ABC

Abstract interface for relationship extraction models.

Inspired by SciBERT.

abstract property classes

Names of supported relation classes.

Returns

Names of supported relation classes.

Return type

list of str

predict(annotated_sentence, return_prob=False)[source]

Predict most likely relation between subject and object.

Parameters
  • annotated_sentence (str) – Sentence with exactly 2 entities being annotated accordingly. For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”

  • return_prob (bool, optional) – If True also returns the confidence of the predicted relation.

Returns

  • relation (str) – Relation type.

  • prob (float, optional) – Confidence of the predicted relation.

abstract predict_probs(annotated_sentence)[source]

Relation probabilities between subject and object.

Predict per-class probabilities for the relation between subject and object in an annotated sentence.

Parameters

annotated_sentence (str) – Sentence with exactly 2 entities being annotated accordingly. For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”

Returns

relation_probs – Per-class probability vector. The index contains the class names, the values are the probabilities.

Return type

pd.Series

abstract property symbols

Generate dictionary mapping the two entity types to their annotation symbols.

General structure: {‘ENTITY_TYPE’: (‘SYMBOL_LEFT’, ‘SYMBOL_RIGHT’)}. Specific example: {‘GGP’: (‘[[ ’, ‘ ]]’), ‘CHEBI’: (‘<< ’, ‘ >>’)}.

Make sure that left and right symbols are not identical.

class StartWithTheSameLetter[source]

Bases: bluesearch.mining.relation.REModel

Check whether two entities start with the same letter (case insensitive).

This relation is symmetric and works on any entity type.

property classes

Names of supported relation classes.

predict_probs(annotated_sentence)[source]

Predict probabilities for the relation.

property symbols

Symbols for annotation.
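
A sketch of the REModel interface using this toy model. The annotation style shown ([[ … ]] and << … >>) is an assumption; the authoritative wrapping symbols are given by the model’s symbols property.

>>> from bluesearch.mining import StartWithTheSameLetter
>>>
>>> model = StartWithTheSameLetter()
>>> classes = model.classes   # names of the relation classes it can predict
>>> symbols = model.symbols   # how entities should be wrapped in the input sentence
>>> annotated = "[[ Cytarabine ]] inhibits << DNA polymerase >>."
>>> relation = model.predict(annotated)
>>> relation, prob = model.predict(annotated, return_prob=True)
>>> probs = model.predict_probs(annotated)   # pd.Series indexed by class names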

class TextCollectionWidget(**kwargs)[source]

Bases: ipywidgets.widgets.widget_box.VBox

A widget displaying annotations for a number of texts.

The text can be selected using a slider and the annotation results will be displayed in an AttributeAnnotationTab widget.

annotate(doc, sent, ent_1, ent_2, etype_symbols)[source]

Annotate sentence given two entities.

Parameters
  • doc (spacy.tokens.Doc) – The entire document (input text). Note that spacy uses it for absolute referencing.

  • sent (spacy.tokens.Span) – One sentence from the doc where we look for relations.

  • ent_1 (spacy.tokens.Span) – The first entity in the sentence. One can get its type by using the label_ attribute.

  • ent_2 (spacy.tokens.Span) – The second entity in the sentence. One can get its type by using the label_ attribute.

  • etype_symbols (dict or defaultdict) – Keys represent different entity types (“GGP”, “CHEBI”) and the values are tuples of size 2. Each of these tuples represents the starting and ending symbol to wrap the recognized entity with. Each REModel has the symbols property that encodes how its inputs should be annotated.

Returns

result – String representing an annotated sentence created out of the original one.

Return type

str

Notes

The implementation is non-trivial because an entity can span multiple words.
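
A hedged sketch of annotating one sentence, assuming annotate is importable from bluesearch.mining like the other module-level helpers listed here. The spaCy pipeline and the sentence are arbitrary; the entity types it detects are assumed to be covered by the chosen REModel’s symbols (StartWithTheSameLetter accepts any entity type).

>>> import spacy
>>> from bluesearch.mining import StartWithTheSameLetter, annotate
>>>
>>> nlp = spacy.load("en_core_web_sm")        # placeholder entity model
>>> doc = nlp("Apple opened an office in Austin.")
>>> sent = list(doc.sents)[0]
>>> ent_1, ent_2 = doc.ents[0], doc.ents[1]   # e.g. an ORG and a GPE entity
>>> etype_symbols = StartWithTheSameLetter().symbols  # assumed to cover any entity type
>>> annotated = annotate(doc, sent, ent_1, ent_2, etype_symbols)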

annotations2df(annots_files, not_entity_symbol='O')[source]

Convert prodigy annotations in JSONL format into a pd.DataFrame.

Parameters
  • annots_files (str, list of str, path or list of path) – Name of the annotation file(s) to load.

  • not_entity_symbol (str) – A symbol to use for tokens that are not an entity.

Returns

final_table – Each row represents one token, the columns are ‘source’, ‘sentence_id’, ‘class’, ‘start_char’, ‘end_char’, ‘id’, ‘text’.

Return type

pd.DataFrame
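
A usage sketch; the file names are placeholders for Prodigy JSONL exports.

>>> from bluesearch.mining import annotations2df
>>>
>>> df = annotations2df("annotations.jsonl")
>>> df_all = annotations2df(["batch_1.jsonl", "batch_2.jsonl"], not_entity_symbol="O")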

check_patterns_agree(model, patterns)[source]

Validate whether patterns of an existing model agree with given patterns.

Parameters
  • model (spacy.Language) – A model that contains an EntityRuler.

  • patterns (list) – List of patterns.

Returns

res – If True, the patterns agree.

Return type

bool

Raises

ValueError – The model does not contain an entity ruler, or it contains more than one.

global2model_patterns(patterns, entity_type)[source]

Remap entity types in the patterns to a specific model.

For each entity type in the patterns, check whether the model supports it; if not, relabel the entity type to NaE.

Parameters
  • patterns (list) – List of patterns.

  • entity_type (str) – Entity type detected by a spacy model.

Returns

adjusted_patterns – Patterns that are supposed to be for a specific spacy model.

Return type

list

run_pipeline(texts, model_entities, models_relations, debug=False, excluded_entity_type='NaE')[source]

Run end-to-end extractions.

Parameters
  • texts (iterable) –

    The elements in texts are tuples where the first element is the text to be processed and the second element is a dictionary with arbitrary metadata for the text. Each key in this dictionary will be used to construct a new column in the output data frame and the values will appear in the corresponding rows.

    Note that if debug=False then the output data frame will have exactly the columns specified by SPECS. That means that some columns produced by the entries in metadata might be dropped, and some empty columns might be added.

  • model_entities (spacy.lang.en.English) – Spacy model. Note that this model defines entity types.

  • models_relations (dict) – The keys are pairs (two element tuples) of entity types (i.e. (‘GGP’, ‘CHEBI’)). The first entity type is the subject and the second one is the object. Note that the entity types should correspond to those inside of model_entities. The value is a list of instances of relation extraction models, that is instances of some subclass of REModel.

  • debug (bool) – If True, the columns do not necessarily match the specification, but they contain debugging information. If False, the columns match the specification exactly.

  • excluded_entity_type (str or None) – If a str, then all entities with this type will be excluded. If None, then no exclusion takes place.

Returns

The final table. If debug=True then it contains all the metadata. If False then it only contains columns in the official specification.

Return type

pd.DataFrame
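
An end-to-end sketch. The spaCy pipeline, the metadata key, and the entity-type pair are illustrative assumptions; in practice the entity model and the keys of models_relations must agree on the entity types.

>>> import spacy
>>> from bluesearch.mining import StartWithTheSameLetter, run_pipeline
>>>
>>> model_entities = spacy.load("en_core_web_sm")   # placeholder NER model
>>> texts = [
...     ("Apple opened an office in Austin.", {"source_id": "doc-1"}),
... ]
>>> models_relations = {("ORG", "GPE"): [StartWithTheSameLetter()]}
>>> df = run_pipeline(texts, model_entities, models_relations, debug=True)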

spacy2df(spacy_model, ground_truth_tokenization, not_entity_symbol='O', excluded_entity_type='NaE')[source]

Turn NER of a spacy model into a pd.DataFrame.

Parameters
  • spacy_model (spacy.language.Language) – Spacy model that will be used for NER, EntityRuler and Tagger (not tokenization). Note that a Tagger might be necessary for the EntityRuler.

  • ground_truth_tokenization (list) – List of str (words) representing the ground truth tokenization. This will guarantee that the ground truth dataframe will be aligned with the prediction dataframe.

  • not_entity_symbol (str) – A symbol to use for tokens that are not a part of any entity. Note that this symbol will be used for all tokens for which the ent_iob_ attribute of spacy.Token is equal to “O”.

  • excluded_entity_type (str or None) – Entity type that is going to be automatically excluded. Note that it is different from not_entity_symbol since it corresponds to the label_ attribute of spacy.Span objects. If None, then no exclusion takes place.

Returns

Each row represents one token, the columns are ‘text’ and ‘class’.

Return type

pd.DataFrame

Notes

One should run annotations2df first in order to obtain the ground_truth_tokenization. In that case, ground_truth_tokenization=prodigy_table[‘text’].to_list().
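
A sketch following the note above; the annotation file and the spaCy pipeline are placeholders.

>>> import spacy
>>> from bluesearch.mining import annotations2df, spacy2df
>>>
>>> prodigy_table = annotations2df("annotations.jsonl")
>>> ground_truth_tokenization = prodigy_table["text"].to_list()
>>> spacy_model = spacy.load("en_core_web_sm")
>>> pred_df = spacy2df(spacy_model, ground_truth_tokenization)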