bluesearch.mining package¶
Submodules¶
Module contents¶
Subpackage for text mining.
- class AttributeAnnotationTab(**kwargs)[source]¶
Bases:
ipywidgets.widgets.widget_selectioncontainer.Tab
A tab widget for displaying attribute extractions.
It is a subclass of the ipywidgets.Tab class and contains the following four tabs: Raw Text, Named Entities, Attributes, and Table.
- class AttributeExtractor(core_nlp_url, grobid_quantities_url, ee_model)[source]¶
Bases:
object
Extract and analyze attributes in a given text.
- static annotate_quantities(text, measurements)[source]¶
Annotate measurements in text using HTML/CSS styles.
- Parameters
text (str) – The text to annotate.
measurements (list) – The Grobid measurements for the text. It is assumed that these measurements were obtained by calling get_grobid_measurements(text).
- Returns
output – The annotated text.
- Return type
IPython.core.display.HTML
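The annotation boils down to wrapping character spans of the text in styled HTML tags. A minimal sketch, assuming the measurement offsets have already been pulled out as `(start, end)` pairs (the real method extracts them from the Grobid measurement dicts):

```python
from html import escape

def annotate_spans(text, spans):
    """Wrap each (start, end) character span of `text` in a styled <span>.

    `spans` is a sorted list of non-overlapping (start, end) pairs, a
    stand-in for the offsets carried by Grobid measurements.
    """
    parts = []
    cursor = 0
    for start, end in spans:
        parts.append(escape(text[cursor:start]))
        parts.append('<span style="background-color: yellow">')
        parts.append(escape(text[start:end]))
        parts.append("</span>")
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)

text = "The dose was 5 mg per day."
html = annotate_spans(text, [(13, 17)])  # highlight "5 mg"
```

The resulting string could then be wrapped in IPython.core.display.HTML for notebook rendering.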
- are_linked(measurement, entity, core_nlp_sentence)[source]¶
Determine if a measurement and an entity are linked.
- Parameters
measurement (dict) – A Grobid measurement.
entity (spacy.tokens.Span) – A spacy named entity.
core_nlp_sentence (dict) – A CoreNLP sentence. The CoreNLP sentences can be obtained from core_nlp_response["sentences"].
- Returns
have_common_parents – Whether or not the entity is linked to the measurement.
- Return type
bool
- count_measurement_types(measurements)[source]¶
Count types of all given measurements.
- Parameters
measurements (list) – A list of Grobid measurements.
- Returns
all_type_counts – The counts of all measurement types.
- Return type
collections.Counter
- extract_attributes(text, linked_attributes_only=True, raw_attributes=False)[source]¶
Extract attributes from text.
- Parameters
text (str) – The text for attribute extraction.
linked_attributes_only (bool) – If true, then only those attributes are recorded for which there is an associated named entity.
raw_attributes (bool) – If true, then the resulting data frame will contain all attribute information in one single column with raw Grobid measurements. If false, then the raw data frame will be processed using process_raw_annotation_df.
- Returns
df – A pandas data frame with extracted attributes.
- Return type
pd.DataFrame
- find_all_parents(dependencies, tokens_d, tokens, parent_fn=None)[source]¶
Find all parents of a given CoreNLP token.
- Parameters
dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.
tokens (list) – List of token indices for which parents need to be found.
parent_fn (function) – An implementation of a parent finding strategy. Currently the available strategies are find_compound_parents and find_nn_parents. The latter seems to perform better.
- Returns
parent_ids – A list of all parents found under the given strategy for the tokens provided.
- Return type
list
- find_nn_parents(dependencies, tokens_d, token_idx)[source]¶
Parse CoreNLP dependencies to find parents of token.
To link named entities to attributes, parents of both entity tokens and attribute tokens need to be extracted. See extract_attributes for more information.
This is one possible strategy for finding parents of a given token: ascend the dependency tree until a parent of type "NN" is found, and do this for all parents. If, as it seems, each node has at most one parent, then the result will be either one index or no indices.
- Parameters
dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.
token_idx (int) – The index of the token for which parents need to be found.
- Returns
parents – A list of parents.
- Return type
list
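The strategy described above can be sketched on toy data. The `governor`/`dependent` keys mirror CoreNLP's `basicDependencies` entries (governor 0 marks the root); the token dicts and POS tags here are illustrative, not actual CoreNLP output:

```python
def find_nn_parents(dependencies, tokens_d, token_idx):
    """Ascend the dependency tree from `token_idx` until a parent whose
    POS tag starts with "NN" is found; return it as a one-element list,
    or an empty list if no such parent exists."""
    parents = []
    idx = token_idx
    while True:
        governors = [d["governor"] for d in dependencies if d["dependent"] == idx]
        if not governors or governors[0] == 0:
            break
        idx = governors[0]
        if tokens_d[idx]["pos"].startswith("NN"):
            parents.append(idx)
            break
    return parents

# Toy sentence fragment: token 1 ("5") -> token 2 ("mg") -> token 3 ("dose") -> root.
deps = [
    {"dependent": 1, "governor": 2},
    {"dependent": 2, "governor": 3},
    {"dependent": 3, "governor": 0},
]
tokens_d = {1: {"pos": "CD"}, 2: {"pos": "NN"}, 3: {"pos": "NN"}}
```

With this toy tree, `find_nn_parents(deps, tokens_d, 1)` stops at token 2, the first noun parent encountered.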
- get_core_nlp_analysis(text)[source]¶
Send a CoreNLP query and return the result.
- Parameters
text (str) – The text to analyze with CoreNLP.
- Returns
response_json – The CoreNLP response.
- Return type
dict
- get_entity_tokens(entity, tokens)[source]¶
Associate a spacy entity to CoreNLP tokens.
- Parameters
entity (spacy.tokens.Span) – A spacy entity extracted from the text. See extract_attributes for more details.
tokens (list) – CoreNLP tokens.
- Returns
ids – A list of CoreNLP token IDs corresponding to the given entity.
- Return type
list
- get_grobid_measurements(text)[source]¶
Get measurements for text from the Grobid server.
- Parameters
text (str) – The text for the query.
- Returns
measurements – All Grobid measurements extracted from the given text.
- Return type
list_like
- get_measurement_tokens(measurement, tokens)[source]¶
Associate a Grobid measurement to CoreNLP tokens.
See get_quantity_tokens for more details.
- Parameters
measurement (dict) – A Grobid measurement.
tokens (list) – CoreNLP tokens.
- Returns
ids – A list of CoreNLP token IDs corresponding to the given quantity.
- Return type
list
- get_measurement_type(measurement)[source]¶
Get the type of a Grobid measurement.
For measurements with multiple quantities the most common type is returned. In case of ties the empty type always loses.
- Parameters
measurement (dict) – A Grobid measurement.
- Returns
measurement_type – The type of the Grobid measurement.
- Return type
str
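The majority-vote with the empty type losing ties can be sketched as follows. Here a plain list of type strings stands in for `get_quantity_type(q)` applied to each quantity of the measurement:

```python
from collections import Counter

def get_measurement_type(quantity_types):
    """Pick the most common quantity type; on a tie the empty type loses.

    `quantity_types` is a list of type strings, one per quantity in a
    Grobid measurement (an illustrative stand-in for the real input).
    """
    counts = Counter(quantity_types)
    # Sort by count descending; on equal counts, rank the empty type last.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0] == ""))
    return ranked[0][0]
```

For example, a tie between "mass" and the empty type resolves to "mass", while a clear majority of untyped quantities still yields the empty type.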
- static get_overlapping_token_ids(start, end, tokens)[source]¶
Find tokens intersecting the interval [start, end).
CoreNLP breaks a given text down into sentences, and each sentence is broken down into tokens. These can be accessed by response[‘sentences’][sentence_id][‘tokens’].
Each token corresponds to a position in the original text. This method determines which tokens would intersect a given slice of this text.
- Parameters
start (int) – The left boundary of the interval.
end (int) – The right boundary of the interval.
tokens (list) – The CoreNLP sentence tokens.
- Returns
ids – A list of token indices that overlap with the given interval.
- Return type
list
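The overlap test is the standard half-open interval intersection. A sketch, assuming tokens shaped like CoreNLP's (with `characterOffsetBegin`, `characterOffsetEnd` and `index` keys):

```python
def get_overlapping_token_ids(start, end, tokens):
    """Return indices of tokens whose character span intersects [start, end).

    Two half-open intervals [a, b) and [c, d) overlap iff a < d and c < b.
    """
    return [
        t["index"]
        for t in tokens
        if t["characterOffsetBegin"] < end and start < t["characterOffsetEnd"]
    ]

tokens = [
    {"index": 1, "characterOffsetBegin": 0, "characterOffsetEnd": 3},
    {"index": 2, "characterOffsetBegin": 4, "characterOffsetEnd": 8},
    {"index": 3, "characterOffsetBegin": 9, "characterOffsetEnd": 12},
]
```

Querying the slice [5, 10) picks up tokens 2 and 3 but not token 1.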
- get_quantity_tokens(quantity, tokens)[source]¶
Associate a Grobid quantity to CoreNLP tokens.
Both the quantity and the tokens should originate from exactly the same text.
A quantity may be composed of multiple parts, e.g. a number and a unit, and therefore correspond to multiple CoreNLP tokens.
- Parameters
quantity (dict) – A Grobid quantity.
tokens (list) – CoreNLP tokens.
- Returns
ids – A list of CoreNLP token IDs corresponding to the given quantity.
- Return type
list
- static get_quantity_type(quantity)[source]¶
Get the type of a Grobid quantity.
The top-level Grobid object is a measurement. A measurement can contain one or more quantities.
Some Grobid quantities have a type attached to them, e.g. “mass”, “concentration”, etc. This is the type that is returned. For quantities without a type an empty string is returned.
- Parameters
quantity (dict) – A Grobid quantity.
- Returns
quantity_type – The type of the quantity.
- Return type
str
- static iter_parents(dependencies, token_idx)[source]¶
Iterate over all parents of a token.
It seems that each node has at most one parent, and parent == 0 means there is no parent.
- Parameters
dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
token_idx (int) – The index of the token for which parents need to be iterated.
- Yields
parent_idx (int) – The index of a parent token.
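A generator sketch of this upward walk, using the `governor`/`dependent` keys of CoreNLP's `basicDependencies` (the toy dependency list below is illustrative):

```python
def iter_parents(dependencies, token_idx):
    """Yield the chain of parent indices of `token_idx`.

    A governor of 0 marks the root, i.e. no further parent.
    """
    idx = token_idx
    while True:
        governors = [d["governor"] for d in dependencies if d["dependent"] == idx]
        if not governors or governors[0] == 0:
            return
        idx = governors[0]
        yield idx

deps = [
    {"dependent": 1, "governor": 2},
    {"dependent": 2, "governor": 3},
    {"dependent": 3, "governor": 0},
]
```

Iterating from token 1 yields its parent 2, then the grandparent 3, then stops at the root.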
- static iter_quantities(measurement)[source]¶
Iterate over quantities in a Grobid measurement.
- Parameters
measurement (dict) – A Grobid measurement.
- Yields
quantity (dict) – A Grobid quantity in the given measurement.
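A sketch of such an iterator, assuming the grobid-quantities response shape where a single value sits under "quantity", an interval under "quantityLeast"/"quantityMost", and a list of values under "quantities" (key names are an assumption about the Grobid schema):

```python
def iter_quantities(measurement):
    """Yield every quantity dict found in a Grobid measurement."""
    # Single-quantity keys, checked in a fixed order.
    for key in ("quantity", "quantityLeast", "quantityMost"):
        if key in measurement:
            yield measurement[key]
    # List-valued measurements.
    for quantity in measurement.get("quantities", []):
        yield quantity

interval = {
    "type": "interval",
    "quantityLeast": {"rawValue": "5", "type": "mass"},
    "quantityMost": {"rawValue": "10", "type": "mass"},
}
```

For the interval above, the iterator yields the lower then the upper bound quantity.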
- measurement_to_str(measurement)[source]¶
Convert a Grobid measurement to string.
- Parameters
measurement (dict) – A Grobid measurement.
- Returns
quantities – String representations of quantities in a measurement. If the measurement contains only one quantity, then its string representation is returned as is. Otherwise a list of string representations of quantities is returned.
- Return type
list or str
- process_raw_annotation_df(df, copy=True)[source]¶
Add standard columns to attribute data frame.
- Parameters
df (pd.DataFrame) – A data frame with measurements in a raw format. This can be obtained by calling extract_attributes with the parameter raw_attributes=True.
copy (bool) – If true then it is guaranteed that the original data frame won’t be modified.
- Returns
df – A modified data frame with the raw attribute column replaced by a number of more explicit columns using the standard nomenclature.
- Return type
pd.DataFrame
- class ChemProt(model_path)[source]¶
Bases:
bluesearch.mining.relation.REModel
Pretrained model extracting 13 relations between chemicals and proteins.
- This model supports the following entity types:
“GGP”
“CHEBI”
- model_¶
The actual model in the backend.
- Type
allennlp.predictors.text_classifier.TextClassifierPredictor
Notes
This model depends on a package named scibert which is not specified in setup.py since it introduces dependency conflicts. One can install it manually with the following command:
pip install git+https://github.com/allenai/scibert
Note that importing scibert has the side effect of registering the "text_classifier" model with allennlp. This is done by applying a decorator to a class.
- property classes¶
Names of supported relation classes.
- property symbols¶
Symbols for annotation.
- class PatternCreator(storage=None)[source]¶
Bases:
object
Utility class for easy handling of patterns.
- Parameters
storage (None or pd.DataFrame) – If provided, we automatically populate _storage with it. If None, then we start from scratch - no patterns.
- _storage¶
A representation of all patterns allows for comfortable sorting, filtering, etc. Note that each row represents a single pattern.
- Type
pd.DataFrame
Examples
>>> from bluesearch.mining import PatternCreator
>>> pc = PatternCreator()
>>> pc.add("FOOD", [{"LOWER": "oreo"}])
>>> pc.add("DRINK", [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk"}])
>>> doc = pc("It is necessary to dip the oreo in warm milk!")
>>> [(str(e), e.label_) for e in doc.ents]
[('oreo', 'FOOD'), ('warm milk', 'DRINK')]
- add(label, pattern, check_exists=True)[source]¶
Add a single row to the patterns.
- Parameters
label (str) – Entity type to associate with a given pattern.
pattern (str or dict or list) –
The pattern we want to match. The behavior depends on the type.
str: used for exact matching (case sensitive). We internally convert it to a single-token pattern {"TEXT": pattern}.
dict: a single-token pattern. This dictionary can contain at most 2 entries. The first one represents the attribute: value pair ("LEMMA": "world"). The second has the key "OP", is optional, and represents the operator/quantifier to be used. An example of a valid pattern dict is {"LEMMA": "world", "OP": "+"}. Note that it would detect entities like "world" and "world world world".
list: a multi-token pattern, i.e. a list of dictionaries of the same form as described above.
check_exists (bool) – If True, we only allow to add patterns that do not exist yet.
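The normalization of the three accepted `pattern` types can be sketched as a small helper (the function name is hypothetical; it mirrors the str/dict/list behavior described above):

```python
def normalize_pattern(pattern):
    """Normalize the `pattern` argument of `add` into a multi-token list.

    A str becomes an exact single-token match, a dict is wrapped as a
    one-token pattern, and a list is taken as-is.
    """
    if isinstance(pattern, str):
        return [{"TEXT": pattern}]
    if isinstance(pattern, dict):
        return [pattern]
    return list(pattern)
```

The resulting list is the multi-token form that spaCy's EntityRuler expects for a single pattern.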
- drop(labels)[source]¶
Drop one or multiple patterns.
- Parameters
labels (int or list) – If an int, then it represents a row index to be dropped. If a list, then a collection of row indices to be dropped.
- classmethod from_jsonl(path)[source]¶
Load from a JSONL file.
- Parameters
path (pathlib.Path) – Path to a JSONL file with patterns.
- Returns
pattern_creator – Instance of a PatternCreator.
- Return type
PatternCreator
- static raw2row(raw)[source]¶
Convert an element of patterns list to a pd.Series.
The goal of this function is to create a pd.Series with all entries being strings. This will allow us to check for duplicates between different rows really quickly.
- Parameters
raw (dict) – Dictionary with two keys: “label” and “pattern”. The pattern needs to be a list of dictionaries each representing a pattern for a given token. The label is a string representing the entity type.
- Returns
row – The index contains the following elements: “label”, “attribute_0”, “value_0”, “value_type_0”, “op_0”, “attribute_1”, “value_1”, “value_type_1”, “op_1”, …
- Return type
pd.Series
- static row2raw(row)[source]¶
Convert pd.Series to a valid pattern dictionary.
Note that value_{i} is always a string; however, we cast it to the value_type_{i} type. In most cases the type will be int, str, or dict. Since this casting is done dynamically, we use eval.
- Parameters
row (pd.Series) – The index contains the following elements: "label", "attribute_0", "value_0", "value_type_0", "op_0", "attribute_1", "value_1", "value_type_1", "op_1", …
- Returns
raw – Dictionary with two keys: “label” and “pattern”. The pattern needs to be a list of dictionaries each representing a pattern for a given token. The label is a string representing the entity type.
- Return type
dict
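The raw2row/row2raw pair can be sketched as a round trip on plain dicts (a stand-in for pd.Series; the real implementation uses eval for the dynamic cast, replaced here by the safer `ast.literal_eval`):

```python
import ast

def raw2row(raw):
    """Flatten a {"label", "pattern"} dict into a flat all-string mapping
    with attribute_{i}/value_{i}/value_type_{i}/op_{i} entries."""
    row = {"label": raw["label"]}
    for i, token_pattern in enumerate(raw["pattern"]):
        # Exactly one non-"OP" entry is assumed per token pattern.
        attribute, value = next((k, v) for k, v in token_pattern.items() if k != "OP")
        row[f"attribute_{i}"] = attribute
        row[f"value_{i}"] = str(value)
        row[f"value_type_{i}"] = type(value).__name__
        row[f"op_{i}"] = token_pattern.get("OP", "")
    return row

def row2raw(row):
    """Invert raw2row, casting each value back to its recorded type."""
    pattern = []
    i = 0
    while f"attribute_{i}" in row:
        value = row[f"value_{i}"]
        if row[f"value_type_{i}"] != "str":
            value = ast.literal_eval(value)
        token_pattern = {row[f"attribute_{i}"]: value}
        if row[f"op_{i}"]:
            token_pattern["OP"] = row[f"op_{i}"]
        pattern.append(token_pattern)
        i += 1
    return {"label": row["label"], "pattern": pattern}

raw = {"label": "DRINK", "pattern": [{"LOWER": {"REGEX": "^w"}}, {"LOWER": "milk", "OP": "+"}]}
```

Since every flattened value is a string, duplicate rows can be compared cheaply, which is the point of the flat representation.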
- to_df()[source]¶
Convert to a pd.DataFrame.
- Returns
Copy of the _storage. Each row represents a single entity type pattern. All elements are strings.
- Return type
pd.DataFrame
- to_jsonl(path, sort_by=None)[source]¶
Save to JSONL.
- Parameters
path (pathlib.Path) – File where to save it.
sort_by (None or list) – If None, then no sorting takes place. If a list, then the names of columns along which to sort.
- to_list(sort_by=None)[source]¶
Convert to a list.
- Parameters
sort_by (None or list) – If None, then no sorting takes place. If a list, then the names of columns along which to sort.
- Returns
A list where each element represents one entity type pattern. Note that this list can be directly passed into the EntityRuler.
- Return type
list
- class REModel[source]¶
Bases:
abc.ABC
Abstract interface for relationship extraction models.
Inspired by SciBERT.
- abstract property classes¶
Names of supported relation classes.
- Returns
Names of supported relation classes.
- Return type
list of str
- predict(annotated_sentence, return_prob=False)[source]¶
Predict most likely relation between subject and object.
- Parameters
annotated_sentence (str) – Sentence with exactly 2 entities being annotated accordingly. For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”
return_prob (bool, optional) – If True also returns the confidence of the predicted relation.
- Returns
relation (str) – Relation type.
prob (float, optional) – Confidence of the predicted relation.
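The relationship between `predict` and the abstract `predict_probs` is a simple argmax. A toy sketch of the interface (a plain dict stands in for the pd.Series of per-class probabilities, and the subclass is a hypothetical dummy model):

```python
from abc import ABC, abstractmethod

class ToyREModel(ABC):
    """Minimal stand-in for the REModel interface."""

    @abstractmethod
    def predict_probs(self, annotated_sentence):
        """Return a {class_name: probability} mapping."""

    def predict(self, annotated_sentence, return_prob=False):
        # The predicted relation is the class with the highest probability.
        probs = self.predict_probs(annotated_sentence)
        relation = max(probs, key=probs.get)
        if return_prob:
            return relation, probs[relation]
        return relation

class AlwaysInhibits(ToyREModel):
    """Dummy model returning fixed probabilities for illustration."""

    def predict_probs(self, annotated_sentence):
        return {"INHIBITOR": 0.7, "ACTIVATOR": 0.2, "NONE": 0.1}

model = AlwaysInhibits()
```

Concrete models such as ChemProt only need to supply `predict_probs`; `predict` then comes for free.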
- abstract predict_probs(annotated_sentence)[source]¶
Relation probabilities between subject and object.
Predict per-class probabilities for the relation between subject and object in an annotated sentence.
- Parameters
annotated_sentence (str) – Sentence with exactly 2 entities being annotated accordingly. For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”
- Returns
relation_probs – Per-class probability vector. The index contains the class names, the values are the probabilities.
- Return type
pd.Series
- abstract property symbols¶
Generate dictionary mapping the two entity types to their annotation symbols.
General structure: {'ENTITY_TYPE': ('SYMBOL_LEFT', 'SYMBOL_RIGHT')}. Specific example: {'GGP': ('[[ ', ' ]]'), 'CHEBI': ('<< ', ' >>')}.
Make sure that left and right symbols are not identical.
- class StartWithTheSameLetter[source]¶
Bases:
bluesearch.mining.relation.REModel
Check whether two entities start with the same letter (case insensitive).
This relation is symmetric and works on any entity type.
- property classes¶
Names of supported relation classes.
- property symbols¶
Symbols for annotation.
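The relation itself is trivially expressible as a standalone predicate (a sketch; the actual class operates on annotated sentences through the REModel interface):

```python
def start_with_the_same_letter(ent_1, ent_2):
    """Symmetric toy relation: do the two entity strings start with the
    same letter, ignoring case?"""
    return ent_1[:1].lower() == ent_2[:1].lower()
```

Because the comparison is symmetric, swapping subject and object never changes the prediction, which makes this class handy for sanity-checking the pipeline on any entity types.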
- class TextCollectionWidget(**kwargs)[source]¶
Bases:
ipywidgets.widgets.widget_box.VBox
A widget displaying annotations for a number of texts.
The text can be selected using a slider and the annotation results will be displayed in an AttributeAnnotationTab widget.
- annotate(doc, sent, ent_1, ent_2, etype_symbols)[source]¶
Annotate sentence given two entities.
- Parameters
doc (spacy.tokens.Doc) – The entire document (input text). Note that spacy uses it for absolute referencing.
sent (spacy.tokens.Span) – One sentence from the doc where we look for relations.
ent_1 (spacy.tokens.Span) – The first entity in the sentence. One can get its type by using the label_ attribute.
ent_2 (spacy.tokens.Span) – The second entity in the sentence. One can get its type by using the label_ attribute.
etype_symbols (dict or defaultdict) – Keys represent different entity types ("GGP", "CHEBI") and the values are tuples of size 2. Each tuple holds the starting and ending symbol to wrap the recognized entity with. Each REModel has a symbols property that encodes how its inputs should be annotated.
- Returns
result – String representing an annotated sentence created out of the original one.
- Return type
str
Notes
The implementation is non-trivial because an entity can span multiple words.
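The core idea can be sketched with character offsets instead of spacy Spans (entities here are hypothetical `(start, end, etype)` tuples relative to the sentence):

```python
def annotate(sentence, ent_1, ent_2, etype_symbols):
    """Wrap two entities of a sentence in their type-specific symbols."""
    result = sentence
    # Process the right-most entity first so earlier offsets stay valid.
    for start, end, etype in sorted([ent_1, ent_2], reverse=True):
        left, right = etype_symbols[etype]
        result = result[:start] + left + result[start:end] + right + result[end:]
    return result

symbols = {"CHEBI": ("<< ", " >>"), "GGP": ("[[ ", " ]]")}
sentence = "Cytarabine inhibits DNA polymerase."
annotated = annotate(sentence, (0, 10, "CHEBI"), (20, 34, "GGP"), symbols)
```

Handling the right-most entity first is what keeps multi-word entities and shifting offsets manageable.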
- annotations2df(annots_files, not_entity_symbol='O')[source]¶
Convert prodigy annotations in JSONL format into a pd.DataFrame.
- Parameters
annots_files (str, list of str, path or list of path) – Name of the annotation file(s) to load.
not_entity_symbol (str) – A symbol to use for tokens that are not an entity.
- Returns
final_table – Each row represents one token, the columns are 'source', 'sentence_id', 'class', 'start_char', 'end_char', 'id', 'text'.
- Return type
pd.DataFrame
- check_patterns_agree(model, patterns)[source]¶
Validate whether patterns of an existing model agree with given patterns.
- Parameters
model (spacy.Language) – A model that contains an EntityRuler.
patterns (list) – List of patterns.
- Returns
res – If True, the patterns agree.
- Return type
bool
- Raises
ValueError – The model does not contain an entity ruler or it contains more than 1.
- global2model_patterns(patterns, entity_type)[source]¶
Remap entity types in the patterns to a specific model.
For each entity type in the patterns, check whether the model supports it; if not, relabel the entity type to NaE.
- Parameters
patterns (list) – List of patterns.
entity_type (str) – Entity type detected by a spacy model.
- Returns
adjusted_patterns – Patterns that are supposed to be for a specific spacy model.
- Return type
list
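The relabeling step can be sketched as follows. Note the documented function takes a single `entity_type` string; the sketch generalizes to a set of supported types purely for illustration:

```python
def global2model_patterns(patterns, supported_types, excluded_entity_type="NaE"):
    """Relabel patterns whose entity type the model does not support.

    `supported_types` is a set of entity types a given spacy model can
    detect; unsupported labels are replaced by `excluded_entity_type`.
    """
    return [
        {**p, "label": p["label"] if p["label"] in supported_types else excluded_entity_type}
        for p in patterns
    ]

patterns = [
    {"label": "FOOD", "pattern": [{"LOWER": "oreo"}]},
    {"label": "DRINK", "pattern": [{"LOWER": "milk"}]},
]
adjusted = global2model_patterns(patterns, {"FOOD"})
```

The `NaE` label then lets downstream code exclude these entities uniformly (see excluded_entity_type in run_pipeline).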
- run_pipeline(texts, model_entities, models_relations, debug=False, excluded_entity_type='NaE')[source]¶
Run end-to-end extractions.
- Parameters
texts (iterable) –
The elements in texts are tuples where the first element is the text to be processed and the second element is a dictionary with arbitrary metadata for the text. Each key in this dictionary will be used to construct a new column in the output data frame and the values will appear in the corresponding rows.
Note that if debug=False then the output data frame will have exactly the columns specified by SPECS. That means that some columns produced by the entries in metadata might be dropped, and some empty columns might be added.
model_entities (spacy.lang.en.English) – Spacy model. Note that this model defines entity types.
models_relations (dict) – The keys are pairs (two-element tuples) of entity types (e.g. ('GGP', 'CHEBI')). The first entity type is the subject and the second one is the object. Note that the entity types should correspond to those inside of model_entities. The value is a list of relation extraction model instances, that is, instances of some subclass of REModel.
debug (bool) – If True, the columns do not necessarily match the specification but contain debugging information. If False, they match the specification exactly.
excluded_entity_type (str or None) – If a str, then all entities of that type will be excluded. If None, then no exclusion takes place.
- Returns
The final table. If debug=True then it contains all the metadata. If False then it only contains columns in the official specification.
- Return type
pd.DataFrame
- spacy2df(spacy_model, ground_truth_tokenization, not_entity_symbol='O', excluded_entity_type='NaE')[source]¶
Turn NER of a spacy model into a pd.DataFrame.
- Parameters
spacy_model (spacy.language.Language) – Spacy model that will be used for NER, EntityRuler and Tagger (not tokenization). Note that a Tagger might be necessary for the EntityRuler.
ground_truth_tokenization (list) – List of str (words) representing the ground truth tokenization. This will guarantee that the ground truth dataframe will be aligned with the prediction dataframe.
not_entity_symbol (str) – A symbol to use for tokens that are not a part of any entity. Note that this symbol will be used for all tokens for which the ent_iob_ attribute of spacy.Token is equal to “O”.
excluded_entity_type (str or None) – Entity type that is going to be automatically excluded. Note that it is different from not_entity_symbol since it corresponds to the label_ attribute of spacy.Span objects. If None, then no exclusion takes place.
- Returns
Each row represents one token, the columns are ‘text’ and ‘class’.
- Return type
pd.DataFrame
Notes
One should run annotations2df first in order to obtain the ground_truth_tokenization. If that is the case, then ground_truth_tokenization=prodigy_table['text'].to_list().