bluesearch.mining.relation module
Classes and functions for relation extraction.
- class ChemProt(model_path)
Bases: bluesearch.mining.relation.REModel
Pretrained model extracting 13 relations between chemicals and proteins.
- This model supports the following entity types:
“GGP”
“CHEBI”
- model_
The actual model in the backend.
- Type
allennlp.predictors.text_classifier.TextClassifierPredictor
Notes
This model depends on a package named scibert, which is not listed in setup.py because it introduces dependency conflicts. It can be installed manually with the following command:
pip install git+https://github.com/allenai/scibert
Note that import scibert has a side effect of registering the “text_classifier” model with allennlp. This is done via applying a decorator to a class. For more details see
- property classes
Names of supported relation classes.
- property symbols
Symbols for annotation.
- class REModel
Bases: abc.ABC
Abstract interface for relationship extraction models.
Inspired by SciBERT.
- abstract property classes
Names of supported relation classes.
- Returns
Names of supported relation classes.
- Return type
list of str
- predict(annotated_sentence, return_prob=False)
Predict most likely relation between subject and object.
- Parameters
annotated_sentence (str) – Sentence in which exactly two entities are annotated with the symbols given by the symbols property. For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”
return_prob (bool, optional) – If True, also return the confidence of the predicted relation.
- Returns
relation (str) – Relation type.
prob (float, optional) – Confidence of the predicted relation.
- abstract predict_probs(annotated_sentence)
Relation probabilities between subject and object.
Predict per-class probabilities for the relation between subject and object in an annotated sentence.
- Parameters
annotated_sentence (str) – Sentence in which exactly two entities are annotated with the symbols given by the symbols property. For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”
- Returns
relation_probs – Per-class probability vector. The index contains the class names, the values are the probabilities.
- Return type
pd.Series
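One plausible way for predict to build on predict_probs is to pick the class with the highest probability. The sketch below is an assumption about that relationship, not the library's code: `SketchREModel` and `ConstantModel` are hypothetical, a plain dict stands in for the pd.Series, and the class names and probabilities are made up for illustration.

```python
from abc import ABC, abstractmethod


class SketchREModel(ABC):
    """Hypothetical, simplified stand-in for the REModel interface."""

    @abstractmethod
    def predict_probs(self, annotated_sentence):
        """Return a mapping from class name to probability."""

    def predict(self, annotated_sentence, return_prob=False):
        probs = self.predict_probs(annotated_sentence)
        # The most likely relation is the class with the highest probability.
        relation = max(probs, key=probs.get)
        if return_prob:
            return relation, probs[relation]
        return relation


class ConstantModel(SketchREModel):
    """Toy model that ignores its input (made-up labels and numbers)."""

    def predict_probs(self, annotated_sentence):
        return {"INHIBITOR": 0.7, "ACTIVATOR": 0.2, "NONE": 0.1}


model = ConstantModel()
sentence = "<< Cytarabine >> inhibits [[ DNA polymerase ]]."
relation = model.predict(sentence)
relation_and_prob = model.predict(sentence, return_prob=True)
print(relation, relation_and_prob)
```

With this split, a concrete model only has to implement predict_probs, and predict comes for free.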
- abstract property symbols
Generate dictionary mapping the two entity types to their annotation symbols.
General structure: {'ENTITY_TYPE': ('SYMBOL_LEFT', 'SYMBOL_RIGHT')}
Specific example: {'GGP': ('[[ ', ' ]]'), 'CHEBI': ('<< ', ' >>')}
Make sure that left and right symbols are not identical.
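As a minimal sketch of how such a mapping can be used, the snippet below wraps two entity strings with the symbols from the specific example and assembles an annotated sentence. The helper `wrap` is hypothetical and only illustrates the idea.

```python
# Symbol mapping taken from the specific example in the description.
symbols = {"GGP": ("[[ ", " ]]"), "CHEBI": ("<< ", " >>")}


def wrap(text, etype, symbols):
    """Wrap ``text`` with the left and right symbols of its entity type."""
    left, right = symbols[etype]
    # Identical symbols would make the start and end of an entity ambiguous.
    if left == right:
        raise ValueError(f"Left and right symbols for {etype!r} are identical.")
    return f"{left}{text}{right}"


chemical = wrap("Cytarabine", "CHEBI", symbols)
protein = wrap("DNA polymerase", "GGP", symbols)
sentence = f"{chemical} inhibits {protein}."
print(sentence)
```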
- class StartWithTheSameLetter
Bases: bluesearch.mining.relation.REModel
Check whether two entities start with the same letter (case insensitive).
This relation is symmetric and works on any entity type.
- property classes
Names of supported relation classes.
- property symbols
Symbols for annotation.
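The check performed by StartWithTheSameLetter can be sketched on an annotated sentence as follows. This is an illustration under assumptions, not the class's implementation: the function `same_first_letter` is hypothetical, and it assumes entities are wrapped in either “<< … >>” or “[[ … ]]”.

```python
import re

# Assumed annotation symbols: "<< ... >>" or "[[ ... ]]".
ENTITY_PATTERN = re.compile(r"<< (.+?) >>|\[\[ (.+?) \]\]")


def same_first_letter(annotated_sentence):
    """Return True if both annotated entities start with the same letter."""
    # Each match fills exactly one of the two groups; keep the non-empty one.
    entities = [a or b for a, b in ENTITY_PATTERN.findall(annotated_sentence)]
    if len(entities) != 2:
        raise ValueError("Expected exactly 2 annotated entities.")
    first, second = entities
    # Case-insensitive comparison; swapping the entities gives the same result.
    return first[0].lower() == second[0].lower()


match = same_first_letter("<< dexamethasone >> binds [[ DNA polymerase ]].")
no_match = same_first_letter("<< Cytarabine >> inhibits [[ DNA polymerase ]].")
print(match, no_match)
```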
- annotate(doc, sent, ent_1, ent_2, etype_symbols)
Annotate sentence given two entities.
- Parameters
doc (spacy.tokens.Doc) – The entire document (input text). It is needed because spaCy spans store token positions relative to the whole document rather than to the sentence.
sent (spacy.tokens.Span) – One sentence from the doc where we look for relations.
ent_1 (spacy.tokens.Span) – The first entity in the sentence. One can get its type by using the label_ attribute.
ent_2 (spacy.tokens.Span) – The second entity in the sentence. One can get its type by using the label_ attribute.
etype_symbols (dict or defaultdict) – Keys represent different entity types (“GGP”, “CHEBI”) and the values are tuples of size 2. Each of these tuples represents the starting and ending symbol to wrap the recognized entity with. Each REModel has the symbols property that encodes how its inputs should be annotated.
- Returns
result – String representing an annotated sentence created out of the original one.
- Return type
str
Notes
The implementation is non-trivial because an entity can span multiple words.
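To illustrate the idea without spaCy, the sketch below annotates a sentence using character offsets instead of Doc and Span objects. This is a simplification, not the function documented above: `span_of` and `annotate_by_offsets` are hypothetical helpers, and entities are represented as (start, end, entity_type) tuples.

```python
def span_of(sentence, text, etype):
    """Locate ``text`` in ``sentence`` and return (start, end, etype)."""
    start = sentence.index(text)
    return start, start + len(text), etype


def annotate_by_offsets(sentence, ent_1, ent_2, etype_symbols):
    """Wrap two entities, given as character spans, with their symbols."""
    result = sentence
    # Process the rightmost entity first so that inserting symbols does not
    # shift the offsets of the entity that is still to be wrapped.
    for start, end, etype in sorted([ent_1, ent_2], reverse=True):
        left, right = etype_symbols[etype]
        result = result[:start] + left + result[start:end] + right + result[end:]
    return result


etype_symbols = {"GGP": ("[[ ", " ]]"), "CHEBI": ("<< ", " >>")}
sentence = "Cytarabine inhibits DNA polymerase."
ent_1 = span_of(sentence, "Cytarabine", "CHEBI")
ent_2 = span_of(sentence, "DNA polymerase", "GGP")  # spans two words
annotated = annotate_by_offsets(sentence, ent_1, ent_2, etype_symbols)
print(annotated)
```

Working with whole spans rather than single words is what lets a multi-word entity such as “DNA polymerase” be wrapped as one unit.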