bluesearch.mining.relation module

Classes and functions for relation extraction.

class ChemProt(model_path)

Bases: bluesearch.mining.relation.REModel

Pretrained model extracting 13 relations between chemicals and proteins.

This model supports the following entity types:
  • “GGP”

  • “CHEBI”

model_

The actual model in the backend.

Type

allennlp.predictors.text_classifier.TextClassifierPredictor

Notes

This model depends on the scibert package, which is not listed in setup.py because it introduces dependency conflicts. Install it manually with the following command:

pip install git+https://github.com/allenai/scibert

Note that importing scibert has the side effect of registering the “text_classifier” model with allennlp. The registration happens when a decorator is applied to a class. For more details, see:

https://github.com/allenai/scibert/blob/06793f77d7278898159ed50da30d173cdc8fdea9/scibert/models/text_classifier.py#L14
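The registration-by-decorator pattern described above can be sketched in a few lines. This is a generic illustration only: REGISTRY, register and TextClassifier are hypothetical names and do not reproduce the actual allennlp or scibert code.

```python
# Hypothetical registry, for illustration only; not allennlp's actual code.
REGISTRY = {}


def register(name):
    """Return a class decorator that stores the decorated class under `name`."""

    def decorator(cls):
        REGISTRY[name] = cls
        return cls

    return decorator


@register("text_classifier")
class TextClassifier:
    """Placeholder model class."""


# The decorator runs when the defining module is executed, so a bare import
# of that module is enough to make the name available in the registry.
print(sorted(REGISTRY))
```

Because the registration is a side effect of executing the module, removing an apparently unused import of such a module would also remove the registered name.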

property classes

Names of supported relation classes.

predict_probs(annotated_sentence)

Predict probabilities for the relation.

property symbols

Symbols for annotation.

class REModel

Bases: abc.ABC

Abstract interface for relationship extraction models.

Inspired by SciBERT.

abstract property classes

Names of supported relation classes.

Returns

Names of supported relation classes.

Return type

list of str

predict(annotated_sentence, return_prob=False)

Predict most likely relation between subject and object.

Parameters
  • annotated_sentence (str) – Sentence in which exactly 2 entities are annotated with the symbols of the model (see symbols). For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”

  • return_prob (bool, optional) – If True, also return the confidence of the predicted relation.

Returns

  • relation (str) – Relation type.

  • prob (float, optional) – Confidence of the predicted relation.
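One plausible way to obtain the most likely relation is to take the class with the highest probability from the per-class probabilities. The sketch below is an assumption about that logic, not the library's implementation: predict_from_probs is a hypothetical helper, a plain dict stands in for the pd.Series returned by predict_probs, and the class names and probabilities are made up.

```python
def predict_from_probs(relation_probs, return_prob=False):
    """Pick the most likely relation from a mapping of class name to probability."""
    # The key with the largest associated probability is the predicted relation.
    relation = max(relation_probs, key=relation_probs.get)
    if return_prob:
        return relation, relation_probs[relation]
    return relation


# Made-up class names and probabilities, for illustration only.
probs = {"INHIBITOR": 0.7, "ACTIVATOR": 0.2, "SUBSTRATE": 0.1}

relation = predict_from_probs(probs)
relation_and_prob = predict_from_probs(probs, return_prob=True)
```

With return_prob=True the helper returns a (relation, prob) tuple, mirroring the two documented return values.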

abstract predict_probs(annotated_sentence)

Relation probabilities between subject and object.

Predict per-class probabilities for the relation between subject and object in an annotated sentence.

Parameters

annotated_sentence (str) – Sentence in which exactly 2 entities are annotated with the symbols of the model (see symbols). For example “<< Cytarabine >> inhibits [[ DNA polymerase ]].”

Returns

relation_probs – Per-class probability vector. The index contains the class names, the values are the probabilities.

Return type

pd.Series

abstract property symbols

Generate dictionary mapping the two entity types to their annotation symbols.

General structure: {'ENTITY_TYPE': ('SYMBOL_LEFT', 'SYMBOL_RIGHT')}

Specific example: {'GGP': ('[[ ', ' ]]'), 'CHEBI': ('<< ', ' >>')}

Make sure that left and right symbols are not identical.
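The requirement that left and right symbols differ can be checked mechanically. The sketch below uses the specific example from this section; check_symbols is a hypothetical helper and not part of bluesearch.

```python
def check_symbols(symbols):
    """Raise ValueError if any entity type uses identical left and right symbols."""
    for etype, (left, right) in symbols.items():
        if left == right:
            raise ValueError(
                f"Identical symbols for entity type {etype!r}: {left!r}"
            )


# Mapping taken from the specific example in the documentation.
symbols = {"GGP": ("[[ ", " ]]"), "CHEBI": ("<< ", " >>")}
check_symbols(symbols)  # passes silently because every pair differs
```

A plausible reason for the requirement is that identical symbols would make the start and the end of an annotated entity indistinguishable.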

class StartWithTheSameLetter

Bases: bluesearch.mining.relation.REModel

Check whether two entities start with the same letter (case insensitive).

This relation is symmetric and works on any entity type.

property classes

Names of supported relation classes.

predict_probs(annotated_sentence)

Predict probabilities for the relation.

property symbols

Symbols for annotation.
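The check performed by StartWithTheSameLetter can be illustrated with a small standalone function. This is a sketch under stated assumptions, not the class's actual code: start_with_same_letter is a hypothetical name, and the annotation symbols are assumed to be the “[[ ”, “ ]]” and “<< ”, “ >>” pairs from the REModel.symbols example.

```python
import re

# Assumed symbols, taken from the example given for REModel.symbols.
PATTERN = re.compile(r"\[\[ (.+?) \]\]|<< (.+?) >>")


def start_with_same_letter(annotated_sentence):
    """Return True if the two annotated entities share a first letter, ignoring case."""
    # Each match yields a 2-tuple in which exactly one group is non-empty.
    entities = [a or b for a, b in PATTERN.findall(annotated_sentence)]
    if len(entities) != 2:
        raise ValueError("Expected exactly 2 annotated entities.")
    first, second = entities
    return first[0].lower() == second[0].lower()


different = start_with_same_letter("<< Cytarabine >> inhibits [[ DNA polymerase ]].")
same = start_with_same_letter("<< dopamine >> binds [[ DNA polymerase ]].")
```

Swapping the two entities does not change the result, which matches the statement that the relation is symmetric.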

annotate(doc, sent, ent_1, ent_2, etype_symbols)

Annotate sentence given two entities.

Parameters
  • doc (spacy.tokens.Doc) – The entire document (input text). It is needed because spaCy spans store their positions relative to the whole document, not to the sentence.

  • sent (spacy.tokens.Span) – One sentence from the doc where we look for relations.

  • ent_1 (spacy.tokens.Span) – The first entity in the sentence. One can get its type by using the label_ attribute.

  • ent_2 (spacy.tokens.Span) – The second entity in the sentence. One can get its type by using the label_ attribute.

  • etype_symbols (dict or defaultdict) – Keys represent different entity types (“GGP”, “CHEBI”) and the values are tuples of size 2. Each of these tuples represents the starting and ending symbol to wrap the recognized entity with. Each REModel has the symbols property that encodes how its inputs should be annotated.

Returns

result – String representing an annotated sentence created out of the original one.

Return type

str

Notes

The implementation is non-trivial because an entity can span multiple words.
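The wrapping step can be sketched without spaCy by working on character offsets, which handles multi-word entities naturally. This is a simplified stand-in, not the implementation of annotate: annotate_text is a hypothetical helper, and each entity is given as a (start_char, end_char, entity_type) tuple relative to the sentence. With spaCy objects, such offsets could presumably be derived from the start_char and end_char attributes of the entity and sentence spans.

```python
def annotate_text(sentence, span_1, span_2, etype_symbols):
    """Wrap two non-overlapping entities of `sentence` in their symbols.

    Each span is a (start_char, end_char, entity_type) tuple with offsets
    relative to the start of `sentence`.
    """
    result = sentence
    # Insert from right to left so that the offsets of the earlier entity
    # are not shifted by the symbols added around the later one.
    for start, end, etype in sorted([span_1, span_2], reverse=True):
        left, right = etype_symbols[etype]
        result = result[:start] + left + result[start:end] + right + result[end:]
    return result


sentence = "Cytarabine inhibits DNA polymerase."
symbols = {"GGP": ("[[ ", " ]]"), "CHEBI": ("<< ", " >>")}
annotated = annotate_text(sentence, (0, 10, "CHEBI"), (20, 34, "GGP"), symbols)
print(annotated)
```

Processing the spans in reverse order keeps the sketch short; an alternative would be to build the result from left to right while tracking the accumulated shift.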