bluesearch.mining.pipeline module

Complete pipeline to mine entities, relations, and attributes from text.

run_pipeline(texts, model_entities, models_relations, debug=False, excluded_entity_type='NaE')

Run end-to-end extractions.

Parameters
  • texts (iterable) –

    The elements in texts are tuples where the first element is the text to be processed and the second element is a dictionary with arbitrary metadata for the text. Each key in this dictionary will be used to construct a new column in the output data frame and the values will appear in the corresponding rows.

    Note that if debug=False then the output data frame will have exactly the columns specified by SPECS. That means that some columns produced by the entries in metadata might be dropped, and some empty columns might be added.

  • model_entities (spacy.lang.en.English) – spaCy model. Note that this model defines the entity types.

  • models_relations (dict) – The keys are pairs (two-element tuples) of entity types, e.g. ('GGP', 'CHEBI'). The first entity type is the subject and the second one is the object. Note that the entity types should correspond to those defined by model_entities. Each value is a list of relation extraction models, that is, instances of some subclass of REModel.

  • debug (bool) – If True, the columns do not necessarily match the specification, but they contain debugging information. If False, the columns match the specification exactly.

  • excluded_entity_type (str or None) – If a str, then all entities whose type equals excluded_entity_type will be excluded. If None, no entities are excluded.
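
As a rough illustration of the argument shapes described above, the following Python sketch builds texts and models_relations by hand. DummyREModel, validate_inputs, and is_excluded are hypothetical helpers that are not part of bluesearch, and the example sentences and metadata keys are made up. The exclusion check is only an assumption about how excluded_entity_type is applied.

```python
class DummyREModel:
    """Hypothetical stand-in for an instance of some REModel subclass."""

    def __init__(self, name):
        self.name = name


# Each element of `texts` is a (text, metadata) tuple.
# The metadata keys ("paper_id", "section") are made up for illustration.
texts = [
    ("Glucose binds to hexokinase.", {"paper_id": 1, "section": "Abstract"}),
    ("Insulin lowers blood glucose.", {"paper_id": 2, "section": "Results"}),
]

# Keys are (subject_type, object_type) pairs; values are lists of models.
models_relations = {
    ("GGP", "CHEBI"): [DummyREModel("re_model_a")],
    ("CHEBI", "GGP"): [DummyREModel("re_model_b"), DummyREModel("re_model_c")],
}


def validate_inputs(texts, models_relations):
    """Check the documented shapes and return the sorted metadata keys."""
    metadata_keys = set()
    for text, metadata in texts:
        if not isinstance(text, str) or not isinstance(metadata, dict):
            raise TypeError("each element must be a (str, dict) tuple")
        metadata_keys.update(metadata)
    for key, models in models_relations.items():
        if not (isinstance(key, tuple) and len(key) == 2):
            raise ValueError("keys must be (subject_type, object_type) pairs")
        if not isinstance(models, list):
            raise TypeError("values must be lists of relation models")
    return sorted(metadata_keys)


def is_excluded(entity_type, excluded_entity_type="NaE"):
    """Assumed exclusion rule: drop entities whose type equals the given str."""
    return excluded_entity_type is not None and entity_type == excluded_entity_type


metadata_columns = validate_inputs(texts, models_relations)

# With bluesearch installed and a spaCy model loaded as `model_entities`,
# the call itself would then look like:
# df = run_pipeline(texts, model_entities, models_relations)
```

Here metadata_columns lists the metadata keys that would each give rise to a column in the output data frame when debug=True.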

Returns

The final table. If debug=True, it contains all the metadata. If debug=False, it contains only the columns in the official specification.

Return type

pd.DataFrame
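
The effect of debug on the returned columns can be sketched without the library. The snippet below is a minimal pure-Python approximation that uses lists of dicts in place of a pd.DataFrame. SPEC_COLUMNS and finalize_rows are hypothetical names, and the column names are invented for illustration; the real SPECS in bluesearch may differ.

```python
# Hypothetical specification; not the actual SPECS used by the library.
SPEC_COLUMNS = ["entity", "entity_type", "paper_id"]


def finalize_rows(rows, debug=False, spec_columns=SPEC_COLUMNS):
    """Return rows unchanged if debug, else restricted to the spec columns."""
    if debug:
        # Keep every column, including metadata and debugging information.
        return [dict(row) for row in rows]
    # Drop columns outside the spec and fill missing spec columns with None.
    return [{col: row.get(col) for col in spec_columns} for row in rows]


rows = [
    {"entity": "glucose", "entity_type": "CHEBI", "section": "Abstract"},
]

strict_rows = finalize_rows(rows)             # "section" dropped, "paper_id" added as None
debug_rows = finalize_rows(rows, debug=True)  # all original columns kept
```

This mirrors the note on texts: with debug=False a metadata column outside the specification ("section" here) is dropped, and a specification column with no data ("paper_id" here) appears empty.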