bluesearch.mining.eval module

Classes and functions for evaluating the predictions of mining models.

annotations2df(annots_files, not_entity_symbol='O')[source]

Convert Prodigy annotations in JSONL format into a pd.DataFrame.

Parameters
  • annots_files (str, list of str, path or list of path) – Name of the annotation file(s) to load.

  • not_entity_symbol (str) – A symbol to use for tokens that are not an entity.

Returns

final_table – Each row represents one token; the columns are ‘source’, ‘sentence_id’, ‘class’, ‘start_char’, ‘end_char’, ‘id’, ‘text’.

Return type

pd.DataFrame
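
To make the conversion concrete, the sketch below turns one Prodigy-style NER record into one row per token with an IOB2 class. It is a plain-Python approximation, not the bluesearch implementation. The helper name record_to_rows is hypothetical, the record layout (tokens with text/start/end/id, spans with token_start/token_end/label, and an inclusive token_end) is assumed from the usual Prodigy NER export, and the ‘source’ and ‘sentence_id’ columns are left out.

```python
import json


def record_to_rows(line, not_entity_symbol="O"):
    """Convert one JSONL line (assumed Prodigy NER layout) into token rows."""
    record = json.loads(line)
    tokens = record["tokens"]
    # Start with every token marked as "not an entity".
    classes = [not_entity_symbol] * len(tokens)
    for span in record.get("spans", []):
        first, last = span["token_start"], span["token_end"]  # assumed inclusive
        classes[first] = f"B-{span['label']}"
        for i in range(first + 1, last + 1):
            classes[i] = f"I-{span['label']}"
    return [
        {
            "class": cls,
            "start_char": tok["start"],
            "end_char": tok["end"],
            "id": tok["id"],
            "text": tok["text"],
        }
        for tok, cls in zip(tokens, classes)
    ]


# Hypothetical record, written by hand for illustration.
example_line = json.dumps(
    {
        "text": "Acute lung injury occurs",
        "tokens": [
            {"text": "Acute", "start": 0, "end": 5, "id": 0},
            {"text": "lung", "start": 6, "end": 10, "id": 1},
            {"text": "injury", "start": 11, "end": 17, "id": 2},
            {"text": "occurs", "start": 18, "end": 24, "id": 3},
        ],
        "spans": [{"token_start": 0, "token_end": 2, "label": "DISEASE"}],
    }
)
rows = record_to_rows(example_line)
```

Here the three tokens of the annotated span receive B-DISEASE, I-DISEASE, I-DISEASE, and the remaining token receives the not_entity_symbol.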

idx2text(tokens, idxs)[source]

Retrieve entities text from a list of tokens and start and end indices.

Parameters
  • tokens (pd.Series[str]) – Tokens obtained from tokenization of a text.

  • idxs (pd.DataFrame[int, int]) – Dataframe with 2 columns, ‘start’ and ‘end’, representing the start and end positions of the entities.

Returns

texts – Texts of each entity identified by the indices provided in input.

Return type

pd.Series[str]
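
The lookup described above can be pictured with plain lists. This is only a sketch under stated assumptions: the helper name spans_to_text is hypothetical, the end index is assumed to be inclusive, and the tokens of an entity are assumed to be joined with a single space. None of these details are confirmed by the documentation above.

```python
def spans_to_text(tokens, spans):
    """Join the tokens covered by each (start, end) pair; end assumed inclusive."""
    return [" ".join(tokens[start:end + 1]) for start, end in spans]


tokens = ["The", "SARS", "virus", "infects", "humans"]
texts = spans_to_text(tokens, [(1, 2), (4, 4)])
```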

iob2idx(iob, etype)[source]

Retrieve start and end indices of entities from annotations in IOB2 format.

Parameters
  • iob (pd.Series[str]) – Annotations in the IOB2 format. Elements of the pd.Series should be either ‘O’, ‘B-ENTITY_TYPE’, or ‘I-ENTITY_TYPE’, where ‘ENTITY_TYPE’ is the name of some entity type.

  • etype (str) – Name of the entity type of interest.

Returns

idxs – Dataframe with 2 columns, ‘start’ and ‘end’, representing start and end position of the entities of the specified entity type.

Return type

pd.DataFrame[int, int]
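
As an illustration of what extracting indices from IOB2 annotations involves, the sketch below scans a plain list of tags. It is not the library's code: the helper name iob_to_spans is hypothetical, positions are 0-based list offsets, and the end position is assumed to be inclusive.

```python
def iob_to_spans(iob, etype):
    """Return (start, end) pairs for entities of one type; end assumed inclusive."""
    spans = []
    i = 0
    while i < len(iob):
        if iob[i] == f"B-{etype}":
            end = i
            # Extend the entity over the following I- tags of the same type.
            while end + 1 < len(iob) and iob[end + 1] == f"I-{etype}":
                end += 1
            spans.append((i, end))
            i = end + 1
        else:
            i += 1
    return spans


iob = ["O", "B-DISEASE", "I-DISEASE", "O", "B-DRUG", "B-DISEASE"]
disease_spans = iob_to_spans(iob, "DISEASE")
drug_spans = iob_to_spans(iob, "DRUG")
```

Two adjacent B- tags start two separate entities, which is the property that distinguishes IOB2 from the original IOB scheme.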

ner_confusion_matrix(iob_true: pandas.core.series.Series, iob_pred: pandas.core.series.Series, normalize: Optional[str] = None, mode: str = 'entity') pandas.core.frame.DataFrame[source]

Compute confusion matrix to evaluate the accuracy of a NER model.

Evaluation is performed according to the definitions of “errors” from [1].

Parameters
  • iob_true – Ground truth (correct) IOB2 annotations.

  • iob_pred – Predicted IOB2 annotations.

  • normalize – One of “true”, “pred”, “all”, or None. Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, the confusion matrix will not be normalized.

  • mode – Evaluation mode, either ‘entity’ or ‘token’. Note that an entity can span several tokens.

Returns

cm – Dataframe where the index contains the ground truth entity types and the columns contain the predicted entity types.

Return type

pd.DataFrame

References

[1] Segura-Bedmar et al. 2013, “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts”, https://e-archivo.uc3m.es/handle/10016/20455
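
A rough picture of a token-level confusion matrix with row normalization is sketched below. It covers only a per-token comparison and does not implement the entity-level error definitions of [1]. The helper names are hypothetical, and it is an assumption that the B-/I- prefixes are stripped before labels are compared.

```python
from collections import Counter


def strip_prefix(tag):
    """Map 'B-DRUG' or 'I-DRUG' to 'DRUG'; leave 'O' unchanged."""
    return tag if tag == "O" else tag.split("-", 1)[1]


def token_confusion_counts(iob_true, iob_pred):
    """Count (true label, predicted label) pairs token by token."""
    return Counter(
        (strip_prefix(t), strip_prefix(p)) for t, p in zip(iob_true, iob_pred)
    )


def normalize_rows(counts):
    """Divide each count by the total of its true label (its row)."""
    row_totals = Counter()
    for (true_label, _), n in counts.items():
        row_totals[true_label] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}


iob_true = ["B-DRUG", "I-DRUG", "O", "B-DISEASE"]
iob_pred = ["B-DRUG", "O", "O", "B-DRUG"]
counts = token_confusion_counts(iob_true, iob_pred)
normalized = normalize_rows(counts)
```

Row normalization here plays the role of normalize="true": every row of true labels sums to 1.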

ner_errors(iob_true: pandas.core.series.Series, iob_pred: pandas.core.series.Series, tokens: pandas.core.series.Series, mode: str = 'entity', etypes_map: Optional[dict] = None, return_dict: bool = False) Union[str, collections.OrderedDict][source]

Build a summary report for the named entity recognition.

False positives and false negatives for each entity type are collected. Evaluation is performed according to the definitions of “errors” from [1].

Parameters
  • iob_true – Ground truth (correct) IOB2 annotations.

  • iob_pred – Predicted IOB2 annotations.

  • tokens – Tokens obtained from tokenization of a text.

  • mode – Evaluation mode, either ‘entity’ or ‘token’. Note that an entity can span several tokens.

  • etypes_map – Dictionary mapping entity type names in the ground truth annotations to the corresponding entity type names in the predicted annotations. Useful when entity types have different names in iob_true and iob_pred, e.g. ORGANISM in ground truth and TAXON in predictions.

  • return_dict – If True, return output as dict.

Returns

report – Text summary of the false positives and false negatives for each entity type. A dictionary is returned instead if return_dict is True. The dictionary has the following structure:

{'entity_type 1': {'false_neg': [entity, entity, ...],
                   'false_pos': [entity, entity, ...]},
 'entity_type 2': { ... },
  ...
}

Return type

Union[str, OrderedDict]

References

[1] Segura-Bedmar et al. 2013, “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts”, https://e-archivo.uc3m.es/handle/10016/20455
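
The dictionary structure above can be reproduced on a toy input with the sketch below. It assumes strict matching, where a predicted entity counts as correct only if both boundaries agree with the ground truth; the definitions in [1] may treat partial overlaps differently. The helper names extract_entities and entity_errors are hypothetical.

```python
def extract_entities(iob, tokens, etype):
    """Return (start, end, text) triples for one entity type; end is inclusive."""
    entities = []
    i = 0
    while i < len(iob):
        if iob[i] == f"B-{etype}":
            end = i
            while end + 1 < len(iob) and iob[end + 1] == f"I-{etype}":
                end += 1
            entities.append((i, end, " ".join(tokens[i:end + 1])))
            i = end + 1
        else:
            i += 1
    return entities


def entity_errors(iob_true, iob_pred, tokens, etypes):
    """Collect false negatives and false positives per entity type (strict match)."""
    report = {}
    for etype in etypes:
        true_ents = set(extract_entities(iob_true, tokens, etype))
        pred_ents = set(extract_entities(iob_pred, tokens, etype))
        report[etype] = {
            "false_neg": [text for _, _, text in sorted(true_ents - pred_ents)],
            "false_pos": [text for _, _, text in sorted(pred_ents - true_ents)],
        }
    return report


tokens = ["Aspirin", "treats", "severe", "headache", "quickly"]
iob_true = ["B-DRUG", "O", "B-DISEASE", "I-DISEASE", "O"]
iob_pred = ["B-DRUG", "O", "O", "B-DISEASE", "B-DISEASE"]
errors = entity_errors(iob_true, iob_pred, tokens, ["DISEASE", "DRUG"])
```

Under strict matching, the truncated prediction ‘headache’ is counted as a false positive and the full entity ‘severe headache’ as a false negative.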

ner_report(iob_true: pandas.core.series.Series, iob_pred: pandas.core.series.Series, mode: str = 'entity', etypes_map: Optional[dict] = None, return_dict: bool = False) Union[str, collections.OrderedDict][source]

Build a summary report showing the main NER evaluation metrics.

Evaluation is performed according to the definitions of “errors” from [1].

Parameters
  • iob_true (pd.Series[str]) – Ground truth (correct) IOB2 annotations.

  • iob_pred (pd.Series[str]) – Predicted IOB2 annotations.

  • mode (str, optional) – Evaluation mode, either ‘entity’ or ‘token’. Note that an entity can span several tokens.

  • etypes_map (dict, optional) – Dictionary mapping entity type names in the ground truth annotations to the corresponding entity type names in the predicted annotations. Useful when entity types have different names in iob_true and iob_pred, e.g. ORGANISM in ground truth and TAXON in predictions.

  • return_dict (bool, optional) – If True, return output as dict.

Returns

report – Text summary of the precision, recall, and F1 score for each entity type. A dictionary is returned instead if return_dict is True. The dictionary has the following structure:

{'entity_type 1': {'precision': 0.5,
                   'recall': 1.0,
                   'f1-score': 0.67,
                   'support': 1},
 'entity_type 2': { ... },
  ...
}

Return type

Union[str, OrderedDict]

References

[1] Segura-Bedmar et al. 2013, “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts”, https://e-archivo.uc3m.es/handle/10016/20455
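
The numbers in the example dictionary follow from the standard definitions of precision, recall, and F1. The sketch below computes them for a single entity type from sets of (start, end) spans. Strict span matching is an assumption, as is taking ‘support’ to be the number of ground truth entities, and the helper name span_metrics is hypothetical.

```python
def span_metrics(true_spans, pred_spans):
    """Precision, recall, F1 and support for one entity type (strict match)."""
    true_spans, pred_spans = set(true_spans), set(pred_spans)
    true_pos = len(true_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(true_spans) if true_spans else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1-score": f1,
        "support": len(true_spans),
    }


# One true entity, two predictions, one of which matches exactly.
metrics = span_metrics({(1, 2)}, {(1, 2), (5, 5)})
```

With one true entity and two predictions of which one is correct, precision is 0.5, recall is 1.0, and F1 rounds to 0.67, as in the structure shown above.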

remove_punctuation(df)[source]

Remove punctuation from a dataframe with tokens and entity annotations.

Important: this function should be called only after all the annotations have been loaded by calling annotations2df() and spacy2df().

Parameters

df (pd.DataFrame) – DataFrame with tokens and annotations, can be generated calling annotations2df() and spacy2df(). Should include a column “text” containing one token per row, and one or more columns of annotations in IOB2 format named as “class_XXX”.

Returns

df_cleaned – DataFrame with removed punctuation.

Return type

pd.DataFrame
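
One subtlety of dropping punctuation is that a removed token may carry the B- tag of an entity. The sketch below shows one way to keep the annotations valid IOB2 in that case, by promoting the next I- tag of the same type to B-. Whether remove_punctuation() behaves this way is an assumption; the helper names are hypothetical, and a single annotation column is handled instead of several “class_XXX” columns.

```python
import string


def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)


def drop_punctuation(tokens, iob):
    """Remove punctuation tokens while keeping the tags valid IOB2."""
    kept_tokens, kept_iob = [], []
    pending = None  # entity type whose B- token was dropped
    for token, tag in zip(tokens, iob):
        if is_punctuation(token):
            if tag.startswith("B-"):
                pending = tag[2:]
            elif tag != f"I-{pending}":
                pending = None
            continue
        if pending is not None and tag == f"I-{pending}":
            tag = f"B-{pending}"  # promote: the entity now starts here
        pending = None
        kept_tokens.append(token)
        kept_iob.append(tag)
    return kept_tokens, kept_iob


tokens = ["(", "COVID", "-", "19", ")", "spreads", "."]
iob = ["B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O", "O"]
clean_tokens, clean_iob = drop_punctuation(tokens, iob)
```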

spacy2df(spacy_model, ground_truth_tokenization, not_entity_symbol='O', excluded_entity_type='NaE')[source]

Turn the NER output of a spaCy model into a pd.DataFrame.

Parameters
  • spacy_model (spacy.language.Language) – spaCy model that will be used for NER, EntityRuler and Tagger (not tokenization). Note that a Tagger might be necessary for a tag-based EntityRuler.

  • ground_truth_tokenization (list) – List of str (words) representing the ground truth tokenization. This will guarantee that the ground truth dataframe will be aligned with the prediction dataframe.

  • not_entity_symbol (str) – A symbol to use for tokens that are not a part of any entity. Note that this symbol will be used for all tokens for which the ent_iob_ attribute of spacy.Token is equal to “O”.

  • excluded_entity_type (str or None) – Entity type that is automatically excluded. Note that it differs from not_entity_symbol since it corresponds to the label_ attribute of spacy.Span objects. If None, no exclusion takes place.

Returns

Each row represents one token; the columns are ‘text’ and ‘class’.

Return type

pd.DataFrame

Notes

Run annotations2df() first to obtain the ground truth tokenization. If its output is stored in prodigy_table, then ground_truth_tokenization=prodigy_table['text'].to_list().

unique_etypes(iob, return_counts=False, mode='entity')[source]

Return the sorted unique entity types for annotations in IOB2 format.

Parameters
  • iob (pd.Series[str]) – Annotations in the IOB2 format. Elements of the pd.Series should be either ‘O’, ‘B-ENTITY_TYPE’, or ‘I-ENTITY_TYPE’, where ‘ENTITY_TYPE’ is the name of some entity type.

  • return_counts (bool, optional) – If True, also return the number of times each unique entity type appears in the input.

  • mode (str, optional) – Evaluation mode, either ‘entity’ or ‘token’. Note that an entity can span several tokens.

Returns

  • unique (list[str]) – The sorted unique entity types.

  • unique_counts (list[int], optional) – The number of times each of the unique entity types comes up in the input. Only provided if return_counts is True.
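
The difference between the two modes shows up in the counts. The sketch below assumes that ‘entity’ mode counts one occurrence per B- tag and ‘token’ mode counts every B- and I- tag; this reading is not confirmed by the documentation above, and the helper name count_etypes is hypothetical.

```python
from collections import Counter


def count_etypes(iob, mode="entity"):
    """Return sorted entity types and their counts for a list of IOB2 tags."""
    if mode == "entity":
        prefixes = ("B-",)  # one count per entity
    elif mode == "token":
        prefixes = ("B-", "I-")  # one count per token inside an entity
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    counts = Counter(tag[2:] for tag in iob if tag.startswith(prefixes))
    unique = sorted(counts)
    return unique, [counts[etype] for etype in unique]


iob = ["B-DRUG", "I-DRUG", "O", "B-DISEASE", "B-DRUG"]
unique_entity, counts_entity = count_etypes(iob, mode="entity")
unique_token, counts_token = count_etypes(iob, mode="token")
```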