bluesearch.mining.eval module¶
Classes and functions for evaluating the predictions of mining models.
- annotations2df(annots_files, not_entity_symbol='O')[source]¶
Convert prodigy annotations in JSONL format into a pd.DataFrame.
- Parameters
annots_files (str, list of str, path or list of path) – Name of the annotation file(s) to load.
not_entity_symbol (str) – A symbol to use for tokens that are not an entity.
- Returns
final_table – Each row represents one token; the columns are ‘source’, ‘sentence_id’, ‘class’, ‘start_char’, ‘end_char’, ‘id’, ‘text’.
- Return type
pd.DataFrame
- idx2text(tokens, idxs)[source]¶
Retrieve the text of entities from a list of tokens and the entities' start and end indices.
- Parameters
tokens (pd.Series[str]) – Tokens obtained from tokenization of a text.
idxs (pd.DataFrame[int, int]) – Dataframe with 2 columns, ‘start’ and ‘end’, representing start and end position of the entities of the specified entity type.
- Returns
texts – Texts of each entity identified by the indices provided in input.
- Return type
pd.Series[str]
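As an illustration of the idea behind idx2text, here is a minimal pure-Python sketch. It is not the bluesearch implementation: it uses plain lists instead of pandas objects, the helper name spans_to_text is hypothetical, and it assumes that ‘end’ is inclusive and that the tokens of an entity are joined with a single space. Neither assumption is stated in the documentation above.

```python
def spans_to_text(tokens, spans):
    """Return the text of each (start, end) span of tokens.

    Assumptions (not confirmed by the documentation): `end` is inclusive
    and tokens are joined with a single space.
    """
    return [" ".join(tokens[start:end + 1]) for start, end in spans]


tokens = ["Acetylsalicylic", "acid", "inhibits", "COX-1", "."]
spans = [(0, 1), (3, 3)]
print(spans_to_text(tokens, spans))  # ['Acetylsalicylic acid', 'COX-1']
```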
- iob2idx(iob, etype)[source]¶
Retrieve start and end indices of entities from annotations in IOB2 format.
- Parameters
iob (pd.Series[str]) – Annotations in the IOB2 format. Elements of the pd.Series should be either ‘O’, ‘B-ENTITY_TYPE’, or ‘I-ENTITY_TYPE’, where ‘ENTITY_TYPE’ is the name of some entity type.
etype (str) – Name of the entity type of interest.
- Returns
idxs – Dataframe with 2 columns, ‘start’ and ‘end’, representing start and end position of the entities of the specified entity type.
- Return type
pd.DataFrame[int, int]
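The conversion from IOB2 tags to entity indices can be sketched in pure Python as follows. This is an illustration only, not the bluesearch implementation: the helper name iob_to_spans is hypothetical, plain lists stand in for pd.Series, and ‘end’ is assumed to be inclusive (the index of the last token of the entity), which the documentation above does not specify.

```python
def iob_to_spans(iob, etype):
    """Return (start, end) token indices of the entities of type `etype`.

    Assumption: `end` is inclusive, i.e. the index of the last token.
    """
    spans = []
    start = None  # start index of the entity currently being read, if any
    for i, tag in enumerate(iob):
        # Anything other than a continuation tag closes the open entity.
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i - 1))
            start = None
        # A 'B-' tag of the requested type opens a new entity.
        if tag == f"B-{etype}":
            start = i
    if start is not None:  # entity running until the end of the sequence
        spans.append((start, len(iob) - 1))
    return spans


iob = ["B-DRUG", "I-DRUG", "O", "B-GENE", "B-DRUG", "B-DRUG"]
print(iob_to_spans(iob, "DRUG"))  # [(0, 1), (4, 4), (5, 5)]
```

Note that two consecutive ‘B-DRUG’ tags yield two separate single-token entities, which is what distinguishes IOB2 from schemes where ‘B-’ is optional.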
- ner_confusion_matrix(iob_true: pandas.core.series.Series, iob_pred: pandas.core.series.Series, normalize: Optional[str] = None, mode: str = 'entity') pandas.core.frame.DataFrame [source]¶
Compute confusion matrix to evaluate the accuracy of a NER model.
Evaluation is performed according to the definitions of “errors” from [1].
- Parameters
iob_true – Ground truth (correct) IOB2 annotations.
iob_pred – Predicted IOB2 annotations.
normalize – One of “true”, “pred”, “all”, or None. Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, the confusion matrix will not be normalized.
mode – Evaluation mode. One of ‘entity’, ‘token’: notice that an ‘entity’ can span several tokens.
- Returns
cm – Dataframe where the index contains the ground truth entity types and the columns contain the predicted entity types.
- Return type
pd.DataFrame
References
[1] Segura-Bedmar et al. 2013, “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts”, https://e-archivo.uc3m.es/handle/10016/20455
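To make the row and column normalization concrete, here is a minimal pure-Python sketch of a token-level confusion matrix. It is not the bluesearch implementation and does not reproduce the entity-level definitions from [1]. The helper names strip_prefix and token_confusion are hypothetical, nested dicts stand in for the pd.DataFrame, and including ‘O’ as a row and column is an assumption.

```python
from collections import Counter


def strip_prefix(tag):
    """Map 'B-DRUG' / 'I-DRUG' to 'DRUG' and leave 'O' unchanged."""
    return tag if tag == "O" else tag.split("-", 1)[1]


def token_confusion(iob_true, iob_pred, normalize=None):
    """Count (true type, predicted type) pairs token by token.

    Returns a nested dict: matrix[true_type][pred_type].
    Only normalize=None and normalize="true" are sketched here.
    """
    pairs = Counter(zip(map(strip_prefix, iob_true), map(strip_prefix, iob_pred)))
    labels = sorted({label for pair in pairs for label in pair})
    matrix = {t: {p: pairs[(t, p)] for p in labels} for t in labels}
    if normalize == "true":  # every non-empty row sums to 1
        for row in matrix.values():
            total = sum(row.values())
            for p in row:
                row[p] = row[p] / total if total else 0.0
    return matrix


iob_true = ["B-DRUG", "I-DRUG", "O", "B-GENE"]
iob_pred = ["B-DRUG", "O", "O", "B-DRUG"]
print(token_confusion(iob_true, iob_pred))
print(token_confusion(iob_true, iob_pred, normalize="true"))
```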
- ner_errors(iob_true: pandas.core.series.Series, iob_pred: pandas.core.series.Series, tokens: pandas.core.series.Series, mode: str = 'entity', etypes_map: Optional[dict] = None, return_dict: bool = False) Union[str, collections.OrderedDict] [source]¶
Build a summary report for the named entity recognition.
False positives and false negatives for each entity type are collected. Evaluation is performed according to the definitions of “errors” from [1].
- Parameters
iob_true – Ground truth (correct) IOB2 annotations.
iob_pred – Predicted IOB2 annotations.
tokens – Tokens obtained from tokenization of a text.
mode – Evaluation mode. One of ‘entity’, ‘token’: notice that an ‘entity’ can span several tokens.
etypes_map – Dictionary mapping entity type names in the ground truth annotations to the corresponding entity type names in the predicted annotations. Useful when entity types have different names in iob_true and iob_pred, e.g. ORGANISM in ground truth and TAXON in predictions.
return_dict – If True, return output as dict.
- Returns
report – Text summary of the false positives and false negatives for each entity type. Dictionary returned if return_dict is True. Dictionary has the following structure
{'entity_type 1': {'false_neg': [entity, entity, ...], 'false_pos': [entity, entity, ...]}, 'entity_type 2': { ... }, ... }
- Return type
Union[str, OrderedDict]
References
[1] Segura-Bedmar et al. 2013, “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts”, https://e-archivo.uc3m.es/handle/10016/20455
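The shape of the dictionary output can be illustrated with a minimal pure-Python sketch. It is not the bluesearch implementation: the helper names entity_texts and errors are hypothetical, plain lists stand in for pd.Series, and a prediction is assumed to be correct only when its span matches a ground truth span exactly, which is a simplification of the definitions in [1].

```python
def entity_texts(iob, tokens, etype):
    """Map each (start, end) span of type `etype` to its text.

    Assumptions: `end` is inclusive; tokens are joined with a space.
    """
    out, start = {}, None
    # The trailing "O" is a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(list(iob) + ["O"]):
        if start is not None and tag != f"I-{etype}":
            out[(start, i - 1)] = " ".join(tokens[start:i])
            start = None
        if tag == f"B-{etype}":
            start = i
    return out


def errors(iob_true, iob_pred, tokens, etype):
    """Collect entities missed by the model and entities it invented."""
    true = entity_texts(iob_true, tokens, etype)
    pred = entity_texts(iob_pred, tokens, etype)
    return {
        "false_neg": sorted(text for span, text in true.items() if span not in pred),
        "false_pos": sorted(text for span, text in pred.items() if span not in true),
    }


tokens = ["Aspirin", "inhibits", "COX-1", "in", "mice"]
iob_true = ["B-DRUG", "O", "O", "O", "O"]
iob_pred = ["O", "O", "B-DRUG", "O", "O"]
print({"DRUG": errors(iob_true, iob_pred, tokens, "DRUG")})
# {'DRUG': {'false_neg': ['Aspirin'], 'false_pos': ['COX-1']}}
```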
- ner_report(iob_true: pandas.core.series.Series, iob_pred: pandas.core.series.Series, mode: str = 'entity', etypes_map: Optional[dict] = None, return_dict: bool = False) Union[str, collections.OrderedDict] [source]¶
Build a summary report showing the main NER evaluation metrics.
Evaluation is performed according to the definitions of “errors” from [1].
- Parameters
iob_true (pd.Series[str]) – Ground truth (correct) IOB2 annotations.
iob_pred (pd.Series[str]) – Predicted IOB2 annotations.
mode (str, optional) – Evaluation mode. One of ‘entity’, ‘token’: notice that an ‘entity’ can span several tokens.
etypes_map (dict, optional) – Dictionary mapping entity type names in the ground truth annotations to the corresponding entity type names in the predicted annotations. Useful when entity types have different names in iob_true and iob_pred, e.g. ORGANISM in ground truth and TAXON in predictions.
return_dict (bool, optional) – If True, return output as dict.
- Returns
report – Text summary of the precision, recall, F1 score for each entity type. Dictionary returned if return_dict is True. Dictionary has the following structure
{'entity_type 1': {'precision':0.5, 'recall':1.0, 'f1-score':0.67, 'support':1}, 'entity_type 2': { ... }, ... }
- Return type
Union[str, OrderedDict]
References
[1] Segura-Bedmar et al. 2013, “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts”, https://e-archivo.uc3m.es/handle/10016/20455
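The per-type metrics in the dictionary can be reproduced with a minimal pure-Python sketch. It is not the bluesearch implementation: the helper names spans and prf are hypothetical, plain lists stand in for pd.Series, and a predicted entity is assumed to count as a true positive only when its span matches a ground truth span exactly, which simplifies the definitions in [1].

```python
def spans(iob, etype):
    """Set of (start, end) spans of type `etype`; `end` is inclusive."""
    found, start = set(), None
    # The trailing "O" is a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(list(iob) + ["O"]):
        if start is not None and tag != f"I-{etype}":
            found.add((start, i - 1))
            start = None
        if tag == f"B-{etype}":
            start = i
    return found


def prf(iob_true, iob_pred, etype):
    """Entity-level precision, recall, F1 and support for one entity type."""
    true, pred = spans(iob_true, etype), spans(iob_pred, etype)
    tp = len(true & pred)  # exact span matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall,
            "f1-score": round(f1, 2), "support": len(true)}


iob_true = ["B-DRUG", "I-DRUG", "O", "O"]
iob_pred = ["B-DRUG", "I-DRUG", "O", "B-DRUG"]
print({"DRUG": prf(iob_true, iob_pred, "DRUG")})
# {'DRUG': {'precision': 0.5, 'recall': 1.0, 'f1-score': 0.67, 'support': 1}}
```

Here one of the two predicted entities is correct (precision 0.5) and the single ground truth entity is found (recall 1.0), so support is 1.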
- remove_punctuation(df)[source]¶
Remove punctuation from a dataframe with tokens and entity annotations.
Important: this function should be called only after all the annotations have been loaded by calling annotations2df() and spacy2df().
- Parameters
df (pd.DataFrame) – DataFrame with tokens and annotations, which can be generated by calling annotations2df() and spacy2df(). Should include a column “text” containing one token per row, and one or more columns of annotations in IOB2 format named “class_XXX”.
- Returns
df_cleaned – DataFrame with punctuation removed.
- Return type
pd.DataFrame
- spacy2df(spacy_model, ground_truth_tokenization, not_entity_symbol='O', excluded_entity_type='NaE')[source]¶
Turn the NER predictions of a spaCy model into a pd.DataFrame.
- Parameters
spacy_model (spacy.language.Language) – spaCy model whose NER, EntityRuler and Tagger components will be used (its tokenizer is not used). Note that a Tagger might be necessary for an EntityRuler that relies on tags.
ground_truth_tokenization (list) – List of str (words) representing the ground truth tokenization. This will guarantee that the ground truth dataframe will be aligned with the prediction dataframe.
not_entity_symbol (str) – A symbol to use for tokens that are not a part of any entity. Note that this symbol will be used for all tokens for which the ent_iob_ attribute of spacy.Token is equal to “O”.
excluded_entity_type (str or None) – Entity type that is going to be automatically excluded. Note that it is different from not_entity_symbol since it corresponds to the label_ attribute of spacy.Span objects. If None, then no exclusion will take place.
- Returns
Each row represents one token, the columns are ‘text’ and ‘class’.
- Return type
pd.DataFrame
Notes
Run annotations2df first to obtain the ground truth tokenization. If its output is stored in prodigy_table, then ground_truth_tokenization=prodigy_table[‘text’].to_list().
- unique_etypes(iob, return_counts=False, mode='entity')[source]¶
Return the sorted unique entity types for annotations in IOB2 format.
- Parameters
iob (pd.Series[str]) – Annotations in the IOB2 format. Elements of the pd.Series should be either ‘O’, ‘B-ENTITY_TYPE’, or ‘I-ENTITY_TYPE’, where ‘ENTITY_TYPE’ is the name of some entity type.
return_counts (bool, optional) – If True, also return the number of times each unique entity type appears in the input.
mode (str, optional) – Evaluation mode. One of ‘entity’, ‘token’: notice that an ‘entity’ can span several tokens.
- Returns
unique (list[str]) – The sorted unique entity types.
unique_counts (list[int], optional) – The number of times each of the unique entity types comes up in the input. Only provided if return_counts is True.
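The difference between the two modes can be illustrated with a minimal pure-Python sketch. It is not the bluesearch implementation: the helper name etype_counts is hypothetical, a plain list stands in for pd.Series, and it assumes that ‘entity’ mode counts each entity once (one per ‘B-’ tag) while ‘token’ mode counts every tagged token.

```python
from collections import Counter


def etype_counts(iob, mode="entity"):
    """Return the sorted unique entity types and their counts.

    Assumption: mode='entity' counts one per 'B-' tag,
    mode='token' counts every 'B-' and 'I-' tag.
    """
    prefixes = ("B-",) if mode == "entity" else ("B-", "I-")
    counts = Counter(tag[2:] for tag in iob if tag.startswith(prefixes))
    unique = sorted(counts)
    return unique, [counts[etype] for etype in unique]


iob = ["B-GENE", "I-GENE", "O", "B-DRUG", "B-GENE"]
print(etype_counts(iob, mode="entity"))  # (['DRUG', 'GENE'], [1, 2])
print(etype_counts(iob, mode="token"))   # (['DRUG', 'GENE'], [1, 3])
```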