bluesearch.mining.attribute module

Classes and functions for attribute extraction.

class AttributeAnnotationTab(**kwargs)[source]

Bases: ipywidgets.widgets.widget_selectioncontainer.Tab

A tab widget for displaying attribute extractions.

It is a subclass of the ipywidgets.Tab class and contains the following four tabs:
  • Raw Text

  • Named Entities

  • Attributes

  • Table

set_text(text)[source]

Set the text for the widget.

Parameters

text (str) – The text to assign to this widget.

class AttributeExtractor(core_nlp_url, grobid_quantities_url, ee_model)[source]

Bases: object

Extract and analyze attributes in a given text.

static annotate_quantities(text, measurements)[source]

Annotate measurements in text using HTML/CSS styles.

Parameters
  • text (str) – The text to annotate.

  • measurements (list) – The Grobid measurements for the text. It is assumed that these measurements were obtained by calling get_grobid_measurements(text).

Returns

output – The annotated text.

Return type

IPython.core.display.HTML
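
The annotation step can be pictured with a minimal sketch. It assumes the character spans of the measurements are already known as (start, end) pairs and returns a plain HTML string rather than an IPython.core.display.HTML object. The helper name annotate_spans, the colour, and the choice to skip overlapping spans are all hypothetical and not taken from the library.

```python
import html


def annotate_spans(text, spans, color="#ffe08a"):
    """Wrap every (start, end) character span of text in a styled <span>."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            # Assumption of this sketch: overlapping spans are skipped.
            continue
        # Escape the untouched text before the span, then the span itself.
        pieces.append(html.escape(text[cursor:start]))
        highlighted = html.escape(text[start:end])
        pieces.append(
            f'<span style="background-color: {color}">{highlighted}</span>'
        )
        cursor = end
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


annotated = annotate_spans("Weight was 5 kg.", [(11, 15)])
print(annotated)
```

In a notebook, the resulting string could be passed to IPython.display.HTML for rendering.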

are_linked(measurement, entity, core_nlp_sentence)[source]

Determine if a measurement and an entity are linked.

Parameters
  • measurement (dict) – A Grobid measurement.

  • entity (spacy.tokens.Span) – A spacy named entity.

  • core_nlp_sentence (dict) – A CoreNLP sentence. CoreNLP sentences can be obtained from core_nlp_response["sentences"].

Returns

have_common_parents – Whether or not the entity is linked to the measurement.

Return type

bool

count_measurement_types(measurements)[source]

Count types of all given measurements.

Parameters

measurements (list) – A list of Grobid measurements.

Returns

all_type_counts – The counts of all measurement types.

Return type

collections.Counter
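
As a rough illustration of the return value, the sketch below tallies types with collections.Counter. The helper count_types, the type_fn argument, and the simplified sample dictionaries are hypothetical; the real method resolves the type of each Grobid measurement itself.

```python
from collections import Counter


def count_types(measurements, type_fn):
    """Count the type of every measurement, as resolved by type_fn."""
    return Counter(type_fn(m) for m in measurements)


# Hypothetical, simplified measurements that carry a ready-made type.
measurements = [{"kind": "mass"}, {"kind": "concentration"}, {"kind": "mass"}]
counts = count_types(measurements, type_fn=lambda m: m["kind"])
print(counts.most_common())
```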

extract_attributes(text, linked_attributes_only=True, raw_attributes=False)[source]

Extract attributes from text.

Parameters
  • text (str) – The text for attribute extraction.

  • linked_attributes_only (bool) – If true, only attributes that have an associated named entity are recorded.

  • raw_attributes (bool) – If true, the resulting data frame contains all attribute information in a single column holding the raw Grobid measurements. If false, the raw data frame is processed using process_raw_annotation_df.

Returns

df – A pandas data frame with extracted attributes.

Return type

pd.DataFrame

find_all_parents(dependencies, tokens_d, tokens, parent_fn=None)[source]

Find all parents of a given CoreNLP token.

Parameters
  • dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].

  • tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.

  • tokens (list) – List of token indices for which parents need to be found.

  • parent_fn (function) – An implementation of a parent finding strategy. Currently the available strategies are find_compound_parents and find_nn_parents. The latter seems to perform better.

Returns

parent_ids – A list of all parents found under the given strategy for the tokens provided.

Return type

list

find_nn_parents(dependencies, tokens_d, token_idx)[source]

Parse CoreNLP dependencies to find parents of token.

To link named entities to attributes, parents need to be extracted for both entity tokens and attribute tokens. See extract_attributes for more information.

This is one possible strategy for finding parents of a given token: ascend the dependency tree until a parent of type "NN" is found, and do this for all parents. If, as it seems, each node has at most one parent, then the result will be either one index or no indices.

Parameters
  • dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].

  • tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.

  • token_idx (int) – The index of the token for which parents need to be found.

Returns

parents – A list of parents.

Return type

list
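
The ascent described above might look roughly like the following sketch. It assumes that dependencies are dictionaries with "governor" and "dependent" keys, as in CoreNLP's JSON output, and that tokens_d maps token indices to token dictionaries with a "pos" tag. Treating every tag that starts with "NN" as a match and ignoring governor 0 (ROOT) are assumptions of this sketch, not confirmed behaviour of the method; the sample sentence and its tags are made up.

```python
def find_nn_parents(dependencies, tokens_d, token_idx):
    """Climb the dependency tree from token_idx until noun parents are found."""

    def parents_of(idx):
        # Governor 0 stands for ROOT, i.e. "no parent".
        return [
            dep["governor"]
            for dep in dependencies
            if dep["dependent"] == idx and dep["governor"] != 0
        ]

    found = []
    seen = set()
    stack = parents_of(token_idx)
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if tokens_d[parent]["pos"].startswith("NN"):
            found.append(parent)  # stop climbing on this branch
        else:
            stack.extend(parents_of(parent))  # keep climbing
    return found


# Hypothetical parse of "The mouse weighs 5 kg".
dependencies = [
    {"dep": "ROOT", "governor": 0, "dependent": 3},
    {"dep": "det", "governor": 2, "dependent": 1},
    {"dep": "nsubj", "governor": 3, "dependent": 2},
    {"dep": "nummod", "governor": 5, "dependent": 4},
    {"dep": "obj", "governor": 3, "dependent": 5},
]
tokens_d = {
    1: {"word": "The", "pos": "DT"},
    2: {"word": "mouse", "pos": "NN"},
    3: {"word": "weighs", "pos": "VBZ"},
    4: {"word": "5", "pos": "CD"},
    5: {"word": "kg", "pos": "NN"},
}
print(find_nn_parents(dependencies, tokens_d, 4))
```

In this made-up parse the number "5" resolves to the noun "kg", while "mouse" has only the verb "weighs" above it and therefore yields no noun parent.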

get_core_nlp_analysis(text)[source]

Send a CoreNLP query and return the result.

Parameters

text (str) – The text to analyze with CoreNLP.

Returns

response_json – The CoreNLP response.

Return type

dict

get_entity_tokens(entity, tokens)[source]

Associate a spacy entity to CoreNLP tokens.

Parameters
  • entity (spacy.tokens.Span) – A spacy entity extracted from the text. See extract_attributes for more details.

  • tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given entity.

Return type

list

get_grobid_measurements(text)[source]

Get measurements for text from the Grobid server.

Parameters

text (str) – The text for the query.

Returns

measurements – All Grobid measurements extracted from the given text.

Return type

list_like

get_measurement_tokens(measurement, tokens)[source]

Associate a Grobid measurement to CoreNLP tokens.

See get_quantity_tokens for more details.

Parameters
  • measurement (dict) – A Grobid measurement.

  • tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given measurement.

Return type

list

get_measurement_type(measurement)[source]

Get the type of a Grobid measurement.

For measurements with multiple quantities the most common type is returned. In case of ties the empty type always loses.

Parameters

measurement (dict) – A Grobid measurement.

Returns

measurement_type – The type of the Grobid measurement.

Return type

str
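
The tie-breaking rule can be illustrated with a small sketch that works on a plain list of quantity types. The helper name most_common_type is hypothetical, and how ties between two non-empty types are resolved is not specified above; this sketch simply keeps the one seen first.

```python
from collections import Counter


def most_common_type(quantity_types):
    """Return the most frequent type; on a tie, a non-empty type beats ""."""
    counts = Counter(quantity_types)
    if not counts:
        return ""
    # Compare by count first, then prefer non-empty strings over "".
    best_type, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0] != ""))
    return best_type


print(most_common_type(["", "mass"]))
print(most_common_type(["", "", "mass"]))
```

The first call is a tie, so the empty type loses; in the second call the empty type is strictly more frequent and is therefore returned.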

static get_overlapping_token_ids(start, end, tokens)[source]

Find tokens intersecting the interval [start, end).

CoreNLP breaks a given text down into sentences, and each sentence is broken down into tokens. These can be accessed by response['sentences'][sentence_id]['tokens'].

Each token corresponds to a position in the original text. This method determines which tokens intersect a given slice of this text.

Parameters
  • start (int) – The left boundary of the interval.

  • end (int) – The right boundary of the interval.

  • tokens (list) – The CoreNLP sentence tokens.

Returns

ids – A list of token indices that overlap with the given interval.

Return type

list
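
A minimal sketch of the interval test, assuming each token is a dictionary with the "index", "characterOffsetBegin" and "characterOffsetEnd" keys found in CoreNLP's JSON output, and that token spans are half-open like the query interval. The function below is an illustration, not the library's code.

```python
def overlapping_token_ids(start, end, tokens):
    """Return indices of tokens whose character span intersects [start, end)."""
    ids = []
    for token in tokens:
        t_start = token["characterOffsetBegin"]
        t_end = token["characterOffsetEnd"]
        # Two half-open intervals intersect iff each starts before the other ends.
        if t_start < end and start < t_end:
            ids.append(token["index"])
    return ids


# Hand-written tokens for the text "Weight was 5 kg."
tokens = [
    {"index": 1, "characterOffsetBegin": 0, "characterOffsetEnd": 6},    # Weight
    {"index": 2, "characterOffsetBegin": 7, "characterOffsetEnd": 10},   # was
    {"index": 3, "characterOffsetBegin": 11, "characterOffsetEnd": 12},  # 5
    {"index": 4, "characterOffsetBegin": 13, "characterOffsetEnd": 15},  # kg
    {"index": 5, "characterOffsetBegin": 15, "characterOffsetEnd": 16},  # .
]
print(overlapping_token_ids(11, 15, tokens))
```

The slice [11, 15) covers "5 kg", so only the tokens for "5" and "kg" are reported; the final period starts exactly at 15 and is excluded.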

get_quantity_tokens(quantity, tokens)[source]

Associate a Grobid quantity to CoreNLP tokens.

Both the quantity and the tokens should originate from exactly the same text.

A quantity may be composed of multiple parts, e.g. a number and a unit, and therefore correspond to multiple CoreNLP tokens.

Parameters
  • quantity (dict) – A Grobid quantity.

  • tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given quantity.

Return type

list

static get_quantity_type(quantity)[source]

Get the type of a Grobid quantity.

The top-level Grobid object is a measurement. A measurement can contain one or more quantities.

Some Grobid quantities have a type attached to them, e.g. “mass”, “concentration”, etc. This is the type that is returned. For quantities without a type an empty string is returned.

Parameters

quantity (dict) – A Grobid quantity.

Returns

quantity_type – The type of the quantity.

Return type

str

static iter_parents(dependencies, token_idx)[source]

Iterate over all parents of a token.

It seems that each node has at most one parent, and that parent == 0 means no parent.

Parameters
  • dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].

  • token_idx (int) – The index of the token for which parents need to be iterated.

Yields

parent_idx (int) – The index of a parent token.
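
A generator along these lines could look like the sketch below. It assumes CoreNLP's "governor" and "dependent" keys in each dependency entry; skipping the governor 0 inside the generator is a choice made for this illustration, and the actual method may instead leave that check to its callers.

```python
def iter_parents(dependencies, token_idx):
    """Yield the governor index of every dependency pointing at token_idx."""
    for dep in dependencies:
        # Governor 0 is the artificial ROOT node, i.e. no real parent.
        if dep["dependent"] == token_idx and dep["governor"] != 0:
            yield dep["governor"]


# Hypothetical dependencies for "The mouse weighs 5 kg".
dependencies = [
    {"dep": "ROOT", "governor": 0, "dependent": 3},
    {"dep": "det", "governor": 2, "dependent": 1},
    {"dep": "nsubj", "governor": 3, "dependent": 2},
    {"dep": "nummod", "governor": 5, "dependent": 4},
    {"dep": "obj", "governor": 3, "dependent": 5},
]
print(list(iter_parents(dependencies, 4)))
```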

static iter_quantities(measurement)[source]

Iterate over quantities in a Grobid measurement.

Parameters

measurement (dict) – A Grobid measurement.

Yields

quantity (dict) – A Grobid quantity in the given measurement.
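
For orientation, the sketch below iterates over quantities under an assumed measurement layout: a "type" field of "value", "interval" or "listc", with quantities stored under "quantity", "quantityLeast"/"quantityMost", or "quantities" respectively. This layout is an assumption based on typical grobid-quantities responses and may not cover every case the real method handles.

```python
def iter_quantities(measurement):
    """Yield every quantity dictionary contained in a measurement."""
    kind = measurement.get("type")
    if kind == "value":
        yield measurement["quantity"]
    elif kind == "interval":
        # Either bound of an interval may be missing.
        for key in ("quantityLeast", "quantityMost"):
            if key in measurement:
                yield measurement[key]
    elif kind == "listc":
        yield from measurement.get("quantities", [])


# Hypothetical measurements with only the raw values filled in.
single = {"type": "value", "quantity": {"rawValue": "5"}}
interval = {
    "type": "interval",
    "quantityLeast": {"rawValue": "2"},
    "quantityMost": {"rawValue": "4"},
}
print(list(iter_quantities(single)))
print(list(iter_quantities(interval)))
```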

measurement_to_str(measurement)[source]

Convert a Grobid measurement to string.

Parameters

measurement (dict) – A Grobid measurement.

Returns

quantities – String representations of the quantities in a measurement. If the measurement contains only one quantity, its string representation is returned as is. Otherwise a list of string representations of the quantities is returned.

Return type

list or str
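
The "list or str" return convention can be sketched as follows. The sketch takes an already extracted list of quantities instead of a full measurement, and the string format built from the assumed "rawValue" and "rawUnit" fields is purely illustrative; the actual representation produced by quantity_to_str may differ.

```python
def quantity_to_str(quantity):
    """Join the raw value and unit name of a quantity (illustrative format)."""
    value = quantity.get("rawValue", "")
    unit = quantity.get("rawUnit", {}).get("name", "")
    return f"{value} {unit}".strip()


def quantities_to_str(quantities):
    """Return a single string for one quantity, otherwise a list of strings."""
    strings = [quantity_to_str(q) for q in quantities]
    return strings[0] if len(strings) == 1 else strings


one = quantities_to_str([{"rawValue": "5", "rawUnit": {"name": "kg"}}])
two = quantities_to_str(
    [
        {"rawValue": "2", "rawUnit": {"name": "mM"}},
        {"rawValue": "4", "rawUnit": {"name": "mM"}},
    ]
)
print(one)
print(two)
```

Callers of such a function need to handle both return types, for example with isinstance(result, str).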

process_raw_annotation_df(df, copy=True)[source]

Add standard columns to attribute data frame.

Parameters
  • df (pd.DataFrame) – A data frame with measurements in a raw format. This can be obtained by calling extract_attributes with the parameter raw_attributes=True.

  • copy (bool) – If true then it is guaranteed that the original data frame won’t be modified.

Returns

df – A modified data frame with the raw attribute column replaced by a number of more explicit columns using the standard nomenclature.

Return type

pd.DataFrame

static quantity_to_str(quantity)[source]

Convert a Grobid quantity to string.

Parameters

quantity (dict) – A Grobid quantity.

Returns

result – A string representation of the quantity.

Return type

str

class TextCollectionWidget(**kwargs)[source]

Bases: ipywidgets.widgets.widget_box.VBox

A widget displaying annotations for a number of texts.

The text can be selected using a slider and the annotation results will be displayed in an AttributeAnnotationTab widget.