bluesearch.mining.attribute module

Classes and functions for attribute extraction.

class AttributeAnnotationTab(**kwargs)

Bases: ipywidgets.widgets.widget_selectioncontainer.Tab

A tab widget for displaying attribute extractions.

It is a subclass of the ipywidgets.Tab class and contains the following four tabs:

- Raw Text
- Named Entities
- Attributes
- Table

set_text(text)

Set the text for the widget.

Parameters: text (str) – The text to assign to this widget.

class AttributeExtractor(core_nlp_url, grobid_quantities_url, ee_model)

Bases: object

Extract and analyze attributes in a given text.

static annotate_quantities(text, measurements)

Annotate measurements in text using HTML/CSS styles.

Parameters

text (str) – The text to annotate.
measurements (list) – The Grobid measurements for the text. It is assumed that these measurements were obtained by calling get_grobid_measurements(text).

Returns

output – The annotated text.

Return type

IPython.core.display.HTML
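
The kind of markup this method produces can be illustrated with a minimal sketch. It is not the library's implementation: the helper name highlight_spans, the (start, end) span format, and the CSS style are all assumptions made for illustration, and the spans are assumed not to overlap. The real method works on Grobid measurements and returns an IPython HTML object rather than a string.

```python
import html


def highlight_spans(text, spans, style="background-color: #ffe58f;"):
    """Wrap each (start, end) character span of text in a styled <span>.

    Hypothetical helper; spans are assumed to be non-overlapping.
    """
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        # Escape the untouched text so characters like "<" stay literal.
        parts.append(html.escape(text[cursor:start]))
        parts.append(f'<span style="{style}">{html.escape(text[start:end])}</span>')
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


# Characters 7..11 of this text are "5 mg".
markup = highlight_spans("Dose < 5 mg daily", [(7, 11)])
print(markup)
```

In a notebook, the resulting string could be passed to IPython.display.HTML for rendering.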

are_linked(measurement, entity, core_nlp_sentence)

Determine if a measurement and an entity are linked.

Parameters

measurement (dict) – A Grobid measurement.
entity (spacy.tokens.Span) – A spacy named entity.
core_nlp_sentence (dict) – A CoreNLP sentence. CoreNLP sentences can be obtained from core_nlp_response["sentences"].

Returns

have_common_parents – Whether or not the entity is linked to the measurement.

Return type

bool

count_measurement_types(measurements)

Count types of all given measurements.

Parameters: measurements (list) – A list of Grobid measurements.
Returns: all_type_counts – The counts of all measurement types.
Return type: collections.Counter
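
A tally of this kind can be sketched with collections.Counter. This is an illustration only: the function name count_types, the type_of callable, and the "kind" key in the toy data are hypothetical. In the class itself, the per-measurement type would come from get_measurement_type.

```python
from collections import Counter


def count_types(measurements, type_of):
    """Tally how often each measurement type occurs.

    type_of is a hypothetical callable mapping a measurement to a type string.
    """
    return Counter(type_of(m) for m in measurements)


# Toy measurements; the "kind" key is made up for this illustration.
toy = [{"kind": "mass"}, {"kind": "mass"}, {"kind": "time"}]
counts = count_types(toy, lambda m: m["kind"])
print(counts.most_common())
```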

extract_attributes(text, linked_attributes_only=True, raw_attributes=False)

Extract attributes from text.

Parameters

text (str) – The text for attribute extraction.
linked_attributes_only (bool) – If true, only attributes that have an associated named entity are recorded.
raw_attributes (bool) – If true, the resulting data frame contains all attribute information in a single column of raw Grobid measurements. If false, the raw data frame is processed with process_raw_annotation_df.

Returns

df – A pandas data frame with extracted attributes.

Return type

pd.DataFrame

find_all_parents(dependencies, tokens_d, tokens, parent_fn=None)

Find all parents of a given CoreNLP token.

Parameters

dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.
tokens (list) – List of token indices for which parents need to be found.
parent_fn (function) – An implementation of a parent finding strategy. Currently the available strategies are find_compound_parents and find_nn_parents. The latter seems to perform better.

Returns

parent_ids – A list of all parents found under the given strategy for the tokens provided.

Return type

list

find_nn_parents(dependencies, tokens_d, token_idx)

Parse CoreNLP dependencies to find parents of token.

To link named entities to attributes, the parents of both the entity tokens and the attribute tokens need to be extracted. See extract_attributes for more information.

This is one possible strategy for finding the parents of a given token: ascend the dependency tree until a parent of type "NN" is found, and do this for all parents. If, as it seems, each node has at most one parent, the result is either a single index or no indices.

Parameters

dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.
token_idx (int) – The index of the token for which parents need to be found.

Returns

parents – A list of parents.

Return type

list
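
The ascent described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: it assumes each dependency is a dict with "governor" and "dependent" keys and each token a dict with a "pos" key, as in CoreNLP's JSON output, and it treats a governor of 0 as the root. The name nn_parents is hypothetical.

```python
def nn_parents(dependencies, tokens_d, token_idx):
    """Climb the dependency tree from token_idx until an "NN" ancestor is found."""
    found = []
    visited = set()
    stack = [token_idx]
    while stack:
        current = stack.pop()
        for dep in dependencies:
            parent = dep["governor"]
            # Skip unrelated edges, the artificial root (0), and seen nodes.
            if dep["dependent"] != current or parent == 0 or parent in visited:
                continue
            visited.add(parent)
            if tokens_d[parent]["pos"] == "NN":
                found.append(parent)  # a noun: stop climbing on this branch
            else:
                stack.append(parent)  # not a noun: keep climbing
    return found


# Toy parse of "very high dose": very -> high -> dose -> ROOT.
tokens_d = {
    1: {"word": "very", "pos": "RB"},
    2: {"word": "high", "pos": "JJ"},
    3: {"word": "dose", "pos": "NN"},
}
dependencies = [
    {"dep": "ROOT", "governor": 0, "dependent": 3},
    {"dep": "amod", "governor": 3, "dependent": 2},
    {"dep": "advmod", "governor": 2, "dependent": 1},
]
print(nn_parents(dependencies, tokens_d, 1))
```

The exact comparison with "NN" follows the wording of the description. Matching plural or proper nouns ("NNS", "NNP") would need a prefix test instead.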

get_core_nlp_analysis(text)

Send a CoreNLP query and return the result.

Parameters: text (str) – The text to analyze with CoreNLP.
Returns: response_json – The CoreNLP response.
Return type: dict

get_entity_tokens(entity, tokens)

Associate a spacy entity to CoreNLP tokens.

Parameters

entity (spacy.tokens.Span) – A spacy entity extracted from the text. See extract_attributes for more details.
tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given entity.

Return type

list

get_grobid_measurements(text)

Get measurements for text from the Grobid server.

Parameters: text (str) – The text for the query.
Returns: measurements – All Grobid measurements extracted from the given text.
Return type: list_like

get_measurement_tokens(measurement, tokens)

Associate a Grobid measurement to CoreNLP tokens.

See get_quantity_tokens for more details.

Parameters

measurement (dict) – A Grobid measurement.
tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given measurement.

Return type

list

get_measurement_type(measurement)

Get the type of a Grobid measurement.

For measurements with multiple quantities the most common type is returned. In case of ties the empty type always loses.

Parameters: measurement (dict) – A Grobid measurement.
Returns: measurement_type – The type of the Grobid measurement.
Return type: str
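
The tie-breaking rule can be illustrated with a short sketch. It is not the library's implementation: the function name most_common_type is hypothetical and the input is assumed to be the list of quantity types already extracted from one measurement. How ties between two non-empty types are resolved is not stated in the description, so the sketch simply keeps whichever comes first.

```python
from collections import Counter


def most_common_type(quantity_types):
    """Return the most frequent type; on a tie, the empty type loses."""
    counts = Counter(quantity_types)
    if not counts:
        return ""
    # Rank by count first, then prefer a non-empty type over the empty one.
    best, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0] != ""))
    return best


print(most_common_type(["", "mass"]))      # tie: "mass" wins over ""
print(most_common_type(["", "", "mass"]))  # no tie: "" is most common
```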

static get_overlapping_token_ids(start, end, tokens)

Find tokens intersecting the interval [start, end).

CoreNLP breaks a given text down into sentences, and each sentence is broken down into tokens. These can be accessed by response['sentences'][sentence_id]['tokens'].

Each token corresponds to a position in the original text. This method determines which tokens intersect a given slice of this text.

Parameters

start (int) – The left boundary of the interval.
end (int) – The right boundary of the interval.
tokens (list) – The CoreNLP sentence tokens.

Returns

ids – A list of token indices that overlap with the given interval.

Return type

list
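
The interval test can be sketched as below. This is an illustration under assumptions rather than the library's code: it relies on the "index", "characterOffsetBegin" and "characterOffsetEnd" fields that CoreNLP's JSON output attaches to each token, and the function name overlapping_token_ids is hypothetical.

```python
def overlapping_token_ids(start, end, tokens):
    """Return indices of tokens that intersect the half-open interval [start, end)."""
    ids = []
    for token in tokens:
        t_start = token["characterOffsetBegin"]
        t_end = token["characterOffsetEnd"]
        # Two half-open intervals overlap iff each starts before the other ends.
        if t_start < end and start < t_end:
            ids.append(token["index"])
    return ids


# Tokens for the text "Weight 5 kg", with 1-based indices.
tokens = [
    {"index": 1, "characterOffsetBegin": 0, "characterOffsetEnd": 6},
    {"index": 2, "characterOffsetBegin": 7, "characterOffsetEnd": 8},
    {"index": 3, "characterOffsetBegin": 9, "characterOffsetEnd": 11},
]
ids = overlapping_token_ids(7, 11, tokens)  # the slice "5 kg"
print(ids)
```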

get_quantity_tokens(quantity, tokens)

Associate a Grobid quantity to CoreNLP tokens.

Both the quantity and the tokens should originate from exactly the same text.

A quantity may be composed of multiple parts, e.g. a number and a unit, and therefore correspond to multiple CoreNLP tokens.

Parameters

quantity (dict) – A Grobid quantity.
tokens (list) – CoreNLP tokens.

Returns

ids – A list of CoreNLP token IDs corresponding to the given quantity.

Return type

list

static get_quantity_type(quantity)

Get the type of a Grobid quantity.

The top-level Grobid object is a measurement. A measurement can contain one or more quantities.

Some Grobid quantities have a type attached to them, e.g. “mass”, “concentration”, etc. This is the type that is returned. For quantities without a type an empty string is returned.

Parameters: quantity (dict) – A Grobid quantity.
Returns: quantity_type – The type of the quantity.
Return type: str

static iter_parents(dependencies, token_idx)

Iterate over all parents of a token.

It seems that each node has at most one parent, and that parent == 0 means no parent.

Parameters

dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
token_idx (int) – The index of the token for which parents need to be iterated.

Yields

parent_idx (int) – The index of a parent token.
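
A generator of this kind might look like the sketch below. It is an assumption-laden illustration, not the library's code: it assumes each dependency is a dict with "governor" and "dependent" keys, as in CoreNLP's JSON output, and it skips a governor of 0 on the reading that 0 means no parent. The name parents_of is hypothetical.

```python
def parents_of(dependencies, token_idx):
    """Yield the governor index of every dependency whose dependent is token_idx."""
    for dep in dependencies:
        # A governor of 0 is taken to mean "no parent" (the artificial root).
        if dep["dependent"] == token_idx and dep["governor"] != 0:
            yield dep["governor"]


# Toy parse of "5 kg": "kg" (2) is the root and "5" (1) depends on it.
dependencies = [
    {"dep": "ROOT", "governor": 0, "dependent": 2},
    {"dep": "nummod", "governor": 2, "dependent": 1},
]
print(list(parents_of(dependencies, 1)))
```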

static iter_quantities(measurement)

Iterate over quantities in a Grobid measurement.

Parameters: measurement (dict) – A Grobid measurement.
Yields: quantity (dict) – A Grobid quantity in the given measurement.
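
The sketch below shows one way such an iteration could work. The key names are assumptions about the grobid-quantities JSON output ("quantity" for single values, "quantityLeast" and "quantityMost" for intervals, "quantities" for lists) and should be checked against the Grobid version in use. The function name quantities_in is hypothetical and this is not the library's implementation.

```python
def quantities_in(measurement):
    """Yield every quantity dict found in a Grobid-style measurement."""
    # Single values and interval bounds are stored under their own keys.
    for key in ("quantity", "quantityLeast", "quantityMost"):
        if key in measurement:
            yield measurement[key]
    # Lists of values are stored together under one key.
    yield from measurement.get("quantities", [])


# Toy measurements that mimic the assumed structure.
value_m = {"type": "value", "quantity": {"rawValue": "5"}}
interval_m = {
    "type": "interval",
    "quantityLeast": {"rawValue": "1"},
    "quantityMost": {"rawValue": "2"},
}
print([q["rawValue"] for q in quantities_in(interval_m)])
```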

measurement_to_str(measurement)

Convert a Grobid measurement to string.

Parameters: measurement (dict) – A Grobid measurement.
Returns: quantities – String representations of quantities in a measurement. If the measurement contains only one quantity then its string representation is returned as is. Otherwise a list of string representations of quantities is returned.
Return type: list or str

process_raw_annotation_df(df, copy=True)

Add standard columns to attribute data frame.

Parameters

df (pd.DataFrame) – A data frame with measurements in a raw format. This can be obtained by calling extract_attributes with the parameter raw_attributes=True.
copy (bool) – If true then it is guaranteed that the original data frame won’t be modified.

Returns

df – A modified data frame with the raw attribute column replaced by a number of more explicit columns using the standard nomenclature.

Return type

pd.DataFrame

static quantity_to_str(quantity)

Convert a Grobid quantity to string.

Parameters: quantity (dict) – A Grobid quantity.
Returns: result – A string representation of the quantity.
Return type: str
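
As an illustration of what such a conversion could look like, the sketch below joins a quantity's raw value and unit name. The "rawValue" and "rawUnit" keys are assumptions about the grobid-quantities JSON output, the function name quantity_as_text is hypothetical, and the actual format produced by quantity_to_str may differ.

```python
def quantity_as_text(quantity):
    """Join a quantity's raw value and unit name into one string."""
    value = quantity.get("rawValue", "")
    unit = quantity.get("rawUnit", {}).get("name", "")
    # strip() drops the separating space when the unit (or value) is missing.
    return f"{value} {unit}".strip()


toy_quantity = {"type": "mass", "rawValue": "5", "rawUnit": {"name": "kg"}}
print(quantity_as_text(toy_quantity))
```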

class TextCollectionWidget(**kwargs)

Bases: ipywidgets.widgets.widget_box.VBox

A widget displaying annotations for a number of texts.

The text can be selected using a slider and the annotation results will be displayed in an AttributeAnnotationTab widget.