bluesearch.mining.attribute module¶
Classes and functions for attribute extraction.
- class AttributeAnnotationTab(**kwargs)[source]¶
Bases:
ipywidgets.widgets.widget_selectioncontainer.Tab
A tab widget for displaying attribute extractions.
It is a subclass of the ipywidgets.Tab class and contains the following four tabs: Raw Text, Named Entities, Attributes, and Table.
- class AttributeExtractor(core_nlp_url, grobid_quantities_url, ee_model)[source]¶
Bases:
object
Extract and analyze attributes in a given text.
- static annotate_quantities(text, measurements)[source]¶
Annotate measurements in text using HTML/CSS styles.
- Parameters
text (str) – The text to annotate.
measurements (list) – The Grobid measurements for the text. It is assumed that these measurements were obtained by calling get_grobid_measurements(text).
- Returns
output – The annotated text.
- Return type
IPython.core.display.HTML
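The general idea of annotating text with HTML/CSS styles can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name annotate_spans, the plain (start, end) span format, and the highlight color are all assumptions made for the example, and the spans are assumed not to overlap.

```python
import html


def annotate_spans(text, spans, color="#ffd54f"):
    """Wrap each (start, end) character span of `text` in a styled <span>.

    Hypothetical helper: spans are assumed to be non-overlapping.
    """
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        # Escape the plain text before the span, then the highlighted span.
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<span style="background-color: {color};">'
            f"{html.escape(text[start:end])}</span>"
        )
        cursor = end
    # Escape whatever remains after the last span.
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


# Characters 13..17 of this sentence are "5 mg".
annotated = annotate_spans("The dose was 5 mg daily.", [(13, 17)])
```

In a notebook, the resulting string could be passed to IPython.display.HTML for rendering.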
- are_linked(measurement, entity, core_nlp_sentence)[source]¶
Determine if a measurement and an entity are linked.
- Parameters
measurement (dict) – A Grobid measurement.
entity (spacy.tokens.Span) – A spacy named entity.
core_nlp_sentence (dict) – A CoreNLP sentence. The CoreNLP sentences can be obtained from core_nlp_response["sentences"].
- Returns
have_common_parents – Whether or not the entity is linked to the measurement.
- Return type
bool
- count_measurement_types(measurements)[source]¶
Count types of all given measurements.
- Parameters
measurements (list) – A list of Grobid measurements.
- Returns
all_type_counts – The counts of all measurement types.
- Return type
collections.Counter
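Counting types in this way maps naturally onto collections.Counter. The sketch below is only an illustration: the function count_types, the type_of callable, and the toy "kind" key are hypothetical stand-ins for however the type of each measurement is actually determined.

```python
from collections import Counter


def count_types(measurements, type_of):
    """Count how often each type occurs among the given measurements.

    `type_of` is a hypothetical callable mapping a measurement to its type.
    """
    return Counter(type_of(m) for m in measurements)


# Toy data; real Grobid measurements have a richer structure.
toy_measurements = [{"kind": "mass"}, {"kind": "mass"}, {"kind": "time"}]
counts = count_types(toy_measurements, type_of=lambda m: m["kind"])
```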
- extract_attributes(text, linked_attributes_only=True, raw_attributes=False)[source]¶
Extract attributes from text.
- Parameters
text (str) – The text for attribute extraction.
linked_attributes_only (bool) – If true then only those attributes will be recorded for which there is an associated named entity.
raw_attributes (bool) – If true then the resulting data frame will contain all attribute information in a single column of raw Grobid measurements. If false then the raw data frame will be processed using process_raw_annotation_df.
- Returns
df – A pandas data frame with extracted attributes.
- Return type
pd.DataFrame
- find_all_parents(dependencies, tokens_d, tokens, parent_fn=None)[source]¶
Find all parents of a given CoreNLP token.
- Parameters
dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.
tokens (list) – List of token indices for which parents need to be found.
parent_fn (function) – An implementation of a parent finding strategy. Currently the available strategies are find_compound_parents and find_nn_parents. The latter seems to perform better.
- Returns
parent_ids – A list of all parents found under the given strategy for the tokens provided.
- Return type
list
- find_nn_parents(dependencies, tokens_d, token_idx)[source]¶
Parse CoreNLP dependencies to find parents of token.
To link named entities to attributes, parents need to be extracted for both entity tokens and attribute tokens. See extract_attributes for more information.
This is one possible strategy for finding parents of a given token: ascend the dependency tree until a parent of type "NN" is found. Do this for all parents. If, as it seems, each node has at most one parent, then the result will be either one index or no indices.
- Parameters
dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
tokens_d (dict) – CoreNLP token dictionary mapping token indices to tokens. See extract_attributes.
token_idx (int) – The index of the token for which parents need to be found.
- Returns
parents – A list of parents.
- Return type
list
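The ascent strategy described above might look roughly like the following sketch. It assumes the usual CoreNLP JSON layout, where each dependency has "governor" and "dependent" indices and each token has a "pos" tag, and it assumes that every token has at most one parent. The function name find_nn_ancestors and the hand-written toy parse are illustrative only, not the library's code or real CoreNLP output.

```python
def find_nn_ancestors(dependencies, tokens_d, token_idx):
    """Ascend from `token_idx` until a parent tagged "NN" is found."""
    # Map each dependent to its governor; governor 0 is the artificial root.
    parent_of = {dep["dependent"]: dep["governor"] for dep in dependencies}
    found = []
    seen = set()
    current = parent_of.get(token_idx, 0)
    while current != 0 and current not in seen:
        seen.add(current)  # guard against malformed, cyclic input
        if tokens_d[current]["pos"] == "NN":
            found.append(current)
            break
        current = parent_of.get(current, 0)
    return found


# Hand-written toy parse of "the dose was 5 mg".
tokens_d = {
    1: {"word": "the", "pos": "DT"},
    2: {"word": "dose", "pos": "NN"},
    3: {"word": "was", "pos": "VBD"},
    4: {"word": "5", "pos": "CD"},
    5: {"word": "mg", "pos": "NN"},
}
dependencies = [
    {"dep": "ROOT", "governor": 0, "dependent": 5},
    {"dep": "det", "governor": 2, "dependent": 1},
    {"dep": "nsubj", "governor": 5, "dependent": 2},
    {"dep": "cop", "governor": 5, "dependent": 3},
    {"dep": "nummod", "governor": 5, "dependent": 4},
]

parents_of_number = find_nn_ancestors(dependencies, tokens_d, 4)
parents_of_root = find_nn_ancestors(dependencies, tokens_d, 5)
```

With a single parent per node, the result contains at most one index, which matches the remark in the description.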
- get_core_nlp_analysis(text)[source]¶
Send a CoreNLP query and return the result.
- Parameters
text (str) – The text to analyze with CoreNLP.
- Returns
response_json – The CoreNLP response.
- Return type
dict
- get_entity_tokens(entity, tokens)[source]¶
Associate a spacy entity to CoreNLP tokens.
- Parameters
entity (spacy.tokens.Span) – A spacy entity extracted from the text. See extract_attributes for more details.
tokens (list) – CoreNLP tokens.
- Returns
ids – A list of CoreNLP token IDs corresponding to the given entity.
- Return type
list
- get_grobid_measurements(text)[source]¶
Get measurements for text from the Grobid server.
- Parameters
text (str) – The text for the query.
- Returns
measurements – All Grobid measurements extracted from the given text.
- Return type
list_like
- get_measurement_tokens(measurement, tokens)[source]¶
Associate a Grobid measurement to CoreNLP tokens.
See get_quantity_tokens for more details.
- Parameters
measurement (dict) – A Grobid measurement.
tokens (list) – CoreNLP tokens.
- Returns
ids – A list of CoreNLP token IDs corresponding to the given measurement.
- Return type
list
- get_measurement_type(measurement)[source]¶
Get the type of a Grobid measurement.
For measurements with multiple quantities the most common type is returned. In case of ties the empty type always loses.
- Parameters
measurement (dict) – A Grobid measurement.
- Returns
measurement_type – The type of the Grobid measurement.
- Return type
str
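The tie-breaking rule can be illustrated with a small sketch. It assumes the quantity types have already been collected into a list of strings, with the empty string standing for "no type". The function name most_common_type is hypothetical, and the behavior for ties between two non-empty types (first one encountered wins) is a choice made for this example only.

```python
from collections import Counter


def most_common_type(quantity_types):
    """Return the most frequent type; on ties the empty type loses."""
    counts = Counter(quantity_types)
    if not counts:
        return ""
    # Key: higher count first; on equal counts a non-empty type beats "".
    return max(counts, key=lambda t: (counts[t], t != ""))


tie_result = most_common_type(["", "mass"])
majority_empty = most_common_type(["", "", "mass"])
```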
- static get_overlapping_token_ids(start, end, tokens)[source]¶
Find tokens intersecting the interval [start, end).
CoreNLP breaks a given text down into sentences, and each sentence is broken down into tokens. These can be accessed by response['sentences'][sentence_id]['tokens'].
Each token corresponds to a position in the original text. This method determines which tokens intersect a given slice of this text.
- Parameters
start (int) – The left boundary of the interval.
end (int) – The right boundary of the interval.
tokens (list) – The CoreNLP sentence tokens.
- Returns
ids – A list of token indices that overlap with the given interval.
- Return type
list
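The overlap test for a half-open interval can be sketched as below. The sketch assumes that each token carries the "index", "characterOffsetBegin", and "characterOffsetEnd" fields found in CoreNLP's JSON output, and that the end offset is exclusive. The function name overlapping_token_ids and the toy tokens are illustrative only.

```python
def overlapping_token_ids(start, end, tokens):
    """Return indices of tokens that intersect the interval [start, end)."""
    ids = []
    for token in tokens:
        t_start = token["characterOffsetBegin"]
        t_end = token["characterOffsetEnd"]
        # Two half-open intervals intersect iff each starts before the
        # other one ends.
        if t_start < end and start < t_end:
            ids.append(token["index"])
    return ids


# Toy tokens for the text "dose 5 mg".
tokens = [
    {"index": 1, "word": "dose", "characterOffsetBegin": 0, "characterOffsetEnd": 4},
    {"index": 2, "word": "5", "characterOffsetBegin": 5, "characterOffsetEnd": 6},
    {"index": 3, "word": "mg", "characterOffsetBegin": 7, "characterOffsetEnd": 9},
]

quantity_ids = overlapping_token_ids(5, 9, tokens)
```

Note that an interval covering only whitespace between tokens, such as [4, 5) here, yields an empty list.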
- get_quantity_tokens(quantity, tokens)[source]¶
Associate a Grobid quantity to CoreNLP tokens.
Both the quantity and the tokens should originate from exactly the same text.
A quantity may be composed of multiple parts, e.g. a number and a unit, and therefore correspond to multiple CoreNLP tokens.
- Parameters
quantity (dict) – A Grobid quantity.
tokens (list) – CoreNLP tokens.
- Returns
ids – A list of CoreNLP token IDs corresponding to the given quantity.
- Return type
list
- static get_quantity_type(quantity)[source]¶
Get the type of a Grobid quantity.
The top-level Grobid object is a measurement. A measurement can contain one or more quantities.
Some Grobid quantities have a type attached to them, e.g. “mass”, “concentration”, etc. This is the type that is returned. For quantities without a type an empty string is returned.
- Parameters
quantity (dict) – A Grobid quantity.
- Returns
quantity_type – The type of the quantity.
- Return type
str
- static iter_parents(dependencies, token_idx)[source]¶
Iterate over all parents of a token.
It seems that each node has at most one parent, and that parent == 0 means no parent.
- Parameters
dependencies (list) – CoreNLP dependencies found in response['sentences'][idx]['basicDependencies'].
token_idx (int) – The index of the token for which parents need to be iterated.
- Yields
parent_idx (int) – The index of a parent token.
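A generator of this kind could be sketched as follows, assuming each CoreNLP dependency is a dict with "governor" and "dependent" indices and that governor 0 denotes the artificial root, i.e. no parent. The name iter_parent_ids and the toy dependencies are assumptions made for the example.

```python
def iter_parent_ids(dependencies, token_idx):
    """Yield the governor index of every dependency pointing at `token_idx`."""
    for dep in dependencies:
        # Governor 0 is the root placeholder, so it is not a real parent.
        if dep["dependent"] == token_idx and dep["governor"] != 0:
            yield dep["governor"]


# Hand-written toy dependencies, not real CoreNLP output.
toy_dependencies = [
    {"dep": "ROOT", "governor": 0, "dependent": 3},
    {"dep": "nsubj", "governor": 3, "dependent": 1},
    {"dep": "nummod", "governor": 3, "dependent": 2},
]

parents_of_first = list(iter_parent_ids(toy_dependencies, 1))
parents_of_root = list(iter_parent_ids(toy_dependencies, 3))
```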
- static iter_quantities(measurement)[source]¶
Iterate over quantities in a Grobid measurement.
- Parameters
measurement (dict) – A Grobid measurement.
- Yields
quantity (dict) – A Grobid quantity in the given measurement.
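Iterating over quantities might be sketched as below. The key names used here ("quantity", "quantityLeast", "quantityMost", "quantityBase", "quantityRange", "quantities") reflect an assumed layout of the Grobid measurement dictionaries and should be checked against the actual server response. The function name iter_quantities_sketch is hypothetical.

```python
# Assumed keys under which a measurement stores a single quantity.
SINGLE_QUANTITY_KEYS = (
    "quantity",
    "quantityLeast",
    "quantityMost",
    "quantityBase",
    "quantityRange",
)


def iter_quantities_sketch(measurement):
    """Yield every quantity found in a measurement dict (assumed layout)."""
    for key in SINGLE_QUANTITY_KEYS:
        if key in measurement:
            yield measurement[key]
    # Assumed key for measurements that hold a list of quantities.
    yield from measurement.get("quantities", [])


value = {"type": "value", "quantity": {"rawValue": "5"}}
interval = {
    "type": "interval",
    "quantityLeast": {"rawValue": "1"},
    "quantityMost": {"rawValue": "2"},
}

value_quantities = list(iter_quantities_sketch(value))
interval_values = [q["rawValue"] for q in iter_quantities_sketch(interval)]
```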
- measurement_to_str(measurement)[source]¶
Convert a Grobid measurement to string.
- Parameters
measurement (dict) – A Grobid measurement.
- Returns
quantities – String representations of the quantities in a measurement. If the measurement contains only one quantity then its string representation is returned as is. Otherwise a list of string representations of the quantities is returned.
- Return type
list or str
- process_raw_annotation_df(df, copy=True)[source]¶
Add standard columns to attribute data frame.
- Parameters
df (pd.DataFrame) – A data frame with measurements in a raw format. This can be obtained by calling extract_attributes with the parameter raw_attributes=True.
copy (bool) – If true then it is guaranteed that the original data frame won’t be modified.
- Returns
df – A modified data frame with the raw attribute column replaced by a number of more explicit columns using the standard nomenclature.
- Return type
pd.DataFrame