bluesearch.sql module¶

SQL-related functions.

class SentenceFilter(connection)[source]¶

Bases: object

Filter sentence IDs by applying conditions.

Instantiate this class and apply filters by calling the corresponding filtering methods in any order. Finally, call either the run() or the iterate() method to obtain the filtered sentence IDs.

Example

import sqlalchemy
connection = sqlalchemy.create_engine("...")
filtered_sentence_ids = (
    SentenceFilter(connection)
    .only_with_journal()
    .restrict_sentences_ids_to([1, 2, 3, 4, 5])
    .date_range((2010, 2020))
    .exclude_strings(["virus", "disease"])
    .run()
)

When the run() or the iterate() method is called, an SQL query is constructed and executed internally. For the example above it would have approximately the following form:

SELECT sentence_id
FROM sentences
WHERE
    article_id IN (
        SELECT article_id
        FROM articles
        WHERE
            publish_time BETWEEN '2010-01-01' AND '2020-12-31' AND
            journal IS NOT NULL
    ) AND
    sentence_id IN ('1', '2', '3', '4', '5') AND
    text NOT LIKE '%virus%' AND
    text NOT LIKE '%disease%'

Parameters: connection (sqlalchemy.engine.Engine) – Connection to the database that contains the articles and sentences tables.
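How a chain of filters collapses into one query can be sketched without bluesearch itself. The following is a minimal, standalone illustration using the standard-library sqlite3 module. The helper build_query, the reduced table schema, and the sample rows are all assumptions made for this sketch; it is not the library's implementation.

```python
import sqlite3

# Hypothetical in-memory schema, reduced to the columns used by the query above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (article_id INTEGER, publish_time TEXT, journal TEXT);
    CREATE TABLE sentences (sentence_id INTEGER, article_id INTEGER, text TEXT);
    INSERT INTO articles VALUES
        (1, '2015-06-01', 'Nature'),
        (2, '2005-01-01', 'Science'),
        (3, '2018-03-15', NULL);
    INSERT INTO sentences VALUES
        (1, 1, 'The virus spreads quickly.'),
        (2, 1, 'Masks reduce transmission.'),
        (3, 2, 'Older study on transmission.'),
        (4, 3, 'Preprint without a journal.'),
        (5, 1, 'Outcome of the disease varies.');
    """
)


def build_query(year_range, sentence_ids, excluded):
    """Assemble a parameterized query of the shape shown above (illustrative only)."""
    # Article-level conditions go into the subquery on the articles table.
    article_conditions = ["publish_time BETWEEN ? AND ?", "journal IS NOT NULL"]
    params = [f"{year_range[0]}-01-01", f"{year_range[1]}-12-31"]

    # Sentence-level conditions are combined with AND in the outer query.
    placeholders = ", ".join("?" for _ in sentence_ids)
    sentence_conditions = [
        "article_id IN (SELECT article_id FROM articles WHERE "
        + " AND ".join(article_conditions)
        + ")",
        f"sentence_id IN ({placeholders})",
    ]
    params.extend(sentence_ids)

    # One NOT LIKE condition per excluded string.
    for string in excluded:
        sentence_conditions.append("text NOT LIKE ?")
        params.append(f"%{string}%")

    query = "SELECT sentence_id FROM sentences WHERE " + " AND ".join(sentence_conditions)
    return query, params


query, params = build_query((2010, 2020), [1, 2, 3, 4, 5], ["virus", "disease"])
filtered_ids = [row[0] for row in conn.execute(query + " ORDER BY sentence_id", params)]
print(filtered_ids)  # [2]
```

Only sentence 2 survives: sentences 1 and 5 contain an excluded string, sentence 3 belongs to an article outside the date range, and sentence 4 belongs to an article without a journal.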

date_range(date_range=None)[source]¶

Restrict to articles in a given date range.

Parameters: date_range (tuple or None) – A tuple with two elements of the form (start_year, end_year). If None, no date range is applied.
Returns: self – The instance of SentenceFilter itself. Useful for chained applications of filters.
Return type: SentenceFilter

discard_bad_sentences(flag=True)[source]¶

Discard sentences that are flagged as bad.

Parameters: flag (bool) – If True, then all sentences with True in the is_bad column are discarded.
Returns: self – The instance of SentenceFilter itself. Useful for chained applications of filters.
Return type: SentenceFilter

exclude_strings(strings)[source]¶

Exclude sentences containing any of the given strings.

Parameters: strings (list_like) – The strings to exclude.
Returns: self – The instance of SentenceFilter itself. Useful for chained applications of filters.
Return type: SentenceFilter

include_strings(strings)[source]¶

Include only sentences containing all of the given strings.

Parameters: strings (list_like) – The strings to include.
Returns: self – The instance of SentenceFilter itself. Useful for chained applications of filters.
Return type: SentenceFilter
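The difference between the two string filters (exclude if any string matches, include only if all strings match) can be illustrated with a plain-Python predicate. This is a sketch of the documented semantics only: the function passes is hypothetical, it uses case-sensitive substring checks, and the actual matching behaviour (for example case sensitivity of LIKE) depends on the database backend.

```python
def passes(text, include=(), exclude=()):
    """Return True if text contains all `include` strings and none of the `exclude` strings."""
    has_all_included = all(s in text for s in include)
    has_no_excluded = not any(s in text for s in exclude)
    return has_all_included and has_no_excluded


sentences = [
    "the virus causes disease",
    "the virus mutates",
    "vaccines prevent disease",
]

# Keep sentences mentioning "virus" but not "disease".
kept = [s for s in sentences if passes(s, include=["virus"], exclude=["disease"])]
print(kept)  # ['the virus mutates']

# Requiring several strings keeps only sentences that contain every one of them.
both = [s for s in sentences if passes(s, include=["virus", "disease"])]
print(both)  # ['the virus causes disease']
```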

iterate(chunk_size)[source]¶

Run the filtering query and iterate over restricted sentence IDs.

Parameters: chunk_size (int) – The size of the batches of sentence IDs that are yielded.
Yields: result_arr (np.ndarray) – A 1-dimensional numpy array with the filtered sentence IDs. Its length will be at most equal to chunk_size.
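Chunked iteration of this kind is commonly built on a cursor's fetchmany call. The sketch below shows the idea with the standard-library sqlite3 module; the generator iterate_ids and the one-column table are assumptions for illustration, and it yields plain lists rather than the numpy arrays that iterate() is documented to yield.

```python
import sqlite3


def iterate_ids(connection, query, chunk_size):
    """Yield the first column of the query result in lists of at most chunk_size items."""
    cursor = connection.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER)")
conn.executemany("INSERT INTO sentences VALUES (?)", [(i,) for i in range(1, 8)])

query = "SELECT sentence_id FROM sentences ORDER BY sentence_id"
chunks = list(iterate_ids(conn, query, chunk_size=3))
print(chunks)  # [[1, 2, 3], [4, 5, 6], [7]]
```

The last chunk is shorter than chunk_size, which matches the "at most" wording in the description of the yielded arrays.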

only_english(flag=True)[source]¶

Only select articles that are in English.

Parameters: flag (bool) – If True, then only articles that are in English will be selected.
Returns: self – The instance of SentenceFilter itself. Useful for chained applications of filters.
Return type: SentenceFilter

only_with_journal(flag=True)[source]¶

Only select articles with a journal.

Parameters: flag (bool) – If True, then only articles for which a journal was specified will be selected.
Returns: self – The instance of SentenceFilter itself. Useful for chained applications of filters.
Return type: SentenceFilter

restrict_sentences_ids_to(sentence_ids)[source]¶

Restrict sentence IDs to the given ones.

Parameters: sentence_ids (list_like) – The sentence IDs to restrict to.
Returns: self – The instance of SentenceFilter itself. Useful for chained applications of filters.
Return type: SentenceFilter

run()[source]¶

Run the filtering query to find restricted sentence IDs.

Returns: result_arr – A 1-dimensional numpy array with the filtered sentence IDs.
Return type: np.ndarray

get_titles(article_ids, engine)[source]¶

Get article titles from the SQL database.

Parameters

article_ids (iterable of int) – An iterable of article IDs.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

titles – Dictionary mapping article IDs to the article titles.

Return type

dict

retrieve_article_ids(engine)[source]¶

Retrieve all article IDs from the sentences table.

Parameters: engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
Returns: article_id_dict – Dictionary mapping each sentence_id to its corresponding article_id.
Return type: dict
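The shape of the returned dictionary can be illustrated with a small standalone sketch. It uses sqlite3 and a hypothetical two-column sentences table, and is not the library's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER, article_id INTEGER)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20)],
)

# Each result row is a (sentence_id, article_id) pair, so dict() builds the mapping directly.
article_id_dict = dict(conn.execute("SELECT sentence_id, article_id FROM sentences"))
print(article_id_dict)  # {1: 10, 2: 10, 3: 20}
```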

retrieve_article_metadata_from_article_id(article_id, engine)[source]¶

Retrieve article metadata given one article id.

Parameters

article_id (int) – Article ID for which to retrieve the metadata.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

article – DataFrame containing the article metadata. The columns are ‘article_id’, ‘cord_uid’, ‘sha’, ‘source_x’, ‘title’, ‘doi’, ‘pmcid’, ‘pubmed_id’, ‘license’, ‘abstract’, ‘publish_time’, ‘authors’, ‘journal’, ‘mag_id’, ‘who_covidence_id’, ‘arxiv_id’, ‘pdf_json_files’, ‘pmc_json_files’, ‘url’, ‘s2_id’.

Return type

pd.DataFrame

retrieve_articles(article_ids, engine)[source]¶

Retrieve articles given multiple article IDs.

Parameters

article_ids (list of int) – List of article IDs for which to retrieve the full article text.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

articles – DataFrame containing the articles divided into paragraphs. The columns are ‘article_id’, ‘paragraph_pos_in_article’, ‘text’, ‘section_name’.

Return type

pd.DataFrame

retrieve_mining_cache(identifiers, etypes, engine)[source]¶

Retrieve cached mining results.

Parameters

identifiers (list of tuple) – Tuples of the form (article_id, paragraph_pos_in_article). If paragraph_pos_in_article is -1, all paragraphs of that article are considered.
etypes (list) – List of entity types to consider. Duplicates are removed automatically.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

result – Selected rows of the mining_cache table.

Return type

pd.DataFrame
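The two conventions described for the parameters (a paragraph position of -1 selects the whole article, and duplicate entity types are dropped) can be sketched as a small preprocessing step. The helper split_identifiers is hypothetical and only illustrates how such input might be normalized before querying; how the library does this internally is not stated here.

```python
def split_identifiers(identifiers, etypes):
    """Separate whole-article requests from single-paragraph requests and deduplicate etypes."""
    unique_etypes = sorted(set(etypes))
    # A paragraph position of -1 means "all paragraphs of this article".
    whole_articles = sorted({article_id for article_id, pos in identifiers if pos == -1})
    single_paragraphs = sorted(
        {(article_id, pos) for article_id, pos in identifiers if pos != -1}
    )
    return unique_etypes, whole_articles, single_paragraphs


identifiers = [(1, -1), (2, 0), (2, 3), (2, 0)]
etypes = ["DISEASE", "GENE", "DISEASE"]

unique_etypes, whole_articles, single_paragraphs = split_identifiers(identifiers, etypes)
print(unique_etypes)      # ['DISEASE', 'GENE']
print(whole_articles)     # [1]
print(single_paragraphs)  # [(2, 0), (2, 3)]
```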

retrieve_paragraph(article_id, paragraph_pos_in_article, engine)[source]¶

Retrieve paragraph given one identifier (article_id, paragraph_pos_in_article).

Parameters

article_id (int) – Article id.
paragraph_pos_in_article (int) – Relative position of a paragraph in an article. Note that the numbering starts from 0.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

paragraph – pd.DataFrame with the paragraph and its metadata: article_id, text, section_name, paragraph_pos_in_article.

Return type

pd.DataFrame

retrieve_paragraph_from_sentence_id(sentence_id, engine)[source]¶

Retrieve paragraph given one sentence id.

Parameters

sentence_id (int) – Sentence ID for which to retrieve the paragraph.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

paragraph – If str then a paragraph containing the sentence of the given sentence_id. If None then the sentence_id was not found in the sentences table.

Return type

str or None
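The "str or None" contract can be sketched as follows. This standalone example assumes that a paragraph can be rebuilt by joining, with single spaces, all sentences that share the same article_id and paragraph_pos_in_article; that assumption, the helper paragraph_for_sentence, and the sample rows are illustrative and may differ from what the library actually does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    "sentence_id INTEGER, article_id INTEGER, paragraph_pos_in_article INTEGER, text TEXT)"
)
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    [(1, 10, 0, "First."), (2, 10, 0, "Second."), (3, 10, 1, "Third.")],
)


def paragraph_for_sentence(connection, sentence_id):
    """Return the paragraph text containing the sentence, or None if the ID is unknown."""
    location = connection.execute(
        "SELECT article_id, paragraph_pos_in_article FROM sentences WHERE sentence_id = ?",
        (sentence_id,),
    ).fetchone()
    if location is None:
        return None
    rows = connection.execute(
        "SELECT text FROM sentences "
        "WHERE article_id = ? AND paragraph_pos_in_article = ? "
        "ORDER BY sentence_id",
        location,
    ).fetchall()
    return " ".join(text for (text,) in rows)


print(paragraph_for_sentence(conn, 2))   # First. Second.
print(paragraph_for_sentence(conn, 99))  # None
```

Callers should check for None before using the result as a string.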

retrieve_sentences_from_sentence_ids(sentence_ids, engine, keep_order=False)[source]¶

Retrieve sentences given sentence ids.

Parameters

sentence_ids (iterable of int) – Sentence IDs for which to retrieve the text.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
keep_order (bool, optional) – If True, the rows of the resulting data frame follow the same order as the given sentence_ids. The default is False.

Returns

df_sentences – Pandas DataFrame containing all sentences and their corresponding metadata: article_id, sentence_id, section_name, text, paragraph_pos_in_article.

Return type

pd.DataFrame
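A database is free to return rows in any order unless told otherwise, which is presumably why keep_order exists. The reordering step can be sketched in plain Python; the helper reorder_rows and the dictionary rows are stand-ins for the data frame and are assumptions of this sketch.

```python
def reorder_rows(rows, sentence_ids):
    """Return rows arranged in the order of sentence_ids, skipping IDs that were not found."""
    by_id = {row["sentence_id"]: row for row in rows}
    return [by_id[sid] for sid in sentence_ids if sid in by_id]


# Rows as they might come back from the database, in arbitrary order.
rows = [
    {"sentence_id": 1, "text": "a"},
    {"sentence_id": 2, "text": "b"},
    {"sentence_id": 3, "text": "c"},
]

ordered = reorder_rows(rows, sentence_ids=[3, 1, 2, 7])
print([row["sentence_id"] for row in ordered])  # [3, 1, 2]
```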