bluesearch.sql module

SQL-related functions.

class SentenceFilter(connection)[source]

Bases: object

Filter sentence IDs by applying conditions.

Instantiate this class and apply filters by calling the corresponding filtering methods in any order. Finally, call either the run() or the iterate() method to obtain the filtered sentence IDs.

Example

import sqlalchemy
from bluesearch.sql import SentenceFilter

connection = sqlalchemy.create_engine("...")
filtered_sentence_ids = (
    SentenceFilter(connection)
    .only_with_journal()
    .restrict_sentences_ids_to([1, 2, 3, 4, 5])
    .date_range((2010, 2020))
    .exclude_strings(["virus", "disease"])
    .run()
)

When the run() or the iterate() method is called, an SQL query is constructed and executed internally. For the example above it would have approximately the following form:

SELECT sentence_id
FROM sentences
WHERE
    article_id IN (
        SELECT article_id
        FROM articles
        WHERE
            publish_time BETWEEN '2010-01-01' AND '2020-12-31' AND
            journal IS NOT NULL
    ) AND
    sentence_id IN ('1', '2', '3', '4', '5') AND
    text NOT LIKE '%virus%' AND
    text NOT LIKE '%disease%'

Parameters

connection (sqlalchemy.engine.Engine) – Connection to the database that contains the articles and sentences tables.
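To make the approximate query above concrete, the following sketch runs it against a toy in-memory SQLite database using only the standard library. It does not use SentenceFilter itself. The table and column names follow the query shown above, while the rows are invented for illustration. The ID list uses integer literals and an ORDER BY clause is added so that the output is deterministic; both are assumptions of this sketch.

```python
import sqlite3

# Toy in-memory database. Table and column names follow the query in the
# class description; the rows themselves are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (article_id INTEGER, publish_time TEXT, journal TEXT);
    CREATE TABLE sentences (sentence_id INTEGER, article_id INTEGER, text TEXT);
    INSERT INTO articles VALUES
        (1, '2015-06-01', 'Nature'),
        (2, '2005-03-15', 'Science'),
        (3, '2018-09-30', NULL);
    INSERT INTO sentences VALUES
        (1, 1, 'Neurons fire in bursts.'),
        (2, 1, 'The virus spreads quickly.'),
        (3, 2, 'Glial cells support neurons.'),
        (4, 3, 'Synapses adapt over time.'),
        (5, 1, 'Dendrites receive input.'),
        (6, 1, 'Axons transmit signals.');
    """
)

# Same shape as the approximate query, plus ORDER BY (an addition of this
# sketch) to make the result order deterministic.
query = """
SELECT sentence_id
FROM sentences
WHERE
    article_id IN (
        SELECT article_id
        FROM articles
        WHERE
            publish_time BETWEEN '2010-01-01' AND '2020-12-31' AND
            journal IS NOT NULL
    ) AND
    sentence_id IN (1, 2, 3, 4, 5) AND
    text NOT LIKE '%virus%' AND
    text NOT LIKE '%disease%'
ORDER BY sentence_id
"""
filtered_ids = [row[0] for row in conn.execute(query)]
print(filtered_ids)
```

Only sentences 1 and 5 survive: article 2 falls outside the date range, article 3 has no journal, sentence 2 contains an excluded string, and sentence 6 is not in the restricted ID list.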

date_range(date_range=None)[source]

Restrict to articles in a given date range.

Parameters

date_range (tuple or None) – A tuple with two elements of the form (start_year, end_year). If None, then no date range is applied.

Returns

self – The instance of SentenceFilter itself. Useful for chained applications of filters.

Return type

SentenceFilter
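As a rough illustration of how a (start_year, end_year) tuple could map to the BETWEEN bounds seen in the class-level example query, consider the sketch below. The helper name year_range_to_bounds is hypothetical and is not part of bluesearch; the actual conversion inside date_range() may differ.

```python
import datetime


def year_range_to_bounds(date_range):
    """Hypothetical helper: map (start_year, end_year) to ISO date bounds.

    Returns None when no date range is given, mirroring the documented
    behaviour that None applies no restriction.
    """
    if date_range is None:
        return None
    start_year, end_year = date_range
    # First day of the start year and last day of the end year, inclusive.
    start = datetime.date(start_year, 1, 1).isoformat()
    end = datetime.date(end_year, 12, 31).isoformat()
    return start, end


bounds = year_range_to_bounds((2010, 2020))
print(bounds)
```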

discard_bad_sentences(flag=True)[source]

Discard sentences that are flagged as bad.

Parameters

flag (bool) – If True, then all sentences with True in the is_bad column are discarded.

Returns

self – The instance of SentenceFilter itself. Useful for chained applications of filters.

Return type

SentenceFilter

exclude_strings(strings)[source]

Exclude sentences containing any of the given strings.

Parameters

strings (list_like) – The strings to exclude.

Returns

self – The instance of SentenceFilter itself. Useful for chained applications of filters.

Return type

SentenceFilter

include_strings(strings)[source]

Include only sentences containing all of the given strings.

Parameters

strings (list_like) – The strings to include.

Returns

self – The instance of SentenceFilter itself.

Return type

SentenceFilter
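The difference between include_strings() (all strings must occur) and exclude_strings() (none of the strings may occur) can be sketched as a conjunction of LIKE and NOT LIKE conditions. The helper like_conditions below is hypothetical, and it uses bound parameters rather than the inlined literals of the approximate query in the class description; that choice is an assumption of this sketch, not a statement about the library's internals.

```python
def like_conditions(include=(), exclude=()):
    """Hypothetical helper: build a WHERE fragment plus bound parameters."""
    clauses = []
    params = []
    # Every included string must be present in the sentence text.
    for s in include:
        clauses.append("text LIKE ?")
        params.append(f"%{s}%")
    # No excluded string may be present in the sentence text.
    for s in exclude:
        clauses.append("text NOT LIKE ?")
        params.append(f"%{s}%")
    # All conditions are combined with AND.
    return " AND ".join(clauses), params


where, params = like_conditions(include=["neuron"], exclude=["virus", "disease"])
print(where)
print(params)
```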

iterate(chunk_size)[source]

Run the filtering query and iterate over restricted sentence IDs.

Parameters

chunk_size (int) – The size of the batches of sentence IDs that are yielded.

Yields

result_arr (np.ndarray) – A 1-dimensional numpy array with the filtered sentence IDs. Its length will be at most equal to chunk_size.
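The batching behaviour described above can be sketched with the standard library alone. The generator iterate_ids below is hypothetical: it yields plain Python lists from a SQLite cursor, whereas the documented method yields numpy arrays and works through the SQLAlchemy connection.

```python
import sqlite3


def iterate_ids(conn, query, chunk_size):
    """Hypothetical sketch: yield lists of at most ``chunk_size`` IDs."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield [row[0] for row in rows]


# Toy table with seven sentence IDs, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER)")
conn.executemany("INSERT INTO sentences VALUES (?)", [(i,) for i in range(1, 8)])

query = "SELECT sentence_id FROM sentences ORDER BY sentence_id"
chunks = list(iterate_ids(conn, query, chunk_size=3))
print(chunks)
```

The last batch is shorter than chunk_size because only one ID remains, which matches the "at most equal to chunk_size" wording above.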

only_english(flag=True)[source]

Only select articles that are in English.

Parameters

flag (bool) – If True, then only articles that are in English will be selected.

Returns

self – The instance of SentenceFilter itself. Useful for chained applications of filters.

Return type

SentenceFilter

only_with_journal(flag=True)[source]

Only select articles with a journal.

Parameters

flag (bool) – If True, then only articles for which a journal was specified will be selected.

Returns

self – The instance of SentenceFilter itself. Useful for chained applications of filters.

Return type

SentenceFilter

restrict_sentences_ids_to(sentence_ids)[source]

Restrict sentence IDs to the given ones.

Parameters

sentence_ids (list_like) – The sentence IDs to restrict to.

Returns

self – The instance of SentenceFilter itself. Useful for chained applications of filters.

Return type

SentenceFilter

run()[source]

Run the filtering query to find restricted sentence IDs.

Returns

result_arr – A 1-dimensional numpy array with the filtered sentence IDs.

Return type

np.ndarray

get_titles(article_ids, engine)[source]

Get article titles from the SQL database.

Parameters
  • article_ids (iterable of int) – An iterable of article IDs.

  • engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

titles – Dictionary mapping article IDs to the article titles.

Return type

dict

retrieve_article_ids(engine)[source]

Retrieve all article_id values from the sentences table.

Parameters

engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

article_id_dict – Dictionary giving the corresponding article_id for a given sentence_id.

Return type

dict
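The shape of the returned mapping can be illustrated with a toy SQLite table. This is only a sketch of the sentence_id to article_id dictionary described above, built with the standard library; the rows are invented and the query used by the real function is an assumption.

```python
import sqlite3

# Toy sentences table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER, article_id INTEGER)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 11)],
)

# Map every sentence_id to the article_id it belongs to.
rows = conn.execute("SELECT sentence_id, article_id FROM sentences")
article_id_dict = {sentence_id: article_id for sentence_id, article_id in rows}
print(article_id_dict)
```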

retrieve_article_metadata_from_article_id(article_id, engine)[source]

Retrieve article metadata given one article id.

Parameters
  • article_id (int) – Article ID for which to retrieve the article metadata.

  • engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

article – DataFrame containing the article metadata. The columns are ‘article_id’, ‘cord_uid’, ‘sha’, ‘source_x’, ‘title’, ‘doi’, ‘pmcid’, ‘pubmed_id’, ‘license’, ‘abstract’, ‘publish_time’, ‘authors’, ‘journal’, ‘mag_id’, ‘who_covidence_id’, ‘arxiv_id’, ‘pdf_json_files’, ‘pmc_json_files’, ‘url’, ‘s2_id’.

Return type

pd.DataFrame

retrieve_articles(article_ids, engine)[source]

Retrieve articles given multiple article IDs.

Parameters
  • article_ids (list of int) – List of article IDs for which to retrieve the full article text.

  • engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

articles – DataFrame containing the articles divided into paragraphs. The columns are ‘article_id’, ‘paragraph_pos_in_article’, ‘text’, ‘section_name’.

Return type

pd.DataFrame

retrieve_mining_cache(identifiers, etypes, engine)[source]

Retrieve cached mining results.

Parameters
  • identifiers (list of tuple) – Tuples of the form (article_id, paragraph_pos_in_article). If paragraph_pos_in_article is -1, all paragraphs of that article are considered.

  • etypes (list) – List of entity types to consider. Duplicates are removed automatically.

  • engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

result – Selected rows of the mining_cache table.

Return type

pd.DataFrame
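The identifier convention above, where -1 stands for all paragraphs of an article, can be illustrated with a small sketch. The helper split_identifiers and the entity type names are hypothetical; how the library actually processes identifiers and etypes internally is not stated here.

```python
def split_identifiers(identifiers):
    """Hypothetical helper: separate whole-article from paragraph requests."""
    # A position of -1 means "all paragraphs of this article".
    whole_articles = sorted({a for a, pos in identifiers if pos == -1})
    # Any other position refers to one specific paragraph.
    paragraphs = sorted({(a, pos) for a, pos in identifiers if pos != -1})
    return whole_articles, paragraphs


identifiers = [(7, -1), (3, 0), (3, 2), (7, -1), (5, 1)]
whole_articles, paragraphs = split_identifiers(identifiers)

# Duplicate entity types are dropped; the names are invented examples.
etypes = sorted(set(["CHEMICAL", "DISEASE", "CHEMICAL"]))

print(whole_articles, paragraphs, etypes)
```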

retrieve_paragraph(article_id, paragraph_pos_in_article, engine)[source]

Retrieve paragraph given one identifier (article_id, paragraph_pos_in_article).

Parameters
  • article_id (int) – Article id.

  • paragraph_pos_in_article (int) – Relative position of a paragraph in an article. Note that the numbering starts from 0.

  • engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

paragraph – pd.DataFrame with the paragraph and its metadata: article_id, text, section_name, paragraph_pos_in_article.

Return type

pd.DataFrame

retrieve_paragraph_from_sentence_id(sentence_id, engine)[source]

Retrieve paragraph given one sentence id.

Parameters
  • sentence_id (int) – Sentence ID for which to retrieve the paragraph.

  • engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

Returns

paragraph – If str then a paragraph containing the sentence of the given sentence_id. If None then the sentence_id was not found in the sentences table.

Return type

str or None

retrieve_sentences_from_sentence_ids(sentence_ids, engine, keep_order=False)[source]

Retrieve sentences given sentence ids.

Parameters
  • sentence_ids (iterable of int) – Sentence IDs for which to retrieve the text.

  • engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.

  • keep_order (bool, optional) – If True, the rows of the resulting data frame are in the same order as the given sentence_ids. The default value is False.

Returns

df_sentences – Pandas DataFrame containing all sentences and their corresponding metadata: article_id, sentence_id, section_name, text, paragraph_pos_in_article.

Return type

pd.DataFrame
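A database query does not in general return rows in the order of the requested IDs, which is what keep_order addresses. The sketch below shows one way such a reordering could work on plain tuples. The helper reorder_rows is hypothetical and works on lists rather than a DataFrame, so it illustrates the idea, not the library's implementation.

```python
def reorder_rows(rows, sentence_ids):
    """Hypothetical helper: order (sentence_id, text) rows like ``sentence_ids``."""
    # Index the rows by sentence ID for constant-time lookup.
    by_id = {row[0]: row for row in rows}
    # Walk the requested IDs in order and skip any that were not found.
    return [by_id[sid] for sid in sentence_ids if sid in by_id]


# Rows as a database might return them, sorted by ID rather than by request.
rows = [(1, "a"), (2, "b"), (3, "c")]
ordered = reorder_rows(rows, [3, 1, 2])
print(ordered)
```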