bluesearch.sql module¶
SQL-related functions.
- class SentenceFilter(connection)[source]¶
Bases: object
Filter sentence IDs by applying conditions.
Instantiate this class and apply different filters by calling the corresponding filtering methods in any order. Finally, call either the run() or the stream() method to obtain the filtered sentence IDs.
Example
import sqlalchemy
from bluesearch.sql import SentenceFilter

connection = sqlalchemy.create_engine("...")
filtered_sentence_ids = (
    SentenceFilter(connection)
    .only_with_journal()
    .restrict_sentences_ids_to([1, 2, 3, 4, 5])
    .date_range((2010, 2020))
    .exclude_strings(["virus", "disease"])
    .run()
)
When the run() or the stream() method is called, an SQL query is constructed and executed internally. For the example above, it would have approximately the following form:
SELECT sentence_id
FROM sentences
WHERE article_id IN (
    SELECT article_id
    FROM articles
    WHERE publish_time BETWEEN '2010-01-01' AND '2020-12-31'
      AND journal IS NOT NULL
)
  AND sentence_id IN ('1', '2', '3', '4', '5')
  AND text NOT LIKE '%virus%'
  AND text NOT LIKE '%disease%'
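To see what a query of this shape selects, the sketch below runs it against a toy in-memory SQLite database using only the standard library. The table contents are invented for illustration, only the columns used in the query are created, and sentence IDs are compared as integers; the real schema has more columns, and the library constructs the query itself.

```python
import sqlite3

# Toy stand-ins for the articles and sentences tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (article_id INTEGER, publish_time TEXT, journal TEXT);
    CREATE TABLE sentences (sentence_id INTEGER, article_id INTEGER, text TEXT);
    INSERT INTO articles VALUES
        (1, '2015-06-01', 'Nature'),
        (2, '2005-01-15', 'Science'),
        (3, '2018-03-20', NULL);
    INSERT INTO sentences VALUES
        (1, 1, 'Glucose is a sugar.'),
        (2, 1, 'The virus spreads fast.'),
        (3, 2, 'Insulin lowers blood glucose.'),
        (4, 3, 'Neurons fire in bursts.'),
        (5, 1, 'Mice were kept at room temperature.');
    """
)

query = """
    SELECT sentence_id
    FROM sentences
    WHERE article_id IN (
        SELECT article_id
        FROM articles
        WHERE publish_time BETWEEN '2010-01-01' AND '2020-12-31'
          AND journal IS NOT NULL
    )
      AND sentence_id IN (1, 2, 3, 4, 5)
      AND text NOT LIKE '%virus%'
      AND text NOT LIKE '%disease%'
"""

# Article 2 is out of the date range and article 3 has no journal,
# so only sentences of article 1 without "virus" or "disease" remain.
filtered_ids = sorted(row[0] for row in conn.execute(query))
print(filtered_ids)
```

Each chained filter method corresponds to one condition in the WHERE clause, which is why the methods can be called in any order.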
- Parameters
connection (sqlalchemy.engine.Engine) – Connection to the database that contains the articles and sentences tables.
- date_range(date_range=None)[source]¶
Restrict to articles in a given date range.
- Parameters
date_range (tuple or None) – A tuple with two elements of the form (start_year, end_year). If None then no date range is applied.
- Returns
self – The instance of SentenceFilter itself. Useful for chained applications of filters.
- Return type
SentenceFilter
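Judging from the example query in the class description, a (start_year, end_year) tuple is expanded to the first day of the start year and the last day of the end year. A minimal sketch of that mapping follows; the helper name year_range_to_dates is hypothetical and not part of the module.

```python
def year_range_to_dates(date_range):
    """Hypothetical helper: map (start_year, end_year) to inclusive ISO dates."""
    start_year, end_year = date_range
    return f"{start_year:04d}-01-01", f"{end_year:04d}-12-31"


start, end = year_range_to_dates((2010, 2020))
# Both bounds are inclusive because SQL's BETWEEN is inclusive.
condition = f"publish_time BETWEEN '{start}' AND '{end}'"
print(condition)
```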
- discard_bad_sentences(flag=True)[source]¶
Discard sentences that are flagged as bad.
- Parameters
flag (bool) – If True, then all sentences with True in the is_bad column are discarded.
- Returns
self – The instance of SentenceFilter itself. Useful for chained applications of filters.
- Return type
SentenceFilter
- exclude_strings(strings)[source]¶
Exclude sentences containing any of the given strings.
- Parameters
strings (list_like) – The strings to exclude.
- Returns
self – The instance of SentenceFilter itself. Useful for chained applications of filters.
- Return type
SentenceFilter
- include_strings(strings)[source]¶
Include only sentences containing all of the given strings.
- Parameters
strings (list_like) – The strings to include.
- Returns
self – The instance of SentenceFilter itself.
- Return type
SentenceFilter
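The difference between the two string filters is that exclude_strings drops a sentence if it contains any of the strings, while include_strings keeps it only if it contains all of them. In SQL terms both reduce to LIKE conditions joined with AND. The sketch below illustrates this with the standard-library sqlite3 module; build_text_filter and the sample sentences are hypothetical, and the library's actual query construction may differ.

```python
import sqlite3


def build_text_filter(include=(), exclude=()):
    """Hypothetical helper: return a WHERE fragment and its parameters."""
    clauses = ["text LIKE ?"] * len(include) + ["text NOT LIKE ?"] * len(exclude)
    params = [f"%{s}%" for s in (*include, *exclude)]
    # Fall back to a condition that is always true when no strings are given.
    return " AND ".join(clauses) or "1 = 1", params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [
        (1, "The virus binds glucose receptors."),
        (2, "Glucose uptake rises with insulin."),
        (3, "Insulin is a hormone."),
    ],
)

where, params = build_text_filter(include=["glucose", "insulin"], exclude=["virus"])
rows = conn.execute(f"SELECT sentence_id FROM sentences WHERE {where}", params)
matching_ids = sorted(row[0] for row in rows)
print(matching_ids)
```

Note that SQLite's LIKE is case-insensitive for ASCII characters by default; other database backends may behave differently.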
- iterate(chunk_size)[source]¶
Run the filtering query and iterate over restricted sentence IDs.
- Parameters
chunk_size (int) – The size of the batches of sentence IDs that are yielded.
- Yields
result_arr (np.ndarray) – A 1-dimensional numpy array with the filtered sentence IDs. Its length will be at most equal to chunk_size.
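The chunking behaviour described above can be sketched with a plain database cursor. This is an illustration using the standard-library sqlite3 module, not the library's implementation: iterate_ids is a hypothetical helper, and it yields Python lists, whereas the method documented here yields numpy arrays.

```python
import sqlite3


def iterate_ids(conn, query, chunk_size):
    """Hypothetical helper: yield lists of at most chunk_size IDs."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER)")
conn.executemany("INSERT INTO sentences VALUES (?)", [(i,) for i in range(1, 8)])

query = "SELECT sentence_id FROM sentences ORDER BY sentence_id"
chunks = list(iterate_ids(conn, query, chunk_size=3))
# Seven IDs with chunk_size=3 give two full chunks and a shorter final one.
print(chunks)
```

Processing results chunk by chunk keeps memory usage bounded when the filtered set of sentence IDs is large.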
- only_english(flag=True)[source]¶
Only select articles that are in English.
- Parameters
flag (bool) – If True, then only articles that are in English will be selected.
- Returns
self – The instance of SentenceFilter itself. Useful for chained applications of filters.
- Return type
SentenceFilter
- only_with_journal(flag=True)[source]¶
Only select articles with a journal.
- Parameters
flag (bool) – If True, then only articles for which a journal was specified will be selected.
- Returns
self – The instance of SentenceFilter itself. Useful for chained applications of filters.
- Return type
SentenceFilter
- get_titles(article_ids, engine)[source]¶
Get article titles from the SQL database.
- Parameters
article_ids (iterable of int) – An iterable of article IDs.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
- Returns
titles – Dictionary mapping article IDs to the article titles.
- Return type
dict
- retrieve_article_ids(engine)[source]¶
Retrieve all article_id values from the sentences table.
- Parameters
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
- Returns
article_id_dict – Dictionary mapping each sentence_id to its corresponding article_id.
- Return type
dict
- retrieve_article_metadata_from_article_id(article_id, engine)[source]¶
Retrieve article metadata given one article id.
- Parameters
article_id (int) – Article ID for which to retrieve the metadata.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
- Returns
article – DataFrame containing the article metadata. The columns are ‘article_id’, ‘cord_uid’, ‘sha’, ‘source_x’, ‘title’, ‘doi’, ‘pmcid’, ‘pubmed_id’, ‘license’, ‘abstract’, ‘publish_time’, ‘authors’, ‘journal’, ‘mag_id’, ‘who_covidence_id’, ‘arxiv_id’, ‘pdf_json_files’, ‘pmc_json_files’, ‘url’, ‘s2_id’.
- Return type
pd.DataFrame
- retrieve_articles(article_ids, engine)[source]¶
Retrieve articles given multiple article IDs.
- Parameters
article_ids (list of int) – List of article IDs for which to retrieve the full article text.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
- Returns
articles – DataFrame containing the articles divided into paragraphs. The columns are ‘article_id’, ‘paragraph_pos_in_article’, ‘text’, ‘section_name’.
- Return type
pd.DataFrame
- retrieve_mining_cache(identifiers, etypes, engine)[source]¶
Retrieve cached mining results.
- Parameters
identifiers (list of tuple) – Tuples of the form (article_id, paragraph_pos_in_article). If paragraph_pos_in_article is -1, all paragraphs of that article are considered.
etypes (list) – List of entity types to consider. Duplicates are removed automatically.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
- Returns
result – Selected rows of the mining_cache table.
- Return type
pd.DataFrame
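The -1 convention for paragraph_pos_in_article and the removal of duplicate entity types can be illustrated without a database. The sketch below is one possible way to preprocess the arguments, not the library's implementation: split_identifiers is a hypothetical helper, and the entity type labels are made up for the example.

```python
def split_identifiers(identifiers):
    """Hypothetical helper: separate whole-article requests from paragraphs.

    A position of -1 means "all paragraphs of this article", so any specific
    paragraph of the same article is already covered and can be dropped.
    """
    whole_articles = sorted({art for art, pos in identifiers if pos == -1})
    paragraphs = sorted(
        {
            (art, pos)
            for art, pos in identifiers
            if pos != -1 and art not in whole_articles
        }
    )
    return whole_articles, paragraphs


identifiers = [(3, -1), (7, 0), (7, 2), (3, 5)]
etypes = ["GENE", "DISEASE", "GENE"]  # hypothetical entity type labels

whole_articles, paragraphs = split_identifiers(identifiers)
unique_etypes = sorted(set(etypes))  # duplicates removed
print(whole_articles, paragraphs, unique_etypes)
```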
- retrieve_paragraph(article_id, paragraph_pos_in_article, engine)[source]¶
Retrieve paragraph given one identifier (article_id, paragraph_pos_in_article).
- Parameters
article_id (int) – Article id.
paragraph_pos_in_article (int) – Relative position of a paragraph in an article. Note that the numbering starts from 0.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
- Returns
paragraph – pd.DataFrame with the paragraph and its metadata: article_id, text, section_name, paragraph_pos_in_article.
- Return type
pd.DataFrame
- retrieve_paragraph_from_sentence_id(sentence_id, engine)[source]¶
Retrieve paragraph given one sentence id.
- Parameters
sentence_id (int) – Sentence ID for which to retrieve the paragraph.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
- Returns
paragraph – If str, the paragraph containing the sentence with the given sentence_id. If None, the sentence_id was not found in the sentences table.
- Return type
str or None
- retrieve_sentences_from_sentence_ids(sentence_ids, engine, keep_order=False)[source]¶
Retrieve sentences given sentence ids.
- Parameters
sentence_ids (iterable of int) – Sentence IDs for which to retrieve the text.
engine (sqlalchemy.engine.Engine) – SQLAlchemy Engine connected to the database.
keep_order (bool, optional) – If True, make sure that the sentence IDs in the resulting data frame appear in the same order as in sentence_ids. The default value is False.
- Returns
df_sentences – Pandas DataFrame containing all sentences and their corresponding metadata: article_id, sentence_id, section_name, text, paragraph_pos_in_article.
- Return type
pd.DataFrame
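An SQL query with an IN clause does not guarantee that rows come back in the order the IDs were given, which is presumably why keep_order exists. The sketch below shows one way to restore the input order, using the standard-library sqlite3 module and plain tuples instead of a pandas DataFrame; fetch_sentences and the sample data are hypothetical.

```python
import sqlite3


def fetch_sentences(conn, sentence_ids, keep_order=False):
    """Hypothetical helper: fetch (sentence_id, text) rows for the given IDs."""
    ids = list(sentence_ids)
    placeholders = ", ".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT sentence_id, text FROM sentences "
        f"WHERE sentence_id IN ({placeholders})",
        ids,
    ).fetchall()
    if keep_order:
        # Sort the rows by the position of their ID in the input.
        position = {sid: i for i, sid in enumerate(ids)}
        rows.sort(key=lambda row: position[row[0]])
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [(1, "first"), (2, "second"), (3, "third")],
)

ordered = fetch_sentences(conn, [3, 1, 2], keep_order=True)
print(ordered)
```

Reordering costs an extra sort, so leaving keep_order at its default is reasonable when the order of the results does not matter.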