bluesearch.database.cord_19 module

Module for database creation.

class CORD19DatabaseCreation(data_path, engine)

Bases: object

Create SQL database from a specified dataset.

Parameters

data_path (str or pathlib.Path) – Directory of the dataset, where metadata.csv and all JSON files are located.
engine (sqlalchemy.engine.Engine) – Engine linked to the database.

max_text_length

Maximum length of values in a MySQL column of type TEXT. Text values must be constrained to be smaller than this value (especially articles.abstract and sentences.text).

Type: int
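
As a minimal sketch of how such a limit might be enforced before insertion: the helper name `truncate_utf8` and the constant below are hypothetical and not taken from this module. MySQL documents the TEXT limit as 65,535 bytes; whether `max_text_length` uses that exact value, and whether it counts bytes or characters, is an assumption here.

```python
# Hypothetical stand-in for max_text_length; MySQL TEXT holds up to 65,535 bytes.
ASSUMED_MAX_TEXT_BYTES = 65_535


def truncate_utf8(text, max_bytes=ASSUMED_MAX_TEXT_BYTES):
    """Cut text so that its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character split by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


abstract = "é" * 40_000  # 80,000 bytes once encoded
shortened = truncate_utf8(abstract)
print(len(shortened.encode("utf-8")) <= ASSUMED_MAX_TEXT_BYTES)  # True
```

Truncating on the encoded bytes rather than on characters keeps multi-byte text within the column limit without leaving a broken character at the end.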

check_is_english(text)

Check if the given text is English.

Note that the algorithm is non-deterministic, as mentioned in https://github.com/Mimino666/langdetect#basic-usage. This is the reason for setting langdetect.DetectorFactory.seed = 0.

Parameters

text (str) – Text to analyze.

Returns

lang – Whether the provided text is in English. If the input text is an empty string, None is returned.

Return type

bool or None
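
The behaviour described above can be sketched as follows. This is not the module's implementation: the `detect` argument is a hypothetical injection point added only so that the logic can run without langdetect installed, and the exact comparison against `"en"` is an assumption. `DetectorFactory.seed` and `detect` are the calls shown in the langdetect README.

```python
def check_is_english(text, detect=None):
    """Return True or False for English or not, None for an empty string.

    `detect` is a hypothetical parameter, not part of the documented signature.
    """
    if text == "":
        return None
    if detect is None:
        # Imported lazily; requires the third-party langdetect package.
        from langdetect import DetectorFactory, detect as langdetect_detect

        DetectorFactory.seed = 0  # fixed seed for reproducible results
        detect = langdetect_detect
    return detect(text) == "en"


# Demonstration with stub detectors standing in for langdetect.detect.
print(check_is_english(""))                                    # None
print(check_is_english("Hello world", detect=lambda _: "en"))  # True
print(check_is_english("Bonjour", detect=lambda _: "fr"))      # False
```

With langdetect installed, calling the function without the stub uses the real detector with the seed fixed at 0.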

construct()

Construct the database.

segment(nlp, paragraphs)

Segment a paragraph/article into sentences.

Parameters

nlp (spacy.language.Language) – spaCy pipeline applying sentence segmentation.
paragraphs (list of tuples (text, metadata)) – Paragraphs/articles in raw text to segment into sentences, given as [(text, metadata), ...].

Returns

all_sentences – List of all the sentences extracted from the paragraphs.

Return type

list of dict

mark_bad_sentences(engine, sentences_table_name)

Flag bad sentences in SQL database.

Parameters

engine (sqlalchemy.engine.Engine) – The connection to an SQL database.
sentences_table_name (str) – The table with sentences.

Raises

RuntimeError – If the column “is_bad” is missing in the table provided.
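
The precondition behind this RuntimeError can be illustrated with a small sketch. It uses the standard-library sqlite3 module as a stand-in for the SQLAlchemy engine, and the helper name `require_is_bad_column` is hypothetical; the actual function may perform the check differently.

```python
import sqlite3


def require_is_bad_column(conn, table_name):
    """Raise RuntimeError if the table has no "is_bad" column.

    table_name is interpolated into the statement, so it must be trusted.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    columns = [row[1] for row in rows]
    if "is_bad" not in columns:
        raise RuntimeError(f'Column "is_bad" is missing in table {table_name!r}')


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sentence_id INTEGER PRIMARY KEY, text TEXT)")

try:
    require_is_bad_column(conn, "sentences")
except RuntimeError as err:
    print(err)  # the column does not exist yet

conn.execute("ALTER TABLE sentences ADD COLUMN is_bad INTEGER DEFAULT 0")
require_is_bad_column(conn, "sentences")  # passes silently now
```

Checking for the column up front makes a schema mismatch fail early with a clear message rather than partway through an update.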