bluesearch.database.cord_19 module

Module for database creation.

class CORD19DatabaseCreation(data_path, engine)[source]

Bases: object

Create SQL database from a specified dataset.

Parameters
  • data_path (str or pathlib.Path) – Directory of the dataset, where metadata.csv and all JSON files are located.

  • engine (sqlalchemy.engine.Engine) – Engine linked to the database.

max_text_length

Maximum length of values in a MySQL column of type TEXT. Text values must be constrained to be shorter than this value (especially articles.abstract and sentences.text).

Type

int
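
As a rough illustration of the constraint described above, the sketch below truncates a string so that its UTF-8 encoding fits within a byte budget. The helper name truncate_to_byte_limit and the default of 65,535 bytes (the maximum size of a MySQL TEXT column) are assumptions made for this example; the actual value stored in max_text_length, and the way the class applies it, may differ.

```python
MYSQL_TEXT_MAX_BYTES = 65_535  # maximum size of a MySQL TEXT column, in bytes


def truncate_to_byte_limit(text, max_bytes=MYSQL_TEXT_MAX_BYTES):
    """Hypothetical helper: cut `text` so its UTF-8 encoding fits in `max_bytes`."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character that was cut in half
    # at the truncation boundary instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Measuring in bytes rather than characters is a deliberate choice in this sketch, since non-ASCII characters occupy more than one byte in UTF-8.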

check_is_english(text)[source]

Check if the given text is English.

Note that the algorithm is non-deterministic, as mentioned in https://github.com/Mimino666/langdetect#basic-usage. This is the reason for setting langdetect.DetectorFactory.seed = 0.

Parameters

text (str) – Text to analyze.

Returns

lang – Whether the provided text is in English. If the input text is an empty string, None is returned.

Return type

bool or None
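
The documented behaviour can be sketched as follows. This is not the library's implementation: the detector is passed in as a parameter (an assumption made so that the example runs without langdetect installed), and only the empty-string case described above is handled specially.

```python
def check_is_english(text, detect):
    """Return True/False for non-empty text, None for an empty string.

    `detect` is any callable mapping a string to a language code such as "en".
    Injecting it is an assumption of this sketch, not the documented signature.
    """
    if text == "":
        return None
    return detect(text) == "en"


# Possible wiring with langdetect (requires `pip install langdetect`):
#
#     from langdetect import DetectorFactory, detect
#     DetectorFactory.seed = 0  # fix the seed so repeated calls agree
#     check_is_english("This is a short English sentence.", detect)
```

Fixing the seed once, before any detection call, is what makes repeated runs over the same text return the same answer.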

construct()[source]

Construct the database.

segment(nlp, paragraphs)[source]

Segment a paragraph/article into sentences.

Parameters
  • nlp (spacy.language.Language) – Spacy pipeline applying sentence segmentation.

  • paragraphs (list of tuples (text, metadata)) – Paragraphs or articles, as raw text, to segment into sentences. Each element has the form (text, metadata).

Returns

all_sentences – List of all the sentences extracted from the given paragraphs.

Return type

list of dict
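
A minimal sketch of this kind of segmentation is shown below. It assumes that each metadata entry is a dict and that every output dict carries the sentence under a "text" key merged with that metadata; both are assumptions made for illustration, since the keys of the returned dicts are not documented here.

```python
def segment(nlp, paragraphs):
    """Split each (text, metadata) pair into one dict per sentence."""
    all_sentences = []
    # as_tuples=True makes nlp.pipe yield (doc, context) pairs, so each
    # metadata dict stays attached to the document it came from.
    for doc, metadata in nlp.pipe(paragraphs, as_tuples=True):
        for sent in doc.sents:
            # The "text" key is an assumption made for this sketch.
            all_sentences.append({"text": sent.text, **metadata})
    return all_sentences


# Possible usage with spaCy v3 (requires `pip install spacy`):
#
#     import spacy
#     nlp = spacy.blank("en")
#     nlp.add_pipe("sentencizer")  # rule-based sentence boundaries
#     segment(nlp, [("First sentence. Second one.", {"paragraph_id": 0})])
```

Any pipeline that sets sentence boundaries should work in place of the sentencizer; the function itself only relies on nlp.pipe and doc.sents.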

mark_bad_sentences(engine, sentences_table_name)[source]

Flag bad sentences in SQL database.

Parameters
  • engine (sqlalchemy.engine.Engine) – The connection to an SQL database.

  • sentences_table_name (str) – The table with sentences.

Raises

RuntimeError – If the column “is_bad” is missing in the table provided.
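
The flagging step might look roughly like the sketch below. To keep the example self-contained it uses the standard-library sqlite3 module in place of a SQLAlchemy engine, and it treats very short sentences as bad; the min_length threshold, the "text" column name, and that criterion are all assumptions, since the actual rules for what counts as a bad sentence are not described here.

```python
import sqlite3


def mark_bad_sentences(connection, sentences_table_name, min_length=20):
    """Set is_bad = 1 on rows whose text is shorter than `min_length`.

    `connection` is a sqlite3.Connection standing in for the engine.
    The table name is interpolated into the SQL, so it must be trusted.
    """
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    info = connection.execute(f"PRAGMA table_info({sentences_table_name})")
    columns = [row[1] for row in info]
    if "is_bad" not in columns:
        raise RuntimeError(
            f'Column "is_bad" is missing in table {sentences_table_name}'
        )
    # Hypothetical criterion: sentences below the length threshold are bad.
    connection.execute(
        f"UPDATE {sentences_table_name} SET is_bad = 1 WHERE length(text) < ?",
        (min_length,),
    )
    connection.commit()
```

Checking for the column before issuing the UPDATE mirrors the documented RuntimeError and gives a clearer message than the database's own error would.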