bluesearch.database.cord_19 module
Module for the Database Creation.
- class CORD19DatabaseCreation(data_path, engine)
Bases: object
Create SQL database from a specified dataset.
- Parameters
data_path (str or pathlib.Path) – Path to the dataset directory containing metadata.csv and all the JSON files.
engine (sqlalchemy.engine.Engine) – Engine linked to the database.
- max_text_length
Maximum length of values in a MySQL column of type TEXT. Text values must be constrained to be shorter than this value (especially articles.abstract and sentences.text).
- Type
int
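As a rough illustration of the constraint described above, the sketch below trims a string so that its UTF-8 encoding fits within a byte limit. The helper name `truncate_for_text_column` and the default of 65,535 bytes (the maximum size of a MySQL TEXT column) are assumptions made for this example; the actual value of max_text_length and the way the class enforces it are not stated here.

```python
# Assumption: 65,535 bytes is the MySQL TEXT limit; max_text_length may differ.
MYSQL_TEXT_MAX_BYTES = 65_535


def truncate_for_text_column(value: str, max_bytes: int = MYSQL_TEXT_MAX_BYTES) -> str:
    """Return `value` cut so that its UTF-8 encoding is at most `max_bytes` long."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Slicing bytes can split a multi-byte character; errors="ignore"
    # drops the incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Truncating by bytes rather than by characters matters for non-ASCII text, where one character may occupy several bytes.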
- check_is_english(text)
Check if the given text is English.
Note that the detection algorithm is non-deterministic, as mentioned in https://github.com/Mimino666/langdetect#basic-usage. This is the reason for setting langdetect.DetectorFactory.seed = 0.
- Parameters
text (str) – Text to analyze.
- Returns
lang – Whether the provided text is in English. If the input text is an empty string, None is returned.
- Return type
bool or None
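The behaviour described above can be sketched as follows. This is a minimal illustration and not the module's actual code: the optional `detect` parameter is hypothetical and exists only so that the example can run without langdetect installed. When it is omitted, the sketch falls back to `langdetect.detect` with the seed fixed to 0.

```python
def check_is_english(text, detect=None):
    """Return True/False for non-empty text, or None for an empty string."""
    if text == "":
        return None
    if detect is None:
        # Imported lazily so the sketch only needs langdetect when it is used.
        from langdetect import DetectorFactory, detect as langdetect_detect

        # Fixing the seed makes repeated calls return the same result.
        DetectorFactory.seed = 0
        detect = langdetect_detect
    # langdetect reports languages as ISO 639-1 codes such as "en".
    return detect(text) == "en"
```

Returning None instead of False for empty input lets callers tell "not English" apart from "nothing to analyze".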
- segment(nlp, paragraphs)
Segment a paragraph/article into sentences.
- Parameters
nlp (spacy.language.Language) – spaCy pipeline applying sentence segmentation.
paragraphs (list of tuples (text, metadata)) – Paragraphs or articles, given as raw text with their metadata, to segment into sentences: [(text, metadata), ...].
- Returns
all_sentences – List of all the sentences extracted from the given paragraphs.
- Return type
list of dict
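One plausible shape for this function is sketched below. It assumes that each metadata item is a dict and that every returned dict holds the sentence under a `"text"` key together with the metadata of its paragraph; both are assumptions, since the exact keys are not documented here. It relies on spaCy's `Language.pipe(..., as_tuples=True)`, which yields `(doc, context)` pairs.

```python
def segment(nlp, paragraphs):
    """Split each (text, metadata) pair into one dict per sentence."""
    all_sentences = []
    # as_tuples=True makes the pipeline carry each metadata object
    # alongside the Doc built from the matching text.
    for doc, metadata in nlp.pipe(paragraphs, as_tuples=True):
        for sent in doc.sents:
            # Assumption: metadata is a dict that is merged into the result.
            all_sentences.append({"text": sent.text, **metadata})
    return all_sentences
```

With spaCy installed, a blank English pipeline with the `sentencizer` component added should be enough to populate `doc.sents` for a quick try.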
- mark_bad_sentences(engine, sentences_table_name)
Flag bad sentences in SQL database.
- Parameters
engine (sqlalchemy.engine.Engine) – The connection to an SQL database.
sentences_table_name (str) – The table with sentences.
- Raises
RuntimeError – If the column “is_bad” is missing in the table provided.
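The column check and the flagging step could look roughly like the sketch below. To stay self-contained it uses the standard-library `sqlite3` module instead of a SQLAlchemy engine, and the rule for what counts as "bad" (sentences shorter than a hypothetical `min_length`) is invented for illustration; the criteria actually used by the module are not described here.

```python
import sqlite3


def mark_bad_sentences_sketch(conn, sentences_table_name, min_length=20):
    """Set is_bad = 1 for rows whose text is shorter than `min_length`."""
    # Table names cannot be passed as bound parameters, so validate first.
    if not sentences_table_name.isidentifier():
        raise ValueError(f"Invalid table name: {sentences_table_name!r}")
    # PRAGMA table_info returns one row per column; index 1 is the name.
    rows = conn.execute(f"PRAGMA table_info({sentences_table_name})")
    columns = {row[1] for row in rows}
    if "is_bad" not in columns:
        raise RuntimeError(f'Column "is_bad" is missing in {sentences_table_name}')
    # Hypothetical criterion: very short sentences are flagged as bad.
    conn.execute(
        f"UPDATE {sentences_table_name} SET is_bad = 1 WHERE LENGTH(text) < ?",
        (min_length,),
    )
    conn.commit()
```

Checking for the column before updating gives a clear RuntimeError up front, rather than a database-specific error raised from the UPDATE statement.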