bluesearch.database package

Module contents

Embedding and Mining Databases.

class CORD19DatabaseCreation(data_path, engine)[source]

Bases: object

Create SQL database from a specified dataset.

Parameters
  • data_path (str or pathlib.Path) – Path to the dataset directory where metadata.csv and all the JSON files are located.

  • engine (SQLAlchemy.Engine) – Engine linked to the database.

max_text_length

Maximum length of values in a MySQL column of type TEXT. Text values have to be constrained to stay below this limit (especially articles.abstract and sentences.text).

Type

int
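
A minimal sketch of how such a limit might be enforced before insertion; the 65535 value and the variable names are illustrative assumptions, not taken from the source:

    # Hedged sketch: trimming a value so it fits a MySQL TEXT column.
    # The 65535 limit and the variable names are assumptions for illustration.
    max_text_length = 65535
    abstract = "word " * 20000  # stand-in for a very long abstract
    if len(abstract) > max_text_length:
        abstract = abstract[:max_text_length]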

check_is_english(text)[source]

Check if the given text is English.

Note that the algorithm seems to be non-deterministic, as mentioned in https://github.com/Mimino666/langdetect#basic-usage. This is why langdetect.DetectorFactory.seed = 0 is used.

Parameters

text (str) – Text to analyze.

Returns

lang – True if the provided text is in English, False otherwise. If the input text is an empty string, None is returned.

Return type

bool or None
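
The note above refers to langdetect's non-determinism; a hedged sketch of the seeding pattern it describes (standalone langdetect usage shown for illustration, not the method's exact internals):

    # Seed langdetect so repeated calls give the same result, as the note above suggests.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0
    print(detect("The virus spreads through respiratory droplets."))  # expected: 'en'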

construct()[source]

Construct the database.
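
A hedged usage sketch of creating and constructing the database, assuming a local CORD-19 dump and an SQLite engine; the database URL, data directory, and import location are illustrative assumptions:

    # Illustrative only: the URL and data path are assumptions, not documented defaults.
    import sqlalchemy
    from bluesearch.database import CORD19DatabaseCreation

    engine = sqlalchemy.create_engine("sqlite:///cord19.db")
    creator = CORD19DatabaseCreation(data_path="cord19_data/", engine=engine)
    creator.construct()  # builds the SQL database from metadata.csv and the JSON files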

segment(nlp, paragraphs)[source]

Segment a paragraph/article into sentences.

Parameters
  • nlp (spacy.language.Language) – Spacy pipeline applying sentence segmentation.

  • paragraphs (list of tuples (text, metadata)) – Paragraphs/articles as raw text to segment into sentences, given as [(text, metadata), ...].

Returns

all_sentences – List of all the sentences extracted from the paragraphs.

Return type

list of dict
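
A hedged sketch of calling segment, assuming an English spaCy pipeline and minimal metadata; the model name and metadata keys are assumptions for illustration:

    # Illustrative call; the spaCy model and metadata fields are assumptions.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # any pipeline with sentence segmentation
    paragraphs = [
        ("First sentence. Second sentence.", {"article_id": 1, "paragraph_pos": 0}),
    ]
    sentences = creator.segment(nlp, paragraphs)  # `creator` from the construct() sketch above
    print(len(sentences))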

class CreateMiningCache(database_engine, ee_models_paths, target_table_name, workers_per_model=1)[source]

Bases: object

Create SQL database to save results of mining into a cache.

Parameters
  • database_engine (sqlalchemy.engine.Engine) – Connection to the CORD-19 database.

  • ee_models_paths (dict[str, pathlib.Path]) – Dictionary mapping each entity type to the path of the model that detects it.

  • target_table_name (str) – The target table name for the mining results.

  • workers_per_model (int, optional) – Maximum number of processes to spawn per model to run text mining and table population in parallel.

construct()[source]

Construct and populate the cache of mined results.
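
A hedged usage sketch of building the mining cache, assuming one NER model per entity type on disk; the database URL, model path, table name, and import location are illustrative assumptions:

    # Illustrative only: the URL, model paths, and table name are assumptions.
    import pathlib
    import sqlalchemy
    from bluesearch.database import CreateMiningCache

    engine = sqlalchemy.create_engine("sqlite:///cord19.db")
    ee_models_paths = {"CHEMICAL": pathlib.Path("models/chemical_ner")}
    cache_creator = CreateMiningCache(
        database_engine=engine,
        ee_models_paths=ee_models_paths,
        target_table_name="mining_cache",
        workers_per_model=2,
    )
    cache_creator.construct()  # runs the mining and populates the cache table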

create_tasks(task_queues, workers_by_queue)[source]

Create tasks for the mining workers.

Parameters
  • task_queues (dict[str or pathlib.Path, multiprocessing.Queue]) – Task queues for different models. The keys are the model paths and the values are the actual queues.

  • workers_by_queue (dict[str]) – All worker processes working on tasks from a given queue.

do_mining()[source]

Do the parallelized mining.
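
create_tasks and do_mining are internal steps of construct(); a minimal sketch of the queue-per-model worker pattern they describe (the worker logic and the model path are assumptions, not the class's actual implementation):

    # Hedged illustration of one task queue per model path; not the actual worker code.
    import multiprocessing as mp

    def worker(queue):
        while True:
            task = queue.get()
            if task is None:  # sentinel: no more work for this model
                break
            # ... run the corresponding NER model on the sentences referenced by `task` ...

    if __name__ == "__main__":
        task_queues = {"models/chemical_ner": mp.Queue()}
        workers_by_queue = {
            name: [mp.Process(target=worker, args=(queue,)) for _ in range(2)]
            for name, queue in task_queues.items()
        }
        for procs in workers_by_queue.values():
            for p in procs:
                p.start()
        for queue in task_queues.values():
            for _ in range(2):
                queue.put(None)  # one sentinel per worker on that queue
        for procs in workers_by_queue.values():
            for p in procs:
                p.join()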

mark_bad_sentences(engine, sentences_table_name)[source]

Flag bad sentences in SQL database.

Parameters
  • engine (sqlalchemy.engine.Engine) – The connection to an SQL database.

  • sentences_table_name (str) – The table with sentences.

Raises

RuntimeError – If the column “is_bad” is missing in the table provided.
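
A hedged usage sketch of mark_bad_sentences, assuming the sentences table already contains an is_bad column; the database URL and import location are illustrative assumptions:

    # Illustrative call; the URL is an assumption, the arguments follow the signature above.
    import sqlalchemy
    from bluesearch.database import mark_bad_sentences

    engine = sqlalchemy.create_engine("sqlite:///cord19.db")
    mark_bad_sentences(engine, "sentences")  # raises RuntimeError if "is_bad" is missing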