bluesearch.database package

Module contents

Embedding and Mining Databases.

class CORD19DatabaseCreation(data_path, engine)[source]

Bases: object

Create SQL database from a specified dataset.

Parameters
  • data_path (str or pathlib.Path) – Path to the dataset directory where metadata.csv and all the JSON files are located.

  • engine (SQLAlchemy.Engine) – Engine linked to the database.

max_text_length

Maximum length of values in a MySQL column of type TEXT. Text values have to be constrained to stay below this limit (especially articles.abstract and sentences.text).

Type

int
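
A minimal sketch of how such a limit might be enforced before insertion; the 65535 value and the variable names are illustrative assumptions, not taken from the source:

    # Hedged sketch: trimming a value so it fits a MySQL TEXT column.
    # The 65535 limit and the variable names are assumptions for illustration.
    max_text_length = 65535
    abstract = "word " * 20000  # stand-in for a very long abstract
    if len(abstract) > max_text_length:
        abstract = abstract[:max_text_length]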

check_is_english(text)[source]

Check if the given text is English.

Note that the algorithm seems to be non-deterministic, as mentioned in https://github.com/Mimino666/langdetect#basic-usage. This is why langdetect.DetectorFactory.seed = 0 is used.

Parameters

text (str) – Text to analyze.

Returns

lang – True if the provided text is in English, False otherwise. If the input text is an empty string, None is returned.

Return type

bool or None
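
The note above refers to langdetect's non-determinism; a hedged sketch of the seeding pattern it describes (standalone langdetect usage shown for illustration, not the method's exact internals):

    # Seed langdetect so repeated calls give the same result, as the note above suggests.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0
    print(detect("The virus spreads through respiratory droplets."))  # expected: 'en'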

construct()[source]

Construct the database.
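
A hedged usage sketch of creating and constructing the database, assuming a local CORD-19 dump and an SQLite engine; the database URL, data directory, and import location are illustrative assumptions:

    # Illustrative only: the URL and data path are assumptions, not documented defaults.
    import sqlalchemy
    from bluesearch.database import CORD19DatabaseCreation

    engine = sqlalchemy.create_engine("sqlite:///cord19.db")
    creator = CORD19DatabaseCreation(data_path="cord19_data/", engine=engine)
    creator.construct()  # builds the SQL database from metadata.csv and the JSON files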

segment(nlp, paragraphs)[source]

Segment a paragraph/article into sentences.

Parameters
  • nlp (spacy.language.Language) – Spacy pipeline applying sentence segmentation.

  • paragraphs (list of tuples (text, metadata)) – Paragraphs/articles as raw text to segment into sentences, given as [(text, metadata), ...].

Returns

all_sentences – List of all the sentences extracted from the paragraphs.

Return type

list of dict
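
A hedged sketch of calling segment, assuming an English spaCy pipeline and minimal metadata; the model name and metadata keys are assumptions for illustration:

    # Illustrative call; the spaCy model and metadata fields are assumptions.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # any pipeline with sentence segmentation
    paragraphs = [
        ("First sentence. Second sentence.", {"article_id": 1, "paragraph_pos": 0}),
    ]
    sentences = creator.segment(nlp, paragraphs)  # `creator` from the construct() sketch above
    print(len(sentences))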

class CreateMiningCache(database_engine, ee_models_paths, target_table_name, workers_per_model=1)[source]

Bases: object

Create SQL database to save results of mining into a cache.

Parameters
  • database_engine (sqlalchemy.engine.Engine) – Connection to the CORD-19 database.

  • ee_models_paths (dict[str, pathlib.Path]) – Dictionary mapping each entity type to the path of the model that detects it.

  • target_table_name (str) – The target table name for the mining results.

  • workers_per_model (int, optional) – Maximum number of processes to spawn per model to run text mining and table population in parallel.

construct()[source]

Construct and populate the cache of mined results.
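
A hedged usage sketch of building the mining cache, assuming one NER model per entity type on disk; the database URL, model path, table name, and import location are illustrative assumptions:

    # Illustrative only: the URL, model paths, and table name are assumptions.
    import pathlib
    import sqlalchemy
    from bluesearch.database import CreateMiningCache

    engine = sqlalchemy.create_engine("sqlite:///cord19.db")
    ee_models_paths = {"CHEMICAL": pathlib.Path("models/chemical_ner")}
    cache_creator = CreateMiningCache(
        database_engine=engine,
        ee_models_paths=ee_models_paths,
        target_table_name="mining_cache",
        workers_per_model=2,
    )
    cache_creator.construct()  # runs the mining and populates the cache table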

create_tasks(task_queues, workers_by_queue)[source]

Create tasks for the mining workers.

Parameters
  • task_queues (dict[str or pathlib.Path, multiprocessing.Queue]) – Task queues for different models. The keys are the model paths and the values are the actual queues.

  • workers_by_queue (dict[str]) – All worker processes working on tasks from a given queue.

do_mining()[source]

Do the parallelized mining.
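
create_tasks and do_mining are internal steps of construct(); a minimal sketch of the queue-per-model worker pattern they describe (the worker logic and the model path are assumptions, not the class's actual implementation):

    # Hedged illustration of one task queue per model path; not the actual worker code.
    import multiprocessing as mp

    def worker(queue):
        while True:
            task = queue.get()
            if task is None:  # sentinel: no more work for this model
                break
            # ... run the corresponding NER model on the sentences referenced by `task` ...

    if __name__ == "__main__":
        task_queues = {"models/chemical_ner": mp.Queue()}
        workers_by_queue = {
            name: [mp.Process(target=worker, args=(queue,)) for _ in range(2)]
            for name, queue in task_queues.items()
        }
        for procs in workers_by_queue.values():
            for p in procs:
                p.start()
        for queue in task_queues.values():
            for _ in range(2):
                queue.put(None)  # one sentinel per worker on that queue
        for procs in workers_by_queue.values():
            for p in procs:
                p.join()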

mark_bad_sentences(engine, sentences_table_name)[source]

Flag bad sentences in SQL database.

Parameters
  • engine (sqlalchemy.engine.Engine) – The connection to an SQL database.

  • sentences_table_name (str) – The table with sentences.

Raises

RuntimeError – If the column “is_bad” is missing in the table provided.
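
A hedged usage sketch of mark_bad_sentences, assuming the sentences table already contains an is_bad column; the database URL and import location are illustrative assumptions:

    # Illustrative call; the URL is an assumption, the arguments follow the signature above.
    import sqlalchemy
    from bluesearch.database import mark_bad_sentences

    engine = sqlalchemy.create_engine("sqlite:///cord19.db")
    mark_bad_sentences(engine, "sentences")  # raises RuntimeError if "is_bad" is missing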