bluesearch.embedding_models module

Module handling sentence embeddings.

class EmbeddingModel[source]

Bases: abc.ABC

Abstract interface for sentence embedding models.

abstract property dim

Return dimension of the embedding.

abstract embed(preprocessed_sentence)[source]

Compute the sentence embedding for a given sentence.

Parameters

preprocessed_sentence (str) – Preprocessed sentence to embed.

Returns

embedding – One-dimensional vector representing the embedding of the given sentence.

Return type

numpy.ndarray

embed_many(preprocessed_sentences)[source]

Compute sentence embeddings for all provided sentences.

This is a default implementation. Subclasses can implement more sophisticated batching schemes.

Parameters

preprocessed_sentences (list of str) – List of preprocessed sentences.

Returns

embeddings – 2D numpy array with shape (len(preprocessed_sentences), self.dim). Each row is an embedding of a sentence in preprocessed_sentences.

Return type

np.ndarray

preprocess(raw_sentence)[source]

Preprocess the sentence (tokenization, …) if needed by the model.

This is a default implementation that performs no preprocessing. Model-specific preprocessing can be defined in subclasses.

Parameters

raw_sentence (str) – Raw sentence to embed.

Returns

preprocessed_sentence – Preprocessed sentence in the format expected by the model.

Return type

str

preprocess_many(raw_sentences)[source]

Preprocess multiple sentences.

This is a default implementation and can be overridden by subclasses.

Parameters

raw_sentences (list of str) – Raw sentences to embed.

Returns

preprocessed_sentences – List of preprocessed sentences corresponding to raw_sentences.

Return type

list of str
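
To illustrate the contract, here is a minimal toy subclass; only dim and embed are abstract, and the character-statistics "model" below is purely illustrative, not part of the package:

```python
import numpy as np

from bluesearch.embedding_models import EmbeddingModel


class CharStatsModel(EmbeddingModel):
    """Toy subclass embedding a sentence via character-code statistics."""

    @property
    def dim(self):
        # `dim` and `embed` are the only abstract members; `preprocess`,
        # `preprocess_many` and `embed_many` are inherited defaults.
        return 4

    def embed(self, preprocessed_sentence):
        codes = np.fromiter((ord(c) for c in preprocessed_sentence), dtype=float)
        return np.array([codes.mean(), codes.std(), codes.min(), codes.max()])


model = CharStatsModel()
sentences = model.preprocess_many(["Hello world.", "Second sentence."])
embeddings = model.embed_many(sentences)
print(embeddings.shape)  # (2, 4) -- one row per sentence, per the docs above
```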

class MPEmbedder(database_url, model_name_or_class, indices, h5_path_output, batch_size_inference=16, batch_size_transfer=1000, n_processes=2, checkpoint_path=None, gpus=None, delete_temp=True, temp_folder=None, h5_dataset_name=None, start_method='forkserver', preinitialize=True)[source]

Bases: object

Embedding of sentences with multiprocessing.

Parameters
  • database_url (str) – URL of the database.

  • model_name_or_class (str) – The name or class of the model for which to compute the embeddings.

  • indices (np.ndarray) – 1D array storing the sentence_ids for which we want to compute the embedding.

  • h5_path_output (pathlib.Path) – Path where the output h5 file will be stored.

  • batch_size_inference (int) – Number of sentences to preprocess and embed at the same time. Batching should lead to major speedups. The last batch has length n_sentences % batch_size (unless that is 0). Some models (e.g. SBioBERT) pad to the longest sentence in the batch, so a bigger batch size might not lead to a speedup.

  • batch_size_transfer (int) – Batch size to be used for transferring data from the temporary h5 files to the final h5 file.

  • n_processes (int) – Number of processes to use. Note that each process gets len(indices) / n_processes sentences to embed.

  • checkpoint_path (pathlib.Path or None) – If ‘model_name_or_class’ is the class, the path of the model to load. Otherwise, this argument is ignored.

  • gpus (None or list) – If None, all processes use the CPU. Otherwise, it must be a list of length n_processes where each element is the GPU id (integer) to use; None elements are interpreted as CPU.

  • delete_temp (bool) – If True, the temporary h5 files are deleted after the final h5 is created. Disabling this flag is useful for testing and debugging purposes.

  • temp_folder (None or pathlib.Path) – If None, all temporary h5 files are stored in the same folder as the output h5 file. Otherwise, they are stored in the specified folder.

  • h5_dataset_name (str or None) – The name of the dataset in the H5 file. If None, the value of ‘model_name_or_class’ is used.

  • start_method (str, {"fork", "forkserver", "spawn"}) – Start method for multiprocessing. Note that using “fork” might lead to problems when doing GPU inference.

  • preinitialize (bool) – If True, the model is instantiated before multiprocessing starts in order to download any checkpoints. Once instantiated, the model is deleted.

do_embedding()[source]

Do the parallelized embedding.
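
A usage sketch; the database URL, sentence ids, and GPU ids below are illustrative values, not defaults of the package:

```python
import pathlib

import numpy as np

from bluesearch.embedding_models import MPEmbedder

embedder = MPEmbedder(
    database_url="sqlite:///data/cord19.db",  # hypothetical database URL
    model_name_or_class="SBioBERT",
    indices=np.arange(1, 1001),               # hypothetical sentence_ids
    h5_path_output=pathlib.Path("embeddings.h5"),
    batch_size_inference=16,
    n_processes=4,
    gpus=[0, 1, None, None],  # two workers on GPUs 0 and 1, two on CPU
)
embedder.do_embedding()  # runs the workers and merges the temporary h5 files
```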

static run_embedding_worker(database_url, model_name_or_class, indices, temp_h5_path, batch_size, checkpoint_path, gpu, h5_dataset_name)[source]

Run the embedding function for a single worker.

Parameters
  • database_url (str) – URL of the database.

  • model_name_or_class (str) – The name or class of the model for which to compute the embeddings.

  • indices (np.ndarray) – 1D array of sentence ids that the worker needs to embed.

  • temp_h5_path (pathlib.Path) – Path where the temporary h5 file is stored.

  • batch_size (int) – Number of sentences in the batch.

  • checkpoint_path (pathlib.Path or None) – If ‘model_name_or_class’ is the class, the path of the model to load. Otherwise, this argument is ignored.

  • gpu (int or None) – If None, we are going to use a CPU. Otherwise, we use a GPU with the specified id.

  • h5_dataset_name (str or None) – The name of the dataset in the H5 file.

class SentTransformer(model_name_or_path, device=None)[source]

Bases: bluesearch.embedding_models.EmbeddingModel

Sentence Transformer.

Parameters

model_name_or_path (pathlib.Path or str) – The name or the path of the Transformer model to load.

References

https://github.com/UKPLab/sentence-transformers

property dim

Return dimension of the embedding.

embed(preprocessed_sentence)[source]

Compute the sentence embedding for a given sentence.

Parameters

preprocessed_sentence (str) – Preprocessed sentence to embed.

Returns

embedding – Embedding of the given sentence of shape (768,).

Return type

numpy.ndarray

embed_many(preprocessed_sentences)[source]

Compute sentence embeddings for multiple sentences.

Parameters

preprocessed_sentences (list of str) – Preprocessed sentences to embed.

Returns

embeddings – Embeddings of the given sentences, of shape (len(preprocessed_sentences), 768).

Return type

numpy.ndarray
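
For instance (the model identifier below is an assumption; any name or path understood by the sentence-transformers library should work):

```python
from bluesearch.embedding_models import SentTransformer

# "bert-base-nli-mean-tokens" is an assumed sentence-transformers model name.
model = SentTransformer("bert-base-nli-mean-tokens", device="cpu")

embedding = model.embed(model.preprocess("Amino acids are organic compounds."))
print(embedding.shape)  # (768,) per the docs above
```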

class SklearnVectorizer(checkpoint_path)[source]

Bases: bluesearch.embedding_models.EmbeddingModel

Simple wrapper for sklearn vectorizer models.

Parameters

checkpoint_path (pathlib.Path or str) – The path of the pickled scikit-learn model to use for the embeddings.

property dim

Return dimension of the embedding.

Returns

dim – The dimension of the embedding.

Return type

int

embed(preprocessed_sentence)[source]

Embed one given sentence.

Parameters

preprocessed_sentence (str) – Preprocessed sentence to embed. Can be obtained using the preprocess or preprocess_many methods.

Returns

embedding – Array of shape (dim,) with the sentence embedding.

Return type

numpy.ndarray

embed_many(preprocessed_sentences)[source]

Compute sentence embeddings for multiple sentences.

Parameters

preprocessed_sentences (iterable of str) – Preprocessed sentences to embed. Can be obtained using the preprocess or preprocess_many methods.

Returns

embeddings – Array of shape (len(preprocessed_sentences), dim) with the sentence embeddings.

Return type

numpy.ndarray
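
A sketch of creating a compatible checkpoint and loading it; fitting and pickling a TfidfVectorizer is one plausible way to produce such a file (an assumption — the corpus and file name are illustrative):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

from bluesearch.embedding_models import SklearnVectorizer

# Create a pickled checkpoint -- corpus and path are hypothetical.
vectorizer = TfidfVectorizer()
vectorizer.fit(["first training sentence", "second training sentence"])
with open("tfidf.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

model = SklearnVectorizer("tfidf.pkl")
embedding = model.embed("a sentence to embed")
print(embedding.shape)  # (model.dim,)
```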

compute_database_embeddings(connection, model, indices, batch_size=10)[source]

Compute sentence embeddings.

The embeddings are computed for a given model and a given database (articles with covid19_tag True).

Parameters
  • connection (sqlalchemy.engine.Engine) – Connection to the database.

  • model (EmbeddingModel) – Instance of the EmbeddingModel of choice.

  • indices (np.ndarray) – 1D array storing the sentence_ids for which we want to perform the embedding.

  • batch_size (int) – Number of sentences to preprocess and embed at the same time. Batching should lead to major speedups. The last batch has length n_sentences % batch_size (unless that is 0). Some models (e.g. SBioBERT) pad to the longest sentence, so a bigger batch size might not lead to a speedup.

Returns

  • final_embeddings (np.ndarray) – 2D numpy array with all sentence embeddings for the given model. Its shape is (len(retrieved_indices), dim).

  • retrieved_indices (np.ndarray) – 1D array of sentence_ids that we managed to embed. Note that the order corresponds exactly to the rows in final_embeddings.
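
A usage sketch, assuming a local SQLite database (the URL and sentence ids are illustrative):

```python
import numpy as np
import sqlalchemy

from bluesearch.embedding_models import (
    compute_database_embeddings,
    get_embedding_model,
)

engine = sqlalchemy.create_engine("sqlite:///data/cord19.db")  # hypothetical URL
model = get_embedding_model("SBioBERT", device="cpu")

final_embeddings, retrieved_indices = compute_database_embeddings(
    connection=engine,
    model=model,
    indices=np.arange(1, 101),  # hypothetical sentence_ids
    batch_size=10,
)
print(final_embeddings.shape)  # (len(retrieved_indices), model.dim)
```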

get_embedding_model(model_name_or_class: str, checkpoint_path: Optional[Union[pathlib.Path, str]] = None, device: str = 'cpu') → bluesearch.embedding_models.EmbeddingModel[source]

Load a sentence embedding model from its name or its class and checkpoint.

Usage:

  • For defined models:
    • BioBERT NLI+STS: get_embedding_model(‘BioBERT NLI+STS’, device=<device>)

    • SBioBERT: get_embedding_model(‘SBioBERT’, device=<device>)

    • SBERT: get_embedding_model(‘SBERT’, device=<device>)

  • For arbitrary models:
    • My Transformer model: get_embedding_model(‘SentTransformer’, <model_name_or_path>, <device>)

    • My scikit-learn model: get_embedding_model(‘SklearnVectorizer’, <checkpoint_path>)

Parameters
  • model_name_or_class – The name or class of the embedding model to load.

  • checkpoint_path – If ‘model_name_or_class’ is the class, this parameter is required and it is the path of the embedding model to load.

  • device – The device on which to load the model (‘cpu’ or ‘cuda’).

Returns

sentence_embedding_model – The sentence embedding model instance.

Return type

EmbeddingModel
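
Putting the usage patterns above together (the checkpoint path in the second call is illustrative):

```python
from bluesearch.embedding_models import get_embedding_model

# Predefined model, loaded by name.
model = get_embedding_model("SBioBERT", device="cpu")

# Arbitrary scikit-learn vectorizer, loaded by class name plus checkpoint
# (the checkpoint path is hypothetical).
sklearn_model = get_embedding_model("SklearnVectorizer", checkpoint_path="tfidf.pkl")

embedding = model.embed(model.preprocess("A raw sentence."))
print(embedding.shape)  # (model.dim,)
```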