bluesearch.embedding_models module

Module handling sentence embeddings.

class EmbeddingModel[source]

Bases: abc.ABC

Abstract interface for sentence embedding models.

abstract property dim

Return dimension of the embedding.

abstract embed(preprocessed_sentence)[source]

Compute the sentence embedding for a given sentence.

Parameters

preprocessed_sentence (str) – Preprocessed sentence to embed.

Returns

embedding – One-dimensional vector representing the embedding of the given sentence.

Return type

numpy.ndarray

embed_many(preprocessed_sentences)[source]

Compute sentence embeddings for all provided sentences.

This is a default implementation. Subclasses can implement more sophisticated batching schemes.

Parameters

preprocessed_sentences (list of str) – List of preprocessed sentences.

Returns

embeddings – 2D numpy array with shape (len(preprocessed_sentences), self.dim). Each row is an embedding of a sentence in preprocessed_sentences.

Return type

np.ndarray

preprocess(raw_sentence)[source]

Preprocess the sentence (tokenization, …) if needed by the model.

This is a default implementation that performs no preprocessing. Model-specific preprocessing can be defined in subclasses.

Parameters

raw_sentence (str) – Raw sentence to embed.

Returns

preprocessed_sentence – Preprocessed sentence in the format expected by the model.

Return type

str

preprocess_many(raw_sentences)[source]

Preprocess multiple sentences.

This is a default implementation and can be overridden by subclasses.

Parameters

raw_sentences (list of str) – Raw sentences to embed.

Returns

preprocessed_sentences – List of preprocessed sentences corresponding to raw_sentences.

Return type

list of str
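
To illustrate the contract, here is a minimal toy subclass; only dim and embed are abstract, and the character-statistics "model" below is purely illustrative, not part of the package:

```python
import numpy as np

from bluesearch.embedding_models import EmbeddingModel


class CharStatsModel(EmbeddingModel):
    """Toy subclass embedding a sentence via character-code statistics."""

    @property
    def dim(self):
        # `dim` and `embed` are the only abstract members; `preprocess`,
        # `preprocess_many` and `embed_many` are inherited defaults.
        return 4

    def embed(self, preprocessed_sentence):
        codes = np.fromiter((ord(c) for c in preprocessed_sentence), dtype=float)
        return np.array([codes.mean(), codes.std(), codes.min(), codes.max()])


model = CharStatsModel()
sentences = model.preprocess_many(["Hello world.", "Second sentence."])
embeddings = model.embed_many(sentences)
print(embeddings.shape)  # (2, 4) -- one row per sentence, per the docs above
```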

class MPEmbedder(database_url, model_name_or_class, indices, h5_path_output, batch_size_inference=16, batch_size_transfer=1000, n_processes=2, checkpoint_path=None, gpus=None, delete_temp=True, temp_folder=None, h5_dataset_name=None, start_method='forkserver', preinitialize=True)[source]

Bases: object

Embedding of sentences with multiprocessing.

Parameters
  • database_url (str) – URL of the database.

  • model_name_or_class (str) – The name or class of the model for which to compute the embeddings.

  • indices (np.ndarray) – 1D array storing the sentence_ids for which we want to compute the embedding.

  • h5_path_output (pathlib.Path) – Path where the output h5 file will be stored.

  • batch_size_inference (int) – Number of sentences to preprocess and embed at the same time. Batching should lead to major speedups. The last batch has length n_sentences % batch_size (unless that is 0). Some models (e.g. SBioBERT) pad to the longest sentence in the batch, so a bigger batch size might not lead to a speedup.

  • batch_size_transfer (int) – Batch size to be used for transferring data from the temporary h5 files to the final h5 file.

  • n_processes (int) – Number of processes to use. Note that each process gets len(indices) / n_processes sentences to embed.

  • checkpoint_path (pathlib.Path or None) – If ‘model_name_or_class’ is the class, the path of the model to load. Otherwise, this argument is ignored.

  • gpus (None or list) – If None, all processes use the CPU. Otherwise, it must be a list of length n_processes where each element is the GPU id (integer) to use; None elements are interpreted as CPU.

  • delete_temp (bool) – If True, the temporary h5 files are deleted after the final h5 is created. Disabling this flag is useful for testing and debugging purposes.

  • temp_folder (None or pathlib.Path) – If None, all temporary h5 files are stored in the same folder as the output h5 file. Otherwise, they are stored in the specified folder.

  • h5_dataset_name (str or None) – The name of the dataset in the H5 file. If None, the value of ‘model_name_or_class’ is used.

  • start_method (str, {"fork", "forkserver", "spawn"}) – Start method for multiprocessing. Note that using “fork” might lead to problems when doing GPU inference.

  • preinitialize (bool) – If True, the model is instantiated before multiprocessing starts in order to download any checkpoints. Once instantiated, the model is deleted.

do_embedding()[source]

Do the parallelized embedding.
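
A usage sketch; the database URL, sentence ids, and GPU ids below are illustrative values, not defaults of the package:

```python
import pathlib

import numpy as np

from bluesearch.embedding_models import MPEmbedder

embedder = MPEmbedder(
    database_url="sqlite:///data/cord19.db",  # hypothetical database URL
    model_name_or_class="SBioBERT",
    indices=np.arange(1, 1001),               # hypothetical sentence_ids
    h5_path_output=pathlib.Path("embeddings.h5"),
    batch_size_inference=16,
    n_processes=4,
    gpus=[0, 1, None, None],  # two workers on GPUs 0 and 1, two on CPU
)
embedder.do_embedding()  # runs the workers and merges the temporary h5 files
```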

static run_embedding_worker(database_url, model_name_or_class, indices, temp_h5_path, batch_size, checkpoint_path, gpu, h5_dataset_name)[source]

Run the embedding function for a single worker.

Parameters
  • database_url (str) – URL of the database.

  • model_name_or_class (str) – The name or class of the model for which to compute the embeddings.

  • indices (np.ndarray) – 1D array of sentence ids that the worker needs to embed.

  • temp_h5_path (pathlib.Path) – Path where the temporary h5 file is stored.

  • batch_size (int) – Number of sentences in the batch.

  • checkpoint_path (pathlib.Path or None) – If ‘model_name_or_class’ is the class, the path of the model to load. Otherwise, this argument is ignored.

  • gpu (int or None) – If None, we are going to use a CPU. Otherwise, we use a GPU with the specified id.

  • h5_dataset_name (str or None) – The name of the dataset in the H5 file.

class SentTransformer(model_name_or_path, device=None)[source]

Bases: bluesearch.embedding_models.EmbeddingModel

Sentence Transformer.

Parameters

model_name_or_path (pathlib.Path or str) – The name or the path of the Transformer model to load.

References

https://github.com/UKPLab/sentence-transformers

property dim

Return dimension of the embedding.

embed(preprocessed_sentence)[source]

Compute the sentence embedding for a given sentence.

Parameters

preprocessed_sentence (str) – Preprocessed sentence to embed.

Returns

embedding – Embedding of the given sentence of shape (768,).

Return type

numpy.ndarray

embed_many(preprocessed_sentences)[source]

Compute sentence embeddings for multiple sentences.

Parameters

preprocessed_sentences (list of str) – Preprocessed sentences to embed.

Returns

embeddings – Embeddings of the given sentences, of shape (len(preprocessed_sentences), 768).

Return type

numpy.ndarray
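
For instance (the model identifier below is an assumption; any name or path understood by the sentence-transformers library should work):

```python
from bluesearch.embedding_models import SentTransformer

# "bert-base-nli-mean-tokens" is an assumed sentence-transformers model name.
model = SentTransformer("bert-base-nli-mean-tokens", device="cpu")

embedding = model.embed(model.preprocess("Amino acids are organic compounds."))
print(embedding.shape)  # (768,) per the docs above
```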

class SklearnVectorizer(checkpoint_path)[source]

Bases: bluesearch.embedding_models.EmbeddingModel

Simple wrapper for sklearn vectorizer models.

Parameters

checkpoint_path (pathlib.Path or str) – The path of the pickled scikit-learn model to use for the embeddings.

property dim

Return dimension of the embedding.

Returns

dim – The dimension of the embedding.

Return type

int

embed(preprocessed_sentence)[source]

Embed one given sentence.

Parameters

preprocessed_sentence (str) – Preprocessed sentence to embed. Can be obtained using the preprocess or preprocess_many methods.

Returns

embedding – Array of shape (dim,) with the sentence embedding.

Return type

numpy.ndarray

embed_many(preprocessed_sentences)[source]

Compute sentence embeddings for multiple sentences.

Parameters

preprocessed_sentences (iterable of str) – Preprocessed sentences to embed. Can be obtained using the preprocess or preprocess_many methods.

Returns

embeddings – Array of shape (len(preprocessed_sentences), dim) with the sentence embeddings.

Return type

numpy.ndarray
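
A sketch of creating a compatible checkpoint and loading it; fitting and pickling a TfidfVectorizer is one plausible way to produce such a file (an assumption — the corpus and file name are illustrative):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

from bluesearch.embedding_models import SklearnVectorizer

# Create a pickled checkpoint -- corpus and path are hypothetical.
vectorizer = TfidfVectorizer()
vectorizer.fit(["first training sentence", "second training sentence"])
with open("tfidf.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

model = SklearnVectorizer("tfidf.pkl")
embedding = model.embed("a sentence to embed")
print(embedding.shape)  # (model.dim,)
```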

compute_database_embeddings(connection, model, indices, batch_size=10)[source]

Compute sentence embeddings.

The embeddings are computed for a given model and a given database (articles with covid19_tag True).

Parameters
  • connection (sqlalchemy.engine.Engine) – Connection to the database.

  • model (EmbeddingModel) – Instance of the EmbeddingModel of choice.

  • indices (np.ndarray) – 1D array storing the sentence_ids for which we want to perform the embedding.

  • batch_size (int) – Number of sentences to preprocess and embed at the same time. Batching should lead to major speedups. The last batch has length n_sentences % batch_size (unless that is 0). Some models (e.g. SBioBERT) pad to the longest sentence, so a bigger batch size might not lead to a speedup.

Returns

  • final_embeddings (np.ndarray) – 2D numpy array with all sentence embeddings for the given model. Its shape is (len(retrieved_indices), dim).

  • retrieved_indices (np.ndarray) – 1D array of sentence_ids that we managed to embed. Note that the order corresponds exactly to the rows in final_embeddings.
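
A usage sketch, assuming a local SQLite database (the URL and sentence ids are illustrative):

```python
import numpy as np
import sqlalchemy

from bluesearch.embedding_models import (
    compute_database_embeddings,
    get_embedding_model,
)

engine = sqlalchemy.create_engine("sqlite:///data/cord19.db")  # hypothetical URL
model = get_embedding_model("SBioBERT", device="cpu")

final_embeddings, retrieved_indices = compute_database_embeddings(
    connection=engine,
    model=model,
    indices=np.arange(1, 101),  # hypothetical sentence_ids
    batch_size=10,
)
print(final_embeddings.shape)  # (len(retrieved_indices), model.dim)
```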

get_embedding_model(model_name_or_class: str, checkpoint_path: Optional[Union[pathlib.Path, str]] = None, device: str = 'cpu') → bluesearch.embedding_models.EmbeddingModel[source]

Load a sentence embedding model from its name or its class and checkpoint.

Usage:

  • For defined models:
    • BioBERT NLI+STS: get_embedding_model(‘BioBERT NLI+STS’, device=<device>)

    • SBioBERT: get_embedding_model(‘SBioBERT’, device=<device>)

    • SBERT: get_embedding_model(‘SBERT’, device=<device>)

  • For arbitrary models:
    • My Transformer model: get_embedding_model(‘SentTransformer’, <model_name_or_path>, <device>)

    • My scikit-learn model: get_embedding_model(‘SklearnVectorizer’, <checkpoint_path>)

Parameters
  • model_name_or_class – The name or class of the embedding model to load.

  • checkpoint_path – If ‘model_name_or_class’ is the class, this parameter is required and it is the path of the embedding model to load.

  • device – The device on which to load the model (‘cpu’ or ‘cuda’).

Returns

sentence_embedding_model – The sentence embedding model instance.

Return type

EmbeddingModel
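
Putting the usage patterns above together (the checkpoint path in the second call is illustrative):

```python
from bluesearch.embedding_models import get_embedding_model

# Predefined model, loaded by name.
model = get_embedding_model("SBioBERT", device="cpu")

# Arbitrary scikit-learn vectorizer, loaded by class name plus checkpoint
# (the checkpoint path is hypothetical).
sklearn_model = get_embedding_model("SklearnVectorizer", checkpoint_path="tfidf.pkl")

embedding = model.embed(model.preprocess("A raw sentence."))
print(embedding.shape)  # (model.dim,)
```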