bluesearch.embedding_models module
Models for handling sentence embeddings.
- class EmbeddingModel
Bases:
abc.ABC
Abstract interface for sentence embedding models.
- abstract property dim
Return dimension of the embedding.
- abstract embed(preprocessed_sentence)
Compute the sentence embedding for a given sentence.
- Parameters
preprocessed_sentence (str) – Preprocessed sentence to embed.
- Returns
embedding – One dimensional vector representing the embedding of the given sentence.
- Return type
numpy.ndarray
- embed_many(preprocessed_sentences)
Compute sentence embeddings for all provided sentences.
This is a default implementation. Child classes can implement more sophisticated batching schemes.
- Parameters
preprocessed_sentences (list of str) – List of preprocessed sentences.
- Returns
embeddings – 2D numpy array with shape (len(preprocessed_sentences), self.dim). Each row is an embedding of a sentence in preprocessed_sentences.
- Return type
np.ndarray
- preprocess(raw_sentence)
Preprocess the sentence (tokenization, …) if needed by the model.
This is a default implementation that performs no preprocessing. Model-specific preprocessing can be defined in child classes.
- Parameters
raw_sentence (str) – Raw sentence to embed.
- Returns
preprocessed_sentence – Preprocessed sentence in the format expected by the model, if the model needs one.
- preprocess_many(raw_sentences)
Preprocess multiple sentences.
This is a default implementation and can be overridden by child classes.
- Parameters
raw_sentences (list of str) – List of raw sentences to embed.
- Returns
preprocessed_sentences – List of preprocessed sentences corresponding to raw_sentences.
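The default methods described above can be pictured with a small stand-in. The sketch below does not import bluesearch: `ToyEmbeddingModel` and `CharCountModel` are hypothetical names that only mirror the documented interface, and the row stacking in `embed_many` is an assumption about what a default implementation might look like, not the library's actual code.

```python
from abc import ABC, abstractmethod

import numpy as np


class ToyEmbeddingModel(ABC):
    """Hypothetical stand-in mirroring the documented interface."""

    @property
    @abstractmethod
    def dim(self):
        """Return dimension of the embedding."""

    @abstractmethod
    def embed(self, preprocessed_sentence):
        """Return a 1D array of shape (dim,)."""

    def embed_many(self, preprocessed_sentences):
        # Assumed default: embed one sentence at a time and stack the rows.
        rows = [self.embed(s) for s in preprocessed_sentences]
        return np.array(rows).reshape(len(rows), self.dim)

    def preprocess(self, raw_sentence):
        # Default: no preprocessing.
        return raw_sentence

    def preprocess_many(self, raw_sentences):
        return [self.preprocess(s) for s in raw_sentences]


class CharCountModel(ToyEmbeddingModel):
    """Toy model counting vowels and consonants (not a real embedding)."""

    @property
    def dim(self):
        return 2

    def preprocess(self, raw_sentence):
        # Model-specific preprocessing: lowercase the sentence.
        return raw_sentence.lower()

    def embed(self, preprocessed_sentence):
        letters = [c for c in preprocessed_sentence if c.isalpha()]
        n_vowels = sum(c in "aeiou" for c in letters)
        return np.array([n_vowels, len(letters) - n_vowels], dtype=float)


model = CharCountModel()
sentences = model.preprocess_many(["Hello world", "ABC"])
embeddings = model.embed_many(sentences)
print(embeddings.shape)  # (2, 2), i.e. (len(sentences), model.dim)
```

A child class only has to provide `dim` and `embed`; overriding `embed_many` is worthwhile when the underlying model can process a whole batch in a single call.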
- class MPEmbedder(database_url, model_name_or_class, indices, h5_path_output, batch_size_inference=16, batch_size_transfer=1000, n_processes=2, checkpoint_path=None, gpus=None, delete_temp=True, temp_folder=None, h5_dataset_name=None, start_method='forkserver', preinitialize=True)
Bases:
object
Embedding of sentences with multiprocessing.
- Parameters
database_url (str) – URL of the database.
model_name_or_class (str) – The name or class of the model for which to compute the embeddings.
indices (np.ndarray) – 1D array storing the sentence_ids for which we want to compute the embedding.
h5_path_output (pathlib.Path) – Path where the output h5 file will be written.
batch_size_inference (int) – Number of sentences to preprocess and embed at the same time. Batching is expected to speed up inference considerably. The last batch has length n_sentences % batch_size (unless that is 0). Some models (SBioBERT) pad every sentence to the longest one in the batch, so a bigger batch size might not lead to a speedup.
batch_size_transfer (int) – Batch size to be used for transferring data from the temporary h5 files to the final h5 file.
n_processes (int) – Number of processes to use. Each process gets roughly len(indices) / n_processes sentences to embed.
checkpoint_path (pathlib.Path or None) – If ‘model_name_or_class’ is a class name, the path of the model to load. Otherwise, this argument is ignored.
gpus (None or list) – If not specified, all processes will be using CPU. If not None, then it needs to be a list of length n_processes where each element represents the GPU id (integer) to be used. None elements will be interpreted as CPU.
delete_temp (bool) – If True, the temporary h5 files are deleted after the final h5 is created. Disabling this flag is useful for testing and debugging purposes.
temp_folder (None or pathlib.Path) – If None, all temporary h5 files are stored in the same folder as the output h5 file. Otherwise they are stored in the specified folder.
h5_dataset_name (str or None) – The name of the dataset in the H5 file. If None, the value of ‘model_name_or_class’ is used.
start_method (str, {"fork", "forkserver", "spawn"}) – Start method for multiprocessing. Note that using “fork” might lead to problems when doing GPU inference.
preinitialize (bool) – If True, the model is instantiated once before multiprocessing starts so that any checkpoints are downloaded. That instance is deleted afterwards.
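How the work is divided is only described in prose here (each process gets about len(indices) / n_processes sentences, and gpus holds one entry per process). The following is a minimal sketch of that bookkeeping, assuming contiguous and nearly equal chunks. `split_indices` and `resolve_devices` are hypothetical helpers and not part of bluesearch; the real class may split the work differently.

```python
def split_indices(indices, n_processes):
    """Split indices into n_processes contiguous, nearly equal chunks."""
    base, extra = divmod(len(indices), n_processes)
    chunks, start = [], 0
    for i in range(n_processes):
        # The first `extra` chunks receive one additional element.
        size = base + (1 if i < extra else 0)
        chunks.append(indices[start:start + size])
        start += size
    return chunks


def resolve_devices(gpus, n_processes):
    """Return one device entry per process; None stands for CPU."""
    if gpus is None:
        return [None] * n_processes
    if len(gpus) != n_processes:
        raise ValueError("gpus must contain one entry per process")
    return list(gpus)


indices = list(range(1, 11))  # ten sentence ids
chunks = split_indices(indices, n_processes=3)
devices = resolve_devices([0, None, 1], n_processes=3)

for chunk, device in zip(chunks, devices):
    label = "cpu" if device is None else f"gpu {device}"
    print(label, chunk)
```

With ten ids and three processes the chunks have sizes 4, 3 and 3, and the second worker runs on CPU because its entry in gpus is None.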
- static run_embedding_worker(database_url, model_name_or_class, indices, temp_h5_path, batch_size, checkpoint_path, gpu, h5_dataset_name)
Run the per-worker function.
- Parameters
database_url (str) – URL of the database.
model_name_or_class (str) – The name or class of the model for which to compute the embeddings.
indices (np.ndarray) – 1D array of the sentence ids that the worker needs to embed.
temp_h5_path (pathlib.Path) – Path to where we store the temporary h5 file.
batch_size (int) – Number of sentences in the batch.
checkpoint_path (pathlib.Path or None) – If ‘model_name_or_class’ is a class name, the path of the model to load. Otherwise, this argument is ignored.
gpu (int or None) – If None, the worker uses a CPU. Otherwise, it uses the GPU with the specified id.
h5_dataset_name (str or None) – The name of the dataset in the H5 file.
- class SentTransformer(model_name_or_path, device=None)
Bases:
bluesearch.embedding_models.EmbeddingModel
Sentence Transformer.
- Parameters
model_name_or_path (pathlib.Path or str) – The name or the path of the Transformer model to load.
References
https://github.com/UKPLab/sentence-transformers
- property dim
Return dimension of the embedding.
- embed(preprocessed_sentence)
Compute the sentence embedding for a given sentence.
- Parameters
preprocessed_sentence (str) – Preprocessed sentence to embed.
- Returns
embedding – Embedding of the given sentence of shape (768,).
- Return type
numpy.ndarray
- embed_many(preprocessed_sentences)
Compute sentence embeddings for multiple sentences.
- Parameters
preprocessed_sentences (list of str) – Preprocessed sentences to embed.
- Returns
embedding – Embedding of the specified sentences of shape (len(preprocessed_sentences), 768).
- Return type
numpy.ndarray
- class SklearnVectorizer(checkpoint_path)
Bases:
bluesearch.embedding_models.EmbeddingModel
Simple wrapper for sklearn vectorizer models.
- Parameters
checkpoint_path (pathlib.Path or str) – The path of the scikit-learn model to use for the embeddings in Pickle format.
- property dim
Return dimension of the embedding.
- Returns
dim – The dimension of the embedding.
- Return type
int
- embed(preprocessed_sentence)
Embed one given sentence.
- Parameters
preprocessed_sentence (str) – Preprocessed sentence to embed. Can be obtained using the preprocess or preprocess_many methods.
- Returns
embedding – Array of shape (dim,) with the sentence embedding.
- Return type
numpy.ndarray
- embed_many(preprocessed_sentences)
Compute sentence embeddings for multiple sentences.
- Parameters
preprocessed_sentences (iterable of str) – Preprocessed sentences to embed. Can be obtained using the preprocess or preprocess_many methods.
- Returns
embeddings – Array of shape (len(preprocessed_sentences), dim) with the sentence embeddings.
- Return type
numpy.ndarray
- compute_database_embeddings(connection, model, indices, batch_size=10)
Compute sentence embeddings.
The embeddings are computed for a given model and a given database (articles with covid19_tag True).
- Parameters
connection (sqlalchemy.engine.Engine) – Connection to the database.
model (EmbeddingModel) – Instance of the EmbeddingModel of choice.
indices (np.ndarray) – 1D array storing the sentence_ids for which we want to perform the embedding.
batch_size (int) – Number of sentences to preprocess and embed at the same time. Batching is expected to speed up inference considerably. The last batch has length n_sentences % batch_size (unless that is 0). Some models (SBioBERT) pad every sentence to the longest one in the batch, so a bigger batch size might not lead to a speedup.
- Returns
final_embeddings (np.ndarray) – 2D numpy array with the sentence embeddings computed by the given model. Its shape is (len(retrieved_indices), dim).
retrieved_indices (np.ndarray) – 1D array of sentence_ids that we managed to embed. Note that the order corresponds exactly to the rows in final_embeddings.
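The note on the last batch length can be checked with a few lines of plain Python. This is only an illustration of the arithmetic; `iter_batches` is a hypothetical helper and not a function of this module.

```python
def iter_batches(sentence_ids, batch_size):
    """Yield consecutive slices of at most batch_size elements."""
    for start in range(0, len(sentence_ids), batch_size):
        yield sentence_ids[start:start + batch_size]


# 23 sentences with batch_size=10: the last batch has 23 % 10 = 3 elements.
lengths_uneven = [len(b) for b in iter_batches(list(range(23)), batch_size=10)]

# 20 sentences with batch_size=10: 20 % 10 == 0, so there is no short batch.
lengths_even = [len(b) for b in iter_batches(list(range(20)), batch_size=10)]

print(lengths_uneven)  # [10, 10, 3]
print(lengths_even)    # [10, 10]
```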
- get_embedding_model(model_name_or_class: str, checkpoint_path: Optional[Union[pathlib.Path, str]] = None, device: str = 'cpu') -> bluesearch.embedding_models.EmbeddingModel
Load a sentence embedding model from its name or its class and checkpoint.
Usage:
- For defined models:
BioBERT NLI+STS: get_embedding_model('BioBERT NLI+STS', device=<device>)
SBioBERT: get_embedding_model('SBioBERT', device=<device>)
SBERT: get_embedding_model('SBERT', device=<device>)
- For arbitrary models:
My Transformer model: get_embedding_model('SentTransformer', <model_name_or_path>, <device>)
My scikit-learn model: get_embedding_model('SklearnVectorizer', <checkpoint_path>)
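The two usage modes listed above (a predefined model name, or a class name plus a checkpoint) amount to a small dispatch. The sketch below illustrates that idea with stub classes and does not import bluesearch. `toy_get_embedding_model`, the stub classes, and the placeholder checkpoint string are all hypothetical, and the assumption that a predefined name resolves to a fixed transformer checkpoint is not confirmed by this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubSentTransformer:
    """Hypothetical stand-in for a transformer-based model."""
    model_name_or_path: str
    device: Optional[str] = None


@dataclass
class StubSklearnVectorizer:
    """Hypothetical stand-in for a scikit-learn based model."""
    checkpoint_path: str


# Placeholder identifier only; NOT the checkpoint bluesearch really loads.
PREDEFINED_MODELS = {"SBERT": "placeholder-sbert-checkpoint"}


def toy_get_embedding_model(model_name_or_class, checkpoint_path=None, device="cpu"):
    """Dispatch on a predefined model name or on a class name."""
    if model_name_or_class in PREDEFINED_MODELS:
        # Assumption: a predefined name maps to a fixed checkpoint.
        return StubSentTransformer(PREDEFINED_MODELS[model_name_or_class], device)
    if model_name_or_class not in ("SentTransformer", "SklearnVectorizer"):
        raise ValueError(f"Unknown model: {model_name_or_class!r}")
    if checkpoint_path is None:
        # A class name alone is not enough; the checkpoint must be given.
        raise ValueError("checkpoint_path is required when a class name is given")
    if model_name_or_class == "SentTransformer":
        return StubSentTransformer(str(checkpoint_path), device)
    return StubSklearnVectorizer(str(checkpoint_path))


named = toy_get_embedding_model("SBERT", device="cpu")
custom = toy_get_embedding_model("SklearnVectorizer", "my_model.pkl")
print(named)
print(custom)
```

Raising an error when a class name arrives without a checkpoint matches the documented rule that checkpoint_path is required in that case.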
- Parameters
model_name_or_class – The name or class of the embedding model to load.
checkpoint_path – Path of the embedding model to load. Required if ‘model_name_or_class’ is a class name.
device – The target device on which to load the model (‘cpu’ or ‘cuda’).
- Returns
sentence_embedding_model – The sentence embedding model instance.
- Return type