bluesearch.search module

Collection of functions focused on searching.

class SearchEngine(embedding_models, precomputed_embeddings, indices, connection)[source]

Bases: object

Search locally using assets on disk.

This class requires several deep-learning modules to be loaded, and the pre-trained models, pre-computed embeddings, and SQL database to be held in memory. A construction sketch is given after the parameter list below.

This is more or less a wrapper around run_search from bluesearch.search.

Parameters
  • embedding_models (dict) – The pre-trained models.

  • precomputed_embeddings (dict) – The pre-computed embeddings.

  • indices (np.ndarray) – 1D array containing sentence_ids corresponding to the rows of each of the values of precomputed_embeddings.

  • connection (sqlalchemy.engine.Engine) – The database connection.
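
Example

A minimal construction sketch, not the package's official setup: the DummyModel class, the random embeddings, and the SQLite URL below are assumptions made for illustration, and the real embedding-model interface may differ.

from bluesearch.search import SearchEngine

import numpy as np
import sqlalchemy
import torch

class DummyModel:
    """Stand-in for a pre-trained sentence-embedding model (interface assumed)."""

    def embed(self, text):
        # A real model would encode `text`; here we return a random fixed-size vector.
        return torch.rand(768)

# Pre-trained models and their pre-computed sentence embeddings, keyed by model name.
embedding_models = {"DummyModel": DummyModel()}
precomputed_embeddings = {"DummyModel": torch.rand(1000, 768)}

# sentence_ids aligned with the rows of each value in precomputed_embeddings.
indices = np.arange(1, 1001)

# Connection to the SQL database holding the sentence and article tables (URL assumed).
connection = sqlalchemy.create_engine("sqlite:///cord19.db")

search_engine = SearchEngine(
    embedding_models=embedding_models,
    precomputed_embeddings=precomputed_embeddings,
    indices=indices,
    connection=connection,
)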

get_top_k_results(k, similarities, restricted_sentence_ids, granularity='sentences')[source]

Retrieve the top k results at sentence or article granularity. See the usage sketch after the returns below.

Parameters
  • k (int) – Top k results to retrieve.

  • similarities (torch.Tensor) – Similarity values.

  • restricted_sentence_ids (torch.Tensor) – Tensor containing the sentence_ids to keep when retrieving the top k results.

  • granularity (str) – One of (‘sentences’, ‘articles’).

Returns

  • top_sentence_ids (torch.Tensor) – 1D tensor containing the sentence_ids of the top results. Its size is either (k, ) or (len(restricted_sentence_ids), ): k for granularity = ‘sentences’, and the number of sentences covering k unique articles for granularity = ‘articles’.

  • top_similarities (torch.Tensor) – 1D tensor containing the similarities for each of the top sentences.
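
Example

An illustrative call, reusing the search_engine constructed above; the similarity scores and the restricted sentence_ids are placeholders, not real model output.

import torch

# One similarity score per pre-computed sentence embedding (placeholder values).
similarities = torch.rand(1000)

# Restrict retrieval to a subset of sentence_ids (here the first 500).
restricted_sentence_ids = torch.arange(1, 501)

top_sentence_ids, top_similarities = search_engine.get_top_k_results(
    k=10,
    similarities=similarities,
    restricted_sentence_ids=restricted_sentence_ids,
    granularity="sentences",
)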

query(which_model, k, query_text, granularity='sentences', has_journal=False, is_english=True, discard_bad_sentences=False, date_range=None, deprioritize_strength='None', exclusion_text='', inclusion_text='', deprioritize_text=None, verbose=True)[source]

Do the search. See the usage example after the returns below.

Parameters
  • which_model (str) – The name of the model to use.

  • k (int) – Number of top results to display.

  • query_text (str) – Query.

  • granularity (str) – One of (‘sentences’, ‘articles’). Search granularity.

  • has_journal (bool) – If True, only consider papers that have journal information.

  • is_english (bool) – If True, only consider papers that are in English.

  • discard_bad_sentences (bool) – If True, all sentences marked as bad quality will be discarded.

  • date_range (tuple) – Tuple of form (start_year, end_year) representing the considered time range.

  • deprioritize_text (str) – Text query of text to be deprioritized.

  • deprioritize_strength (str, {'None', 'Weak', 'Mild', 'Strong', 'Stronger'}) – How strong the deprioritization is.

  • exclusion_text (str) – Newline-separated collection of strings used to exclude sentences: if a sentence contains any of these strings, it is filtered out.

  • inclusion_text (str) – Newline-separated collection of strings: only sentences that contain all of these strings pass the filtering.

  • verbose (bool) – If True, print statistics to standard output.

Returns

  • sentence_ids (np.ndarray) – 1D array containing the sentence_ids of the top k most relevant sentences. Its size is either (k, ) or (len(restricted_sentence_ids), ).

  • similarities (np.ndarray) – 1D array representing the similarities for each of the top k sentences. Note that this includes the deprioritization part.

  • stats (dict) – Various statistics, with the following keys:

    • 'query_embed_time' – how long it took to embed query_text, in seconds

    • 'deprioritize_embed_time' – how long it took to embed deprioritize_text, in seconds
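
Example

A hedged usage example, reusing the engine from the construction sketch above; the model name, query text, and filter values are arbitrary assumptions.

sentence_ids, similarities, stats = search_engine.query(
    which_model="DummyModel",
    k=10,
    query_text="risk factors associated with severe outcomes",
    granularity="articles",
    has_journal=True,
    date_range=(2019, 2021),
    deprioritize_text="animal studies",
    deprioritize_strength="Mild",
    exclusion_text="retracted",
    verbose=False,
)

print(f"Query embedded in {stats['query_embed_time']:.3f} s")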