bluesearch.search module

Collection of functions focused on searching.

class SearchEngine(embedding_models, precomputed_embeddings, indices, connection)[source]

Bases: object

Search locally using assets on disk.

This class requires several deep-learning modules to be loaded, and the pre-trained models, pre-computed embeddings, and SQL database to be held in memory. A construction sketch is given after the parameter list below.

This is more or less a wrapper around run_search from bluesearch.search.

Parameters
  • embedding_models (dict) – The pre-trained models.

  • precomputed_embeddings (dict) – The pre-computed embeddings.

  • indices (np.ndarray) – 1D array containing sentence_ids corresponding to the rows of each of the values of precomputed_embeddings.

  • connection (sqlalchemy.engine.Engine) – The database connection.
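
Example

A minimal construction sketch, not the package's official setup: the DummyModel class, the random embeddings, and the SQLite URL below are assumptions made for illustration, and the real embedding-model interface may differ.

from bluesearch.search import SearchEngine

import numpy as np
import sqlalchemy
import torch

class DummyModel:
    """Stand-in for a pre-trained sentence-embedding model (interface assumed)."""

    def embed(self, text):
        # A real model would encode `text`; here we return a random fixed-size vector.
        return torch.rand(768)

# Pre-trained models and their pre-computed sentence embeddings, keyed by model name.
embedding_models = {"DummyModel": DummyModel()}
precomputed_embeddings = {"DummyModel": torch.rand(1000, 768)}

# sentence_ids aligned with the rows of each value in precomputed_embeddings.
indices = np.arange(1, 1001)

# Connection to the SQL database holding the sentence and article tables (URL assumed).
connection = sqlalchemy.create_engine("sqlite:///cord19.db")

search_engine = SearchEngine(
    embedding_models=embedding_models,
    precomputed_embeddings=precomputed_embeddings,
    indices=indices,
    connection=connection,
)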

get_top_k_results(k, similarities, restricted_sentence_ids, granularity='sentences')[source]

Retrieve the top k results at sentence or article granularity. See the usage sketch after the returns below.

Parameters
  • k (int) – Top k results to retrieve.

  • similarities (torch.Tensor) – Similarity values.

  • restricted_sentence_ids (torch.Tensor) – Tensor containing the sentence_ids to keep when retrieving the top k results.

  • granularity (str) – One of (‘sentences’, ‘articles’).

Returns

  • top_sentence_ids (torch.Tensor) – 1D tensor containing the sentence_ids of the top results. Its size is either (k, ) or (len(restricted_sentence_ids), ): k for granularity = ‘sentences’, and the number of sentences covering k unique articles for granularity = ‘articles’.

  • top_similarities (torch.Tensor) – 1D tensor containing the similarities for each of the top sentences.
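
Example

An illustrative call, reusing the search_engine constructed above; the similarity scores and the restricted sentence_ids are placeholders, not real model output.

import torch

# One similarity score per pre-computed sentence embedding (placeholder values).
similarities = torch.rand(1000)

# Restrict retrieval to a subset of sentence_ids (here the first 500).
restricted_sentence_ids = torch.arange(1, 501)

top_sentence_ids, top_similarities = search_engine.get_top_k_results(
    k=10,
    similarities=similarities,
    restricted_sentence_ids=restricted_sentence_ids,
    granularity="sentences",
)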

query(which_model, k, query_text, granularity='sentences', has_journal=False, is_english=True, discard_bad_sentences=False, date_range=None, deprioritize_strength='None', exclusion_text='', inclusion_text='', deprioritize_text=None, verbose=True)[source]

Do the search. See the usage example after the returns below.

Parameters
  • which_model (str) – The name of the model to use.

  • k (int) – Number of top results to display.

  • query_text (str) – Query.

  • granularity (str) – One of (‘sentences’, ‘articles’). Search granularity.

  • has_journal (bool) – If True, only consider papers that have journal information.

  • is_english (bool) – If True, only consider papers that are in English.

  • discard_bad_sentences (bool) – If True, all sentences marked as bad quality will be discarded.

  • date_range (tuple) – Tuple of form (start_year, end_year) representing the considered time range.

  • deprioritize_text (str) – Text query of text to be deprioritized.

  • deprioritize_strength (str, {'None', 'Weak', 'Mild', 'Strong', 'Stronger'}) – How strong the deprioritization is.

  • exclusion_text (str) – Newline-separated collection of strings used to exclude sentences: if a sentence contains any of these strings, it is filtered out.

  • inclusion_text (str) – Newline-separated collection of strings: only sentences that contain all of these strings pass the filtering.

  • verbose (bool) – If True, print statistics to standard output.

Returns

  • sentence_ids (np.ndarray) – 1D array containing the sentence_ids of the top k most relevant sentences. Its size is either (k, ) or (len(restricted_sentence_ids), ).

  • similarities (np.ndarray) – 1D array representing the similarities for each of the top k sentences. Note that this includes the deprioritization part.

  • stats (dict) – Various statistics, with the following keys:

    • 'query_embed_time' – how long it took to embed query_text, in seconds

    • 'deprioritize_embed_time' – how long it took to embed deprioritize_text, in seconds
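
Example

A hedged usage example, reusing the engine from the construction sketch above; the model name, query text, and filter values are arbitrary assumptions.

sentence_ids, similarities, stats = search_engine.query(
    which_model="DummyModel",
    k=10,
    query_text="risk factors associated with severe outcomes",
    granularity="articles",
    has_journal=True,
    date_range=(2019, 2021),
    deprioritize_text="animal studies",
    deprioritize_strength="Mild",
    exclusion_text="retracted",
    verbose=False,
)

print(f"Query embedded in {stats['query_embed_time']:.3f} s")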