bluesearch.utils module

Generic Utils.

class H5[source]

Bases: object

H5 utilities.

static clear(h5_path, dataset_name, indices)[source]

Set selected rows to the fillvalue.

Parameters
  • h5_path (pathlib.Path) – Path to the h5 file.

  • dataset_name (str) – Name of the dataset.

  • indices (np.ndarray) – 1D array that determines the rows to be set to fillvalue.

static concatenate(h5_path_output, dataset_name, h5_paths_temp, delete_inputs=True, batch_size=2000)[source]

Concatenate multiple h5 files into one h5 file.

Parameters
  • h5_path_output (pathlib.Path) – Path to the h5 file. Note that this file can already exist and contain other datasets.

  • dataset_name (str) – Name of the dataset.

  • h5_paths_temp (list) –

    Paths to the input h5 files. Each of them is expected to contain 2 datasets:
    • {dataset_name} - dtype = float and shape (length, dim)

    • {dataset_name}_indices - dtype = int and shape (length, 1)

  • delete_inputs (bool) – If True, then all input h5 files are deleted once the concatenation is done.

  • batch_size (int) – Batch size to be used for transfers from the input h5 to the final one.

static create(h5_path, dataset_name, shape, dtype='f4')[source]

Create a dataset (and, if it does not exist yet, the h5 file).

Parameters
  • h5_path (pathlib.Path) – Path to the h5 file.

  • dataset_name (str) – Name of the dataset.

  • shape (tuple of int) – Two element tuple representing rows and columns.

  • dtype (str) – Dtype of the h5 array. See references for all the details.

Notes

Unpopulated rows will be filled with np.nan.

References

[1] http://docs.h5py.org/en/stable/faq.html#faq

static find_populated_rows(h5_path, dataset_name, batch_size=2000, verbose=False)[source]

Identify rows that are populated (i.e. not nan vectors).

Parameters
  • h5_path (pathlib.Path) – Path to the h5 file.

  • dataset_name (str) – Name of the dataset.

  • batch_size (int) – Number of rows to be loaded at a time.

  • verbose (bool) – Controls verbosity.

Returns

pop_rows – 1D numpy array of ints representing row indices of populated rows (not nan).

Return type

np.ndarray

static find_unpopulated_rows(h5_path, dataset_name, batch_size=2000, verbose=False)[source]

Return the indices of rows that are unpopulated.

Parameters
  • h5_path (pathlib.Path) – Path to the h5 file.

  • dataset_name (str) – Name of the dataset.

  • batch_size (int) – Number of rows to be loaded at a time.

  • verbose (bool) – Controls verbosity.

Returns

unpop_rows – 1D numpy array of ints representing row indices of unpopulated rows (nan).

Return type

np.ndarray
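
The populated/unpopulated split described above can be illustrated on a plain in-memory array. This is only a sketch, not the library's implementation: `split_rows` is a hypothetical helper, and it assumes that a row counts as unpopulated when any of its entries is nan (the documentation only says "nan vectors", so "any" versus "all" is a guess). The batching mirrors the role of the batch_size parameter.

```python
import numpy as np


def split_rows(array, batch_size=2000):
    """Return (populated, unpopulated) row indices of a 2D float array."""
    unpopulated = []
    for start in range(0, array.shape[0], batch_size):
        batch = array[start:start + batch_size]
        # Assumption: a row is unpopulated if any of its entries is NaN.
        is_unpop = np.isnan(batch).any(axis=1)
        unpopulated.extend(start + np.where(is_unpop)[0])
    unpopulated = np.array(unpopulated, dtype=int)
    populated = np.setdiff1d(np.arange(array.shape[0]), unpopulated)
    return populated, unpopulated


# Five rows, all NaN at first; rows 1 and 3 are then filled in.
data = np.full((5, 3), np.nan, dtype="f4")
data[[1, 3]] = 1.0
pop_rows, unpop_rows = split_rows(data, batch_size=2)
```

Processing the rows in batches keeps memory use bounded, which matters when the same scan is run against a large on-disk dataset rather than an array already in memory.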

static get_shape(h5_path, dataset_name)[source]

Get shape of a dataset.

Parameters
  • h5_path (pathlib.Path) – Path to the h5 file.

  • dataset_name (str) – Name of the dataset.

static load(h5_path, dataset_name, batch_size=500, indices=None, verbose=False)[source]

Load a dataset from an h5 file into memory.

Parameters
  • h5_path (pathlib.Path) – Path to the h5 file.

  • dataset_name (str) – Name of the dataset.

  • batch_size (int) – Number of rows to be loaded at a time.

  • indices (None or np.ndarray) – If None, all rows of the dataset are loaded. If np.ndarray, only the rows at the selected indices are loaded.

  • verbose (bool) – Controls verbosity.

Returns

res – Numpy array of shape (len(indices), …) holding the loaded rows.

Return type

np.ndarray

static write(h5_path, dataset_name, data, indices)[source]

Write a numpy array into an h5 file.

Parameters
  • h5_path (pathlib.Path) – Path to the h5 file.

  • dataset_name (str) – Name of the dataset.

  • data (np.ndarray) – 2D numpy array to be written into the h5 file.

  • indices (np.ndarray) – 1D numpy array that determines the row indices where the data is written.

class JSONL[source]

Bases: object

Collection of utility static functions handling jsonl files.

static dump_jsonl(data, path)[source]

Save a list of dictionaries to a jsonl file.

Parameters
  • data (list) – List of dictionaries (JSON objects).

  • path (pathlib.Path) – File where to save it.

static load_jsonl(path)[source]

Read jsonl into a list of dictionaries.

Parameters

path (pathlib.Path) – Path to the .jsonl file.

Returns

data – List of dictionaries.

Return type

list
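
For readers unfamiliar with the format, a jsonl file stores one JSON object per line. The round trip can be sketched with the standard library alone. The helpers `dump_jsonl_sketch` and `load_jsonl_sketch` below are hypothetical stand-ins written for illustration; they are not the bluesearch implementation, and details such as the encoding and the skipping of blank lines are assumptions.

```python
import json
import pathlib
import tempfile


def dump_jsonl_sketch(data, path):
    """Write one JSON object per line."""
    with pathlib.Path(path).open("w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")


def load_jsonl_sketch(path):
    """Read every non-empty line back as a dictionary."""
    with pathlib.Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = pathlib.Path(tmp_dir) / "records.jsonl"
    dump_jsonl_sketch(records, jsonl_path)
    # Each dictionary occupies exactly one line of the file.
    n_lines = len(jsonl_path.read_text(encoding="utf-8").splitlines())
    loaded = load_jsonl_sketch(jsonl_path)
```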

exception MissingEnvironmentVariable[source]

Bases: Exception

Exception for missing environment variables.
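
A typical use for such an exception is to fail early when required configuration is absent. The sketch below shows one possible pattern under stated assumptions: the class is redefined locally as a stand-in so the snippet is self-contained, and `require_env` as well as the variable names are hypothetical and not part of bluesearch.

```python
import os


class MissingEnvironmentVariable(Exception):
    """Local stand-in for bluesearch.utils.MissingEnvironmentVariable."""


def require_env(name):
    """Hypothetical helper: return the variable or raise if it is unset."""
    value = os.environ.get(name)
    if value is None:
        raise MissingEnvironmentVariable(
            f"Environment variable {name!r} is not set."
        )
    return value


# A variable that is set is returned unchanged.
os.environ["SKETCH_DEMO_VAR"] = "value"
found = require_env("SKETCH_DEMO_VAR")

# A variable that is unset triggers the exception.
os.environ.pop("SKETCH_DEMO_MISSING", None)
try:
    require_env("SKETCH_DEMO_MISSING")
    raised = False
except MissingEnvironmentVariable:
    raised = True
```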

class Timer(verbose=False)[source]

Bases: object

Convenience context manager that times functions and logs the results.

The order of execution is __call__, __enter__ and __exit__.

Parameters

verbose (bool) – If True, the elapsed time is printed to standard output whenever a process ends.

inst_time

Time of instantiation.

Type

float

name

Name of the process to be timed. The user can control the value via the __call__ magic.

Type

str or None

logs

Internal dictionary that stores all the times. The keys are the process names and the values are the elapsed times in seconds.

Type

dict

start_time

Time of the most recent __enter__ call. It is updated each time the context is entered.

Type

float or None

Examples

>>> import time
>>> from bluesearch.utils import Timer
>>>
>>> timer = Timer(verbose=False)
>>>
>>> with timer('experiment_1'):
...     time.sleep(0.05)
>>>
>>> with timer('experiment_2'):
...     time.sleep(0.02)
>>>
>>> assert set(timer.stats.keys()) == {'overall', 'experiment_1', 'experiment_2'}

property stats

Return all timing statistics.
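
The execution order stated for Timer (__call__, then __enter__, then __exit__) follows from how `with timer('name'):` is evaluated: the call expression runs first and must return an object usable as a context manager. The stand-in class below illustrates that mechanism only. `TimerSketch` and its `calls` list are hypothetical, the use of time.perf_counter is an assumption, and the 'overall' entry of the real stats property is not reproduced.

```python
import time


class TimerSketch:
    """Stand-in illustrating the __call__ -> __enter__ -> __exit__ order."""

    def __init__(self):
        self.inst_time = time.perf_counter()
        self.name = None
        self.start_time = None
        self.logs = {}
        self.calls = []  # records the order in which the hooks run

    def __call__(self, name):
        # Runs first: timer('x') is evaluated before the with block starts.
        self.calls.append("__call__")
        self.name = name
        return self

    def __enter__(self):
        self.calls.append("__enter__")
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.calls.append("__exit__")
        self.logs[self.name] = time.perf_counter() - self.start_time
        self.name = None
        self.start_time = None
        return False  # do not swallow exceptions raised in the block


timer = TimerSketch()
with timer("experiment_1"):
    time.sleep(0.01)
```

Returning self from __call__ is what lets a single instance accumulate results for several named blocks.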

check_entity_type_consistency(model_path: Union[str, pathlib.Path]) bool[source]

Check that the entity type in the model name matches the one in the NER pipe.

Parameters

model_path – Path to a spacy model directory.

Returns

True if the name of the model and the entity type name detected by the model are consistent, False otherwise.

Return type

bool

get_available_spacy_models(data_and_models_dir: Union[str, pathlib.Path]) Dict[str, pathlib.Path][source]

List available spacy models for a given data directory.

Parameters

data_and_models_dir – Path to the directory “data_and_models”. It should contain the models/ner_er and models/er directories with all spacy models.

Returns

Dictionary mapping the entity type to the spacy model path detecting it. Only the models following the naming convention are kept.

Return type

Dict[str, pathlib.Path]

load_spacy_model(model_name: Union[str, pathlib.Path], *args: Any, **kwargs: Any) spacy.language.Language[source]

Load a spaCy model and give an informative error message on failure.

Parameters
  • model_name – spaCy pipeline to load. It can be a package name or a local path.

  • *args – Positional arguments passed to spacy.load()

  • **kwargs – Keyword arguments passed to spacy.load()

Returns

Loaded spaCy pipeline.

Return type

spacy.language.Language

Raises

ModuleNotFoundError – If spaCy model loading failed due to non-existent package or local file.
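
The "informative error message" behaviour can be sketched as a thin wrapper that catches the loader's failure and re-raises it with guidance. The code below is an illustration under stated assumptions and not the bluesearch implementation: `load_with_hint` and `fake_loader` are hypothetical, the wording of the message is invented, and the wrapper assumes the underlying loader signals a missing model with OSError. A fake loader is used so that the snippet runs without spaCy installed.

```python
def load_with_hint(loader, model_name, *args, **kwargs):
    """Call loader(model_name, ...) and re-raise failures with guidance."""
    try:
        return loader(model_name, *args, **kwargs)
    except OSError as err:
        # Assumption: the loader signals a missing model with OSError.
        raise ModuleNotFoundError(
            f"Could not load model {model_name!r}. Check that the package "
            "is installed or that the local path exists."
        ) from err


def fake_loader(name):
    """Hypothetical loader that knows a single model."""
    if name != "known_model":
        raise OSError(f"Can't find model {name!r}")
    return f"pipeline:{name}"


# A known model is returned as-is.
loaded = load_with_hint(fake_loader, "known_model")

# An unknown model produces the more descriptive error.
try:
    load_with_hint(fake_loader, "unknown_model")
    error_message = None
except ModuleNotFoundError as err:
    error_message = str(err)
```

Chaining with `from err` keeps the original exception in the traceback, so the lower-level cause is not lost.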