bluesearch.utils module

Generic Utils.

class H5

Bases: object

H5 utilities.

static clear(h5_path, dataset_name, indices)

Set selected rows to the fillvalue.

Parameters

h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
indices (np.ndarray) – 1D array that determines the rows to be set to fillvalue.

static concatenate(h5_path_output, dataset_name, h5_paths_temp, delete_inputs=True, batch_size=2000)

Concatenate multiple h5 files into one h5 file.

Parameters

h5_path_output (pathlib.Path) – Path to the h5 file. Note that this file can already exist and contain other datasets.
dataset_name (str) – Name of the dataset.
h5_paths_temp (list) – Paths to the input h5 files. Note that each of them will have 2 datasets:
- {dataset_name} - dtype = float and shape (length, dim)
- {dataset_name}_indices - dtype = int and shape (length, 1)
delete_inputs (bool) – If True, then all input h5 files are deleted once the concatenation is done.
batch_size (int) – Batch size to be used for transfers from the input h5 to the final one.
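
The batched transfer described above can be pictured with plain NumPy arrays standing in for the h5 datasets. This is a minimal sketch, not the library's implementation: the helper name `concatenate_in_memory` and its `n_rows` argument are hypothetical, and it assumes that the values stored in `{dataset_name}_indices` are the destination rows in the output dataset.

```python
import numpy as np


def concatenate_in_memory(parts, n_rows, batch_size=2000):
    """Merge (data, indices) pairs into one NaN-filled array, batch by batch.

    parts: list of (data, indices) with shapes (length, dim) and (length, 1).
    n_rows: number of rows of the output array (hypothetical argument).
    """
    dim = parts[0][0].shape[1]
    out = np.full((n_rows, dim), np.nan, dtype="f4")

    for data, indices in parts:
        for start in range(0, len(data), batch_size):
            stop = start + batch_size
            # Assumption: the stored indices are the destination rows.
            rows = indices[start:stop, 0]
            out[rows] = data[start:stop]

    return out


part_a = (np.ones((2, 3), dtype="f4"), np.array([[0], [3]]))
part_b = (np.full((1, 3), 2.0, dtype="f4"), np.array([[1]]))
merged = concatenate_in_memory([part_a, part_b], n_rows=5, batch_size=1)
```

Rows that no input refers to (here rows 2 and 4) keep the NaN fill value.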

static create(h5_path, dataset_name, shape, dtype='f4')

Create a dataset (and potentially also a h5 file).

Parameters

h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
shape (tuple of int) – Two element tuple representing rows and columns.
dtype (str) – Dtype of the h5 array. See references for all the details.

Notes

Unpopulated rows will be filled with np.nan.

References

[1] http://docs.h5py.org/en/stable/faq.html#faq

static find_populated_rows(h5_path, dataset_name, batch_size=2000, verbose=False)

Identify rows that are populated, i.e. rows that are not nan vectors.

Parameters

h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
batch_size (int) – Number of rows to be loaded at a time.
verbose (bool) – Controls verbosity.

Returns

pop_rows – 1D numpy array of ints representing row indices of populated rows (not nan).

Return type

np.ndarray

static find_unpopulated_rows(h5_path, dataset_name, batch_size=2000, verbose=False)

Return the indices of rows that are unpopulated.

Parameters

h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
batch_size (int) – Number of rows to be loaded at a time.
verbose (bool) – Controls verbosity.

Returns

unpop_rows – 1D numpy array of ints representing row indices of unpopulated rows (nan).

Return type

np.ndarray
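
A minimal in-memory sketch of the batched scan behind the two row-finding helpers above, with a NumPy array standing in for the h5 dataset. The function name `split_rows` is hypothetical, and treating a row as unpopulated when it contains any NaN is an assumption; the library may apply a different test.

```python
import numpy as np


def split_rows(dataset, batch_size=2000):
    """Return (populated, unpopulated) row indices, scanning in batches."""
    populated, unpopulated = [], []

    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # Assumption: any NaN in a row marks the row as unpopulated.
        is_unpop = np.isnan(batch).any(axis=1)
        # Shift the batch-local positions back to dataset row numbers.
        unpopulated.extend(np.where(is_unpop)[0] + start)
        populated.extend(np.where(~is_unpop)[0] + start)

    return np.array(populated, dtype=int), np.array(unpopulated, dtype=int)


dataset = np.full((5, 2), np.nan, dtype="f4")  # all rows start unpopulated
dataset[[1, 4]] = [[0.1, 0.2], [0.3, 0.4]]     # populate two rows
pop_rows, unpop_rows = split_rows(dataset, batch_size=2)
```

Only `batch_size` rows are inspected at a time, which is what keeps memory use bounded on a large dataset.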

static get_shape(h5_path, dataset_name)

Get shape of a dataset.

Parameters

h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.

static load(h5_path, dataset_name, batch_size=500, indices=None, verbose=False)

Load a dataset from an h5 file into memory.

Parameters

h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
batch_size (int) – Number of rows to be loaded at a time.
indices (None or np.ndarray) – If None, all rows of the dataset are loaded. If an np.ndarray, only the rows at the selected indices are loaded.
verbose (bool) – Controls verbosity.

Returns

res – Numpy array of shape (len(indices), …) holding the loaded rows.

Return type

np.ndarray
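
The interplay of `batch_size` and `indices` can be sketched with a NumPy array standing in for the h5 dataset. The helper `load_rows` is hypothetical and only illustrates the idea of copying the requested rows one batch at a time; it is not the library's code.

```python
import numpy as np


def load_rows(dataset, batch_size=500, indices=None):
    """Copy rows of `dataset` into a new array, `batch_size` rows at a time."""
    if indices is None:
        indices = np.arange(len(dataset))  # no selection: take every row

    res = np.empty((len(indices), *dataset.shape[1:]), dtype=dataset.dtype)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        res[start:start + len(batch)] = dataset[batch]

    return res


dataset = np.arange(12, dtype="f4").reshape(6, 2)
subset = load_rows(dataset, batch_size=2, indices=np.array([0, 3, 5]))
everything = load_rows(dataset, batch_size=4)
```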

static write(h5_path, dataset_name, data, indices)

Write a numpy array into an h5 file.

Parameters

h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
data (np.ndarray) – 2D numpy array to be written into the h5 file.
indices (np.ndarray) – 1D numpy array that determines the row indices where the data is written.
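
The row-wise semantics of `write` and `clear` can be illustrated with a NumPy array in place of the h5 dataset. This is only a sketch of the indexing involved; it assumes the fill value is NaN, as stated in the notes for `create`.

```python
import numpy as np

dataset = np.full((4, 2), np.nan, dtype="f4")  # stand-in for an h5 dataset

# "write": paste a 2D array into the rows named by a 1D index array.
data = np.array([[1.0, 2.0], [3.0, 4.0]], dtype="f4")
indices = np.array([0, 2])
dataset[indices] = data
after_write = dataset.copy()

# "clear": reset selected rows to the fill value (assumed to be NaN here).
dataset[np.array([2])] = np.nan
```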

class JSONL

Bases: object

Collection of utility static functions handling jsonl files.

static dump_jsonl(data, path)

Save a list of dictionaries to a jsonl file.

Parameters

data (list) – List of dictionaries (one JSON object each).
path (pathlib.Path) – Path of the file to save to.

static load_jsonl(path)

Read jsonl into a list of dictionaries.

Parameters: path (pathlib.Path) – Path to the .jsonl file.
Returns: data – List of dictionaries.
Return type: list
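
The jsonl format stores one JSON object per line. The round trip below is a standard-library sketch of that format, not the library's implementation; the `*_sketch` function names, the sample records and the file name are made up for illustration.

```python
import json
import pathlib
import tempfile


def dump_jsonl_sketch(data, path):
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")


def load_jsonl_sketch(path):
    """Read one JSON object per line, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = pathlib.Path(tmp_dir) / "records.jsonl"
    dump_jsonl_sketch(records, jsonl_path)
    n_lines = len(jsonl_path.read_text(encoding="utf-8").splitlines())
    loaded = load_jsonl_sketch(jsonl_path)
```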

exception MissingEnvironmentVariable

Bases: Exception

Exception for missing environment variables.
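
This page does not say where the exception is raised. A common pattern, sketched here with a local stand-in class, a hypothetical helper `require_env` and made-up variable names, is to fail early when a required setting is absent.

```python
import os


class MissingEnvironmentVariable(Exception):
    """Local stand-in for the exception documented above."""


def require_env(name):
    """Return the value of an environment variable or raise (hypothetical helper)."""
    value = os.environ.get(name)
    if value is None:
        raise MissingEnvironmentVariable(f"Environment variable {name} is not set.")
    return value


os.environ["SKETCH_DB_URL"] = "localhost:1234"  # made-up variable for the demo
db_url = require_env("SKETCH_DB_URL")

os.environ.pop("SKETCH_VARIABLE_THAT_IS_NOT_SET", None)  # make sure it is absent
try:
    require_env("SKETCH_VARIABLE_THAT_IS_NOT_SET")
except MissingEnvironmentVariable as exc:
    error_message = str(exc)
```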

class Timer(verbose=False)

Bases: object

Convenience context manager for timing functions and logging the results.

The order of execution is __call__, __enter__ and __exit__.

Parameters: verbose (bool) – If True, the elapsed time is printed to standard output whenever a process ends.

inst_time

Time of instantiation.

Type: float

name

Name of the process to be timed. The user can control the value via the __call__ magic method.

Type: str or None

logs

Internal dictionary that stores all the times. The keys are the process names and the values are the elapsed times in seconds.

Type: dict

start_time

Time of the most recent __enter__ call. It is updated each time the context is entered.

Type: float or None

Examples

>>> import time
>>> from bluesearch.utils import Timer
>>>
>>> timer = Timer(verbose=False)
>>>
>>> with timer('experiment_1'):
...     time.sleep(0.05)
>>>
>>> with timer('experiment_2'):
...     time.sleep(0.02)
>>>
>>> assert set(timer.stats.keys()) == {'overall', 'experiment_1', 'experiment_2'}

property stats: Return all timing statistics.
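
The stated order of execution follows from how `with timer('experiment_1'):` is evaluated: the call expression runs first and records the name, then the `with` statement enters and exits the returned object. The class below is a minimal sketch of that pattern. `SketchTimer` and its `calls` list are hypothetical, and reporting `'overall'` as the time since instantiation is an assumption rather than documented behavior.

```python
import time


class SketchTimer:
    """Stand-in illustrating the __call__ / __enter__ / __exit__ sequence."""

    def __init__(self):
        self.inst_time = time.perf_counter()
        self.name = None
        self.logs = {}
        self.start_time = None
        self.calls = []  # records the order of the magic methods

    def __call__(self, name):
        self.calls.append("__call__")
        self.name = name
        return self  # the with statement then enters this same object

    def __enter__(self):
        self.calls.append("__enter__")
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.calls.append("__exit__")
        self.logs[self.name] = time.perf_counter() - self.start_time
        return False  # do not swallow exceptions

    @property
    def stats(self):
        # Assumption: "overall" is the time elapsed since instantiation.
        return {"overall": time.perf_counter() - self.inst_time, **self.logs}


timer = SketchTimer()
with timer("experiment_1"):
    time.sleep(0.01)
```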

check_entity_type_consistency(model_path: Union[str, pathlib.Path]) → bool

Check that the entity type in the model name is the same as the one in the NER pipe.

Parameters: model_path – Path to a spacy model directory.
Returns: True if the name of the model and the entity type detected by the model are consistent, False otherwise.
Return type: bool

get_available_spacy_models(data_and_models_dir: Union[str, pathlib.Path]) → Dict[str, pathlib.Path]

List available spacy models for a given data directory.

Parameters: data_and_models_dir – Path to the “data_and_models” directory. It should contain the models/ner_er and models/er directories with all spaCy models.
Returns: models_dict – Dictionary mapping each entity type to the path of the spaCy model that detects it. Only the models following the naming convention are kept.
Return type: Dict[str, pathlib.Path]

load_spacy_model(model_name: Union[str, pathlib.Path], *args: Any, **kwargs: Any) → spacy.language.Language

Load a spaCy model with an informative error message.

Parameters

model_name – spaCy pipeline to load. It can be a package name or a local path.
*args – Positional arguments passed to spacy.load()
**kwargs – Keyword arguments passed to spacy.load()

Returns

model – Loaded spaCy pipeline.

Return type

spacy.language.Language

Raises

ModuleNotFoundError – If spaCy model loading failed due to non-existent package or local file.
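
A hedged sketch of the error-translation idea. To keep the example runnable without spaCy installed, the loader is passed in as a parameter; in real use it would be `spacy.load`, which raises `OSError` when a package or path cannot be found. The name `load_with_hint`, the message wording and the fake loader are all hypothetical and do not reflect the library's actual code.

```python
def load_with_hint(model_name, loader, *args, **kwargs):
    """Call `loader` and turn its OSError into a more descriptive error."""
    try:
        return loader(model_name, *args, **kwargs)
    except OSError as err:
        raise ModuleNotFoundError(
            f"Could not load the model {model_name!r}. Check that the package "
            "is installed or that the local path exists."
        ) from err


def fake_loader(name, **kwargs):
    """Stand-in for spacy.load that never finds anything."""
    raise OSError(f"Can't find model {name!r}")


try:
    load_with_hint("model_that_does_not_exist", fake_loader)
except ModuleNotFoundError as exc:
    message = str(exc)
    original = exc.__cause__  # the OSError is kept for debugging
```

Chaining with `from err` keeps the original exception attached, so the underlying cause is still visible in the traceback.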