bluesearch.utils module
Generic Utils.
- class H5
Bases: object
H5 utilities.
- static clear(h5_path, dataset_name, indices)
Set selected rows to the fillvalue.
- Parameters
h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
indices (np.ndarray) – 1D array that determines the rows to be set to fillvalue.
- static concatenate(h5_path_output, dataset_name, h5_paths_temp, delete_inputs=True, batch_size=2000)
Concatenate multiple h5 files into one h5 file.
- Parameters
h5_path_output (pathlib.Path) – Path to the h5 file. Note that this file can already exist and contain other datasets.
dataset_name (str) – Name of the dataset.
h5_paths_temp (list) – Paths to the input h5 files. Note that each of them will have 2 datasets:
- {dataset_name} - dtype = float and shape (length, dim)
- {dataset_name}_indices - dtype = int and shape (length, 1)
delete_inputs (bool) – If True, then all input h5 files are deleted once the concatenation is done.
batch_size (int) – Batch size to be used for transfers from the input h5 to the final one.
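The two-dataset layout of the inputs can be illustrated without touching disk. This is a minimal sketch, not the library's implementation: NumPy arrays stand in for the h5 datasets, and the helper name `scatter_in_batches` is hypothetical. Each input pairs a `(length, dim)` float block with a `(length, 1)` index column, and rows are copied to the output in batches at the positions named by the index column.

```python
import numpy as np


def scatter_in_batches(output, inputs, batch_size=2000):
    """Copy rows of every (values, indices) pair into `output`.

    `values` has shape (length, dim) and `indices` has shape (length, 1),
    mirroring the two datasets each input file is documented to hold.
    Hypothetical helper; in-memory arrays replace the h5 files.
    """
    for values, indices in inputs:
        flat = indices[:, 0]  # (length, 1) -> (length,)
        for start in range(0, len(flat), batch_size):
            stop = start + batch_size
            output[flat[start:stop]] = values[start:stop]
    return output


# The output starts fully unpopulated (all NaN).
out = np.full((5, 2), np.nan, dtype="f4")
part_a = (np.ones((2, 2), dtype="f4"), np.array([[0], [3]]))
part_b = (np.full((1, 2), 7.0, dtype="f4"), np.array([[4]]))
scatter_in_batches(out, [part_a, part_b], batch_size=1)
```

Rows 0 and 3 come from the first input, row 4 from the second, and rows 1 and 2 stay NaN because no input names them.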
- static create(h5_path, dataset_name, shape, dtype='f4')
Create a dataset (and potentially also an h5 file).
- Parameters
h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
shape (tuple of int) – Two element tuple representing rows and columns.
dtype (str) – Dtype of the h5 array. See references for all the details.
Notes
Unpopulated rows will be filled with np.nan.
References
- static find_populated_rows(h5_path, dataset_name, batch_size=2000, verbose=False)
Identify rows that are populated (= not nan vectors).
- Parameters
h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
batch_size (int) – Number of rows to be loaded at a time.
verbose (bool) – Controls verbosity.
- Returns
pop_rows – 1D numpy array of ints representing row indices of populated rows (not nan).
- Return type
np.ndarray
- static find_unpopulated_rows(h5_path, dataset_name, batch_size=2000, verbose=False)
Return the indices of rows that are unpopulated.
- Parameters
h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
batch_size (int) – Number of rows to be loaded at a time.
verbose (bool) – Controls verbosity.
- Returns
unpop_rows – 1D numpy array of ints representing row indices of unpopulated rows (nan).
- Return type
np.ndarray
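The two row-finding methods are complementary, and the batched scan they describe can be sketched on an in-memory array. This is an illustration under stated assumptions rather than the library's code: the helper name `split_rows` is hypothetical, a NumPy array stands in for the h5 dataset, and a row is assumed to count as unpopulated when any of its entries is NaN (the documentation only says "nan vectors").

```python
import numpy as np


def split_rows(dataset, batch_size=2000):
    """Return (populated, unpopulated) row indices of a 2D float array."""
    populated, unpopulated = [], []
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # Assumption: a row is unpopulated if any of its entries is NaN.
        is_unpop = np.isnan(batch).any(axis=1)
        # Shift batch-local positions back to dataset-wide row indices.
        unpopulated.extend(np.where(is_unpop)[0] + start)
        populated.extend(np.where(~is_unpop)[0] + start)
    return np.array(populated, dtype=int), np.array(unpopulated, dtype=int)


data = np.full((5, 3), np.nan, dtype="f4")
data[[1, 4]] = 0.5  # populate two rows
pop_rows, unpop_rows = split_rows(data, batch_size=2)
```

Scanning in batches keeps only `batch_size` rows in memory at a time, which is presumably why both methods expose that parameter.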
- static get_shape(h5_path, dataset_name)
Get shape of a dataset.
- Parameters
h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
- static load(h5_path, dataset_name, batch_size=500, indices=None, verbose=False)
Load an h5 file in memory.
- Parameters
h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
batch_size (int) – Number of rows to be loaded at a time.
indices (None or np.ndarray) – If None, all rows of the dataset are loaded. If np.ndarray, only the rows at the selected indices are loaded.
verbose (bool) – Controls verbosity.
- Returns
res – Numpy array of shape (len(indices), …) holding the loaded rows.
- Return type
np.ndarray
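The batched gather described by these parameters can be sketched as follows. This is a minimal illustration, assuming an in-memory NumPy array in place of the h5 dataset; the function name `load_rows` is hypothetical and is not part of `bluesearch`.

```python
import numpy as np


def load_rows(dataset, batch_size=500, indices=None):
    """Gather rows from `dataset` in batches; all rows if `indices` is None."""
    if indices is None:
        indices = np.arange(len(dataset))
    # Result has shape (len(indices), ...) as in the documented return value.
    res = np.empty((len(indices), *dataset.shape[1:]), dtype=dataset.dtype)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        res[start:start + len(batch)] = dataset[batch]
    return res


data = np.arange(12, dtype="f4").reshape(6, 2)
subset = load_rows(data, batch_size=2, indices=np.array([5, 0, 3]))
everything = load_rows(data)
```

With an on-disk h5py dataset, index lists generally have to be in increasing order, so a real implementation would likely need to sort each batch before reading and restore the requested order afterwards.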
- static write(h5_path, dataset_name, data, indices)
Write a numpy array into an h5 file.
- Parameters
h5_path (pathlib.Path) – Path to the h5 file.
dataset_name (str) – Name of the dataset.
data (np.ndarray) – 2D numpy array to be written into the h5 file.
indices (np.ndarray) – 1D numpy array that determines the row indices where the data is written.
- class JSONL
Bases: object
Collection of utility static functions handling jsonl files.
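This page does not list the methods of the class, so the following is only a sketch of what handling the JSON Lines format typically involves: one JSON document per line. The function names `dump_jsonl` and `load_jsonl` are hypothetical and should not be read as the class's API.

```python
import json
import pathlib
import tempfile


def dump_jsonl(records, path):
    """Write one JSON document per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl(path):
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "records.jsonl"
    dump_jsonl(records, path)
    restored = load_jsonl(path)
```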
- exception MissingEnvironmentVariable
Bases: Exception
Exception for missing environment variables.
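A typical use of such an exception is to fail early with a clear message when a required variable is absent. The sketch below defines a local stand-in class so that it runs without `bluesearch`; the helper `require_env` and the variable names are hypothetical.

```python
import os


class MissingEnvironmentVariable(Exception):
    """Local stand-in for the documented exception."""


def require_env(name):
    """Return the value of `name`, or raise if it is not set."""
    try:
        return os.environ[name]
    except KeyError:
        raise MissingEnvironmentVariable(
            f"Environment variable {name!r} is not set."
        ) from None


os.environ["EXAMPLE_SET_VAR"] = "value"
os.environ.pop("EXAMPLE_UNSET_VAR", None)  # make sure it is absent

found = require_env("EXAMPLE_SET_VAR")
try:
    require_env("EXAMPLE_UNSET_VAR")
except MissingEnvironmentVariable as err:
    message = str(err)
```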
- class Timer(verbose=False)
Bases: object
Convenience context manager timing functions and logging the results.
The order of execution is __call__, __enter__ and __exit__.
- Parameters
verbose (bool) – If True, the elapsed time is printed to standard output whenever a process ends.
- inst_time
Time of instantiation.
- Type
float
- name
Name of the process to be timed. The user can control the value via the __call__ magic.
- Type
str or None
- logs
Internal dictionary that stores all the times. The keys are the process names and the values are number of seconds.
- Type
dict
- start_time
Time of the last enter. Is dynamically changed when entering.
- Type
float or None
Examples
>>> import time
>>> from bluesearch.utils import Timer
>>>
>>> timer = Timer(verbose=False)
>>>
>>> with timer('experiment_1'):
...     time.sleep(0.05)
>>>
>>> with timer('experiment_2'):
...     time.sleep(0.02)
>>>
>>> assert set(timer.stats.keys()) == {'overall', 'experiment_1', 'experiment_2'}
- property stats
Return all timing statistics.
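The documented order of execution (`__call__`, then `__enter__`, then `__exit__`) can be made concrete with a small re-creation. This is a sketch built only from the attributes listed on this page, not the library's source: the class name `SketchTimer` is hypothetical, and the meaning of the `'overall'` entry (time since instantiation) is an assumption.

```python
import time


class SketchTimer:
    """Illustrative re-creation of the documented call/enter/exit order."""

    def __init__(self, verbose=False):
        self.verbose = verbose
        self.inst_time = time.perf_counter()
        self.name = None
        self.logs = {}
        self.start_time = None

    def __call__(self, name):
        # Step 1: `timer('x')` stores the name and returns the timer itself.
        self.name = name
        return self

    def __enter__(self):
        # Step 2: the `with` statement records the start time.
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Step 3: leaving the block logs the elapsed seconds under the name.
        elapsed = time.perf_counter() - self.start_time
        self.logs[self.name] = elapsed
        if self.verbose:
            print(f"{self.name}: {elapsed:.4f} s")
        self.name = None
        self.start_time = None
        return False  # do not swallow exceptions raised in the block

    @property
    def stats(self):
        # Assumption: 'overall' is the time elapsed since instantiation.
        return {"overall": time.perf_counter() - self.inst_time, **self.logs}


timer = SketchTimer()
with timer("experiment_1"):
    time.sleep(0.01)
```

Returning `self` from `__call__` is what lets a single instance be reused across several `with` blocks while accumulating all results in one `logs` dictionary.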
- check_entity_type_consistency(model_path: Union[str, pathlib.Path]) -> bool
Check that the entity type in the model name matches the one in the NER pipe.
- Parameters
model_path – Path to a spacy model directory.
- Returns
True if the name of the model and the entity type detected by the model are consistent, False otherwise.
- Return type
bool
- get_available_spacy_models(data_and_models_dir: Union[str, pathlib.Path]) -> Dict[str, pathlib.Path]
List available spacy models for a given data directory.
- Parameters
data_and_models_dir – Path to the directory “data_and_models”. It should contain the models/ner_er and models/er directories with all spacy models.
- Returns
models_dict – Dictionary mapping the entity type to the path of the spacy model detecting it. Only the models following the naming convention are kept.
- Return type
Dict[str, pathlib.Path]
- load_spacy_model(model_name: Union[str, pathlib.Path], *args: Any, **kwargs: Any) -> spacy.language.Language
Load a spaCy model with an informative error message.
- Parameters
model_name – spaCy pipeline to load. It can be a package name or a local path.
*args – Positional arguments passed to spacy.load()
**kwargs – Keyword arguments passed to spacy.load()
- Returns
model – Loaded spaCy pipeline.
- Return type
spacy.language.Language
- Raises
ModuleNotFoundError – If spaCy model loading failed due to non-existent package or local file.
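The pattern of wrapping a loader to produce a more informative error can be sketched as below. This is not the library's implementation. It assumes the underlying loader signals a missing model with `OSError` (believed to be what `spacy.load` does, but worth verifying against the installed version), and both `load_with_hint` and `fake_loader` are hypothetical names used so that the example runs without spaCy installed.

```python
def load_with_hint(loader, model_name, *args, **kwargs):
    """Call `loader` and re-raise a failed lookup with a clearer message."""
    try:
        return loader(model_name, *args, **kwargs)
    except OSError as err:
        # Assumption: the loader reports a missing model via OSError.
        raise ModuleNotFoundError(
            f"Could not load the model {model_name!r}. Check that the "
            "package is installed or that the local path exists."
        ) from err


def fake_loader(model_name, *args, **kwargs):
    # Hypothetical stand-in for spacy.load that always fails.
    raise OSError(f"Can't find model '{model_name}'")


try:
    load_with_hint(fake_loader, "missing_model")
except ModuleNotFoundError as err:
    message = str(err)
    cause = err.__cause__
```

Chaining with `from err` keeps the original exception available as `__cause__`, so the low-level detail is not lost when the friendlier message is shown.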