Entry points

This section describes how to use the entry points for common operations.

Compute sentence embeddings

We will compute sentence embeddings:

  • with the model BioBERT NLI+STS CORD-19 v1

  • for CORD-19 version 47

  • using 4 GPUs

The same instructions apply to other models, other CORD-19 versions, and other GPU configurations. To run on a CPU, omit the --gpus parameter in all commands below.

Launch a Docker container with CUDA support and access to 4 GPUs:

docker run \
  -it \
  --rm \
  --volume <local_path>:<container_path> \
  --user 'root' \
  --gpus '"device=0,1,2,3"' \
  --name 'embedding_computation' \
  bbs_base

Note that we use the --volume parameter to mount all local paths that should be accessible from inside the container, for example the output directory for the embeddings file and the path to the embedding model checkpoint. The --volume parameter can be repeated to mount several paths.
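For instance, assuming the embeddings should be written under /data/embeddings and the model checkpoint lives under /models/biobert (both paths are hypothetical and only serve as an illustration), the container could be launched with two mounts:

docker run \
  -it \
  --rm \
  --volume /data/embeddings:/data/embeddings \
  --volume /models/biobert:/models/biobert \
  --user 'root' \
  --gpus '"device=0,1,2,3"' \
  --name 'embedding_computation' \
  bbs_base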

All following commands are executed in this interactive container.
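Inside the container, you can check that the four requested GPUs are visible (assuming the image ships the standard NVIDIA utilities):

nvidia-smi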

Upgrade pip:

python -m pip install --upgrade pip

Install Blue Brain Search:

pip install bluesearch
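To confirm that the installation succeeded and the entry points are on the PATH, you can print the help of the command used below:

compute_embeddings --help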

Define the path to the output HDF5 file with the embeddings:

export EMBEDDINGS=<some_path>/embeddings.h5

Different embedding datasets can be written to the same HDF5 file. If the file specified in EMBEDDINGS already exists and a new embedding dataset is being added, consider creating a backup copy first:

cp  "$EMBEDDINGS" "${EMBEDDINGS}.backup"

Launch the parallel computation of the embeddings:

compute_embeddings SentTransformer "$EMBEDDINGS" \
  --checkpoint '<biobert_model_checkpoint_path>' \
  --db-url <mysql_host>:<mysql_port>/<mysql_database> \
  --gpus '0,1,2,3' \
  --h5-dataset-name 'BioBERT NLI+STS CORD-19 v1' \
  --n-processes 4 \
  --temp-dir .
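Once the computation has finished, the datasets stored in the HDF5 file can be listed, for example with h5py (this sketch assumes the h5py package is available in the environment):

python -c 'import h5py, os; f = h5py.File(os.environ["EMBEDDINGS"], "r"); print(list(f.keys())); f.close()'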

Create the MySQL database

Launch an interactive Docker container:

docker run \
  -it \
  --rm \
  --volume <local_path>:<container_path> \
  --user 'root' \
  --name 'database_creation' \
  bbs_base

Note that we use the --volume parameter to mount all local paths that should be accessible from inside the container, for example the directory with the CORD-19 data (see below).

All following commands are executed in this interactive container.

Upgrade pip:

python -m pip install --upgrade pip

Install Blue Brain Search:

pip install bluesearch
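As before, you can verify the installation by printing the help of the entry point used below:

create_database --help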

Launch the creation of the database:

create_database --data-path <data_path>

The --data-path parameter should point to the directory containing the original CORD-19 data, which can be obtained from Kaggle.
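As an illustration, assuming the Kaggle CLI is installed and configured with an API token, the data could be downloaded and unpacked as follows (the dataset slug is our assumption of the current name on Kaggle and may change):

pip install kaggle
kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge
unzip CORD-19-research-challenge.zip -d <data_path>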