Entry points¶
This section describes how to use the entry points for common operations.
Compute sentence embeddings¶
We will compute sentence embeddings:
- with the model BioBERT NLI+STS CORD-19 v1
- for CORD-19 version 47
- using 4 GPUs
The same instructions apply to other models, other CORD-19 versions, and
other GPU configurations. To run on a CPU, simply omit the --gpus
parameter everywhere.
Launch a Docker container with CUDA support and access to 4 GPUs:
docker run \
-it \
--rm \
--volume <local_path>:<container_path> \
--user 'root' \
--gpus '"device=0,1,2,3"' \
--name 'embedding_computation' \
bbs_base
Note that we use the --volume parameter to mount all local paths that should be
accessible from the container, for example the output directory for the embedding
file or the path to the embedding model checkpoint.
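For illustration, mounting an output directory and a model checkpoint could look as follows; the paths are purely hypothetical, and each --volume flag takes the form <local_path>:<container_path>:

```shell
--volume /home/me/embeddings:/raid/embeddings \
--volume /home/me/biobert_checkpoint:/raid/biobert_checkpoint \
```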
All following commands are executed in this interactive container.
Upgrade pip:
python -m pip install --upgrade pip
Install Blue Brain Search:
pip install bluesearch
Define the path to the output HDF5 file with the embeddings:
export EMBEDDINGS=<some_path>/embeddings.h5
It is possible to write several embedding datasets to the same HDF5 file. If the file
specified in EMBEDDINGS already exists and a new embedding dataset is being added,
consider creating a backup copy first:
cp "$EMBEDDINGS" "${EMBEDDINGS}.backup"
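As a small sketch, the backup step can be guarded so that it only runs when the file already exists; the file name below is a stand-in for illustration:

```shell
EMBEDDINGS=embeddings.h5    # hypothetical path for illustration
touch "$EMBEDDINGS"         # stand-in for an existing embeddings file

# Only create a backup when a previous embeddings file is present.
if [ -f "$EMBEDDINGS" ]; then
    cp "$EMBEDDINGS" "${EMBEDDINGS}.backup"
fi
```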
Launch the parallel computation of the embeddings:
compute_embeddings SentTransformer "$EMBEDDINGS" \
--checkpoint '<biobert_model_checkpoint_path>' \
--db-url <mysql_host>:<mysql_port>/<mysql_database> \
--gpus '0,1,2,3' \
--h5-dataset-name 'BioBERT NLI+STS CORD-19 v1' \
--n-processes 4 \
--temp-dir .
Create the MySQL database¶
Launch an interactive Docker container:
docker run \
-it \
--rm \
--volume <local_path>:<container_path> \
--user 'root' \
--name 'database_creation' \
bbs_base
Note that we use the --volume parameter to mount all local paths that should be
accessible from the container, for example the directory with the CORD-19 data
(see below).
All following commands are executed in this interactive container.
Upgrade pip:
python -m pip install --upgrade pip
Install Blue Brain Search:
pip install bluesearch
Launch the creation of the database:
create_database --data-path <data_path>
The parameter --data-path should point to the directory with the original CORD-19
data, which can be obtained from Kaggle.
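Before launching create_database, it can be worth sanity-checking that the directory looks like a CORD-19 release; for instance, a release normally ships a top-level metadata.csv. The directory name below is hypothetical, and the file is only a stand-in for the real one from the Kaggle download:

```shell
DATA_PATH=cord19_data               # hypothetical data directory
mkdir -p "$DATA_PATH"
touch "$DATA_PATH/metadata.csv"     # stand-in for the real metadata file

# Fail early if the metadata file is missing.
if [ -f "$DATA_PATH/metadata.csv" ]; then
    echo "metadata.csv found"
else
    echo "metadata.csv missing" >&2
    exit 1
fi
```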