Changelog¶

This page contains the changelog for each released version of Blue Brain Search.

Legend¶

Add denotes new features.
Fix denotes bug fixes.
Change denotes functionality changes.
Deprecate denotes deprecated features that will be removed in the future.
Remove denotes removed features.

Latest¶

Version 0.2.0¶

July 1, 2021

Add metrics files resulting from dvc pipelines to git. This makes it possible to use dvc metrics diff.
Change dependencies required to run the code in data_and_models/ are no longer installed by default and now require pip install .[data_and_models].
Add scripts in the dvc ner pipelines to train and evaluate NER models with the huggingface/transformers package. A comparison with spaCy training is also possible.
Change reports format of Search Widget from PDF to HTML.
Remove tqdm, joblib, pdfkit dependencies.
Remove bluesearch.mining.eval.plot_ner_confusion_matrix function to drop joblib from install_requires.
Change requirements.txt refactored into three separate lists of dependencies: requirements.txt, requirements-dev.txt, requirements-data_and_models.txt.
Fix bugs (related to nested entities) in ner_report, ner_errors, ner_confusion_matrix functions from bluesearch.mining.eval submodule.
Add utility function _check_consistent_iob inside bluesearch.mining.eval.
Change upgrade linting tools in tox.ini.
Change NER models to use Transformer-based spaCy pipelines instead of Tok2Vec-based scispaCy pipelines.
Change NER models to handle one entity type per model instead of several entity types per model.
Change pipelines/ner/dvc.yaml to simplify and harmonize the definition of the pipeline for training NER models.
Add annotations/ner/analyze.py, a script to evaluate the quality of annotations. It can generate: 1) a detailed report for individual files when used as a script and 2) a summary table for several files when used as a function.
Add pipelines/ner/clean.py, a script to clean annotations. It keeps only valid texts, normalizes labels, keeps only a given label, and then renames the label if necessary.
Remove ee_models_library.csv and change the logic for one model per entity type.
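
The cleaning steps listed above for pipelines/ner/clean.py (keep valid texts, normalize labels, keep one label, optionally rename it) can be illustrated with a minimal sketch. This is not the actual script: the function name clean_annotations, the record layout (a "text" string plus a "spans" list with "label" keys), the definition of a valid text as a non-blank string, and normalization as stripping and upper-casing are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def clean_annotations(
    records: Iterable[dict],
    keep_label: str,
    rename_to: Optional[str] = None,
) -> list:
    """Hypothetical sketch of the four cleaning steps; not the real clean.py."""
    target = keep_label.strip().upper()
    cleaned = []
    for record in records:
        # Step 1 (assumed meaning of "valid"): the text is a non-blank string.
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            continue
        spans = []
        for span in record.get("spans", []):
            # Step 2 (assumed normalization): strip whitespace and upper-case.
            label = span["label"].strip().upper()
            # Step 3: keep only the requested label.
            if label != target:
                continue
            # Step 4: rename the label if a new name was given.
            spans.append({**span, "label": rename_to or label})
        cleaned.append({"text": text, "spans": spans})
    return cleaned
```

Calling clean_annotations(records, keep_label="CHEMICAL", rename_to="DRUG") would, under these assumptions, drop blank texts and return only the CHEMICAL spans, relabelled as DRUG.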

Version 0.1.2¶

Change spaCy version from 2.x to 3.x, including scispaCy and models versions.
Change the training of NER models: use spaCy directly instead of Prodigy, use the default configuration from spaCy 3 instead of from Prodigy, use the binary format (.spacy) from spaCy 3 instead of the .jsonl format from Prodigy.
Remove Prodigy dependency.

Version 0.1.1¶

Change upgrade to dvc 2.0.
Remove NLTK dependencies.
Change drop the dedicated SBioBERT class; this model is now supported through the SentTransformer interface.

Version 0.1.0¶

Add installation of requirements.txt in the Dockerfile of the dvc pipelines to fix the versions of dependencies.
Add support for Python 3.9.
Add Blue Brain Search as a Zenodo record. This provides a unique DOI, a DOI for each published release, and automatic preservation outside GitHub.
Add the content of the DVC remote for Blue Brain Search v0.1.0 as a Zenodo record. This provides DOIs in the same way as for the code of Blue Brain Search above. This is also the first public release of the data and models of Blue Brain Search.
Remove support for Python 3.6.
Remove the external dependency sent2vec and the embedding models depending on it, i.e. BSV and Sent2VecModel.
Remove the embedding model Universal Sentence Encoder (USE) and its dependencies (tensorflow and tensorflow-hub).
Remove BBS_BBG_poc notebook (now hosted on https://github.com/BlueBrain/Search-Graph-Examples) and assets/ directory.

Version 0.0.10¶

Changes¶

Change bluesearch is the new name of the Python package, replacing the former bbsearch.
Change The code is now hosted on GitHub under BlueBrain/Search, eliminating the redundancy of the former BlueBrain/BlueBrainSearch.
Add in README the purpose of Blue Brain Search.
Add in README the common usage of the two widgets (search and mining).
Add in README a complete and step-by-step Getting Started.
Add type checking for third-party libraries (NumPy, Pandas, SQLAlchemy).
Add BioBERT NLI+STS CORD-19 v1 to DVC evaluation pipeline.

Version 0.0.9¶

December 11, 2020

Changes¶

Add saving and loading of the results from the literature search and mining widgets.
Add mining for more than 1,000 articles.
Add BioBERT NLI+STS CORD-19 v1 training scripts and data.
Add CORD-19 version 65 database, embeddings, and entities.
Add tests for all entry points.
Add security checks with bandit.
Fix NER false positive for abstract.
Fix refactoring issue in get_embedding_model.
Change the name of the bluesearch.entrypoints module and the names inside it.
Change how the NER entry points retrieve models: now DVC is used.
Change warnings raised when generating the documentation into errors.
Remove scibert from setup.py and requirements.txt.

Version 0.0.8¶

November 24, 2020

Changes¶

Add column is_bad in table sentences for quality filtering (too long, too short, LaTeX code).
Add embedding model BioBERT NLI+STS CORD-19 v1.
Change embedding_models.get_embedding_model() to support any model class and checkpoint path without having to modify the source code of BBS.
Fix bug in hyperlinks of SearchWidget. We now take the first URL if there are several, and add Google search if there is none.
Change widgets UIs with tabs to improve usability.
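
The hyperlink fix described above (take the first URL if there are several, fall back to a Google search if there is none) can be sketched as follows. This is an illustration, not the SearchWidget code: the function name article_link, the assumption that several URLs are stored in one string separated by semicolons, and the use of the article title as the search query are all hypothetical.

```python
from urllib.parse import quote_plus


def article_link(urls: str, title: str) -> str:
    """Hypothetical sketch: pick a hyperlink for an article."""
    # Assumed storage format: several URLs in one string, separated by ";".
    candidates = [u.strip() for u in (urls or "").split(";") if u.strip()]
    if candidates:
        # Several URLs: take the first one.
        return candidates[0]
    # No URL: fall back to a Google search (here, on the title).
    return "https://www.google.com/search?q=" + quote_plus(title)
```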

Version 0.0.7¶

November 16, 2020

Changes¶

Add parallelization of embedding computations.
Change “Saved Articles” summary in the Search Widget.
Fix undesired timeout of MySQL connection in the Search Server.

Version 0.0.6¶

November 3, 2020

Changes¶

Add inter-rater agreement with DVC.
Add Advanced Features section in the Search Widget.
Change mining schema logic.
Change code formatting - run black on everything.

Version 0.0.5¶

October 26, 2020

Changes¶

Change bluesearch.mining.eval.spacy2df can now work with NER pipelines including entity rulers.

Version 0.0.4¶

October 20, 2020

Changes¶

Add language detection with langdetect, which allows filtering out articles that are not in English or have no useful content.
Add a notice in the widgets informing the user of the CORD-19 version being used.
Add bluesearch.utils.JSONL for easy interaction with JSONL files.
Add bluesearch.entity.PatternCreator and other functionalities to perform rule-based named entity recognition.
Change module names.
Change in bluesearch.embedding_models, the SBERT class is now replaced by a more general-purpose SentTransformer which can wrap any object from sentence_transformers.SentenceTransformer.
Add bluesearch.embedding_models.SklearnVectorizer, a new class that can be used to wrap any sklearn vectorizer object (TfidfVectorizer, CountVectorizer, HashingVectorizer).
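
As an illustration of the kind of JSONL interaction mentioned above, here is a minimal standard-library sketch. It is not the bluesearch.utils.JSONL implementation: the function names dump_jsonl and load_jsonl and their signatures are assumptions. The only format fact relied on is that a JSONL file holds one JSON document per line.

```python
import json
from pathlib import Path


def dump_jsonl(records, path):
    """Write each record as one JSON document per line (hypothetical helper)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl(path):
    """Read a JSONL file back into a list, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With these helpers, a list of dictionaries written by dump_jsonl is expected to round-trip unchanged through load_jsonl.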

Version 0.0.3¶

October 2, 2020

This is the first beta release of Blue Brain Search.
Previous releases were highly experimental and should be considered alpha releases.

Changes¶

Change CORD-19 database version, upgrading from v35 to v47.
Add button to Literature Search widget to let user choose whether to retrieve top N articles or top N sentences.
Fix bug in database creation where auto-increment was triggered even if insertion failed.
Add automatic creation of a FULLTEXT INDEX on sentences.text when the table is first created, just after data insertion.
Add annotations for NER with DVC.
Add pipelines to train and evaluate NER models with DVC.
Add Sent2VecModel class and option in Literature Search widget to select sent2vec to run the search.
Add Docker ecosystem with .env files and docker-compose.
Change search servers by merging RemoteSearcher and LocalSearcher into the new SearchEngine.
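
The choice between retrieving the top N articles or the top N sentences, mentioned above for the Literature Search widget, can be illustrated with a small sketch. This is not the widget's implementation: the function name top_results, the (sentence_id, article_id, similarity) tuple layout, and the rule of representing each article by its best-scoring sentence are assumptions made for illustration.

```python
def top_results(scored, n, granularity="sentences"):
    """Hypothetical sketch of top-N retrieval at two granularities.

    scored: iterable of (sentence_id, article_id, similarity) tuples.
    """
    # Rank all sentences from most to least similar to the query.
    ranked = sorted(scored, key=lambda row: row[2], reverse=True)
    if granularity == "sentences":
        return ranked[:n]
    if granularity == "articles":
        seen = set()
        results = []
        for row in ranked:
            article_id = row[1]
            if article_id in seen:
                continue  # article already represented by a better sentence
            seen.add(article_id)
            results.append(row)
            if len(results) == n:
                break
        return results
    raise ValueError(f"unknown granularity: {granularity!r}")
```

Under these assumptions, the sentence mode may return several sentences from the same article, while the article mode returns at most one sentence per article.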