Changelog

This page contains the changelogs for the released versions of Blue Brain Search.

Legend

  • Add denotes new features.

  • Fix denotes bug fixes.

  • Change denotes functionality changes.

  • Deprecate denotes deprecated features that will be removed in the future.

  • Remove denotes removed features.

Latest

Version 0.2.0

July 1, 2021

  • Add metrics file resulting from dvc pipelines to git. This now allows using dvc metrics diff.

  • Change the dependencies required to run the code in data_and_models/ are no longer installed by default and now require pip install .[data_and_models].

  • Add scripts to the dvc ner pipelines to train and evaluate NER models with the huggingface/transformers package. A comparison with spaCy training is also possible.

  • Change the report format of the Search Widget from PDF to HTML.

  • Remove tqdm, joblib, pdfkit dependencies.

  • Remove bluesearch.mining.eval.plot_ner_confusion_matrix function to drop joblib from install_requires.

  • Change requirements.txt is refactored into three separate lists of dependencies: requirements.txt, requirements-dev.txt, requirements-data_and_models.txt.

  • Fix bugs (related to nested entities) in the ner_report, ner_errors, and ner_confusion_matrix functions of the bluesearch.mining.eval submodule.

  • Add utility function _check_consistent_iob inside bluesearch.mining.eval.

  • Change tox.ini to upgrade the linting tools.

  • Change NER models to use Transformer-based spaCy pipelines instead of Tok2Vec-based scispaCy pipelines.

  • Change NER models to handle one entity type per model instead of several entity types per model.

  • Change pipelines/ner/dvc.yaml to simplify and harmonize the definition of the pipeline for training NER models.

  • Add annotations/ner/analyze.py, code to evaluate the data quality of annotations. It can generate 1) a detailed report for individual files when used as a script and 2) a summary table for several files when used as a function.

  • Add pipelines/ner/clean.py, a script to clean annotations. It keeps only valid texts, normalizes labels, keeps only a given label, and then renames the label if necessary.

  • Remove ee_models_library.csv and change the logic to one model per entity type.
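
The four cleaning steps listed for pipelines/ner/clean.py above can be illustrated with a minimal sketch. This is not the actual script: the function name clean_annotations, the record layout (a "text" string plus a list of "spans" with a "label" key), the definition of a valid text (a non-empty string), and the normalization rule (trim and upper-case) are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def clean_annotations(
    examples: List[Dict[str, Any]],
    keep_label: str,
    rename_to: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical re-creation of the cleaning steps, not the real clean.py."""
    cleaned = []
    for example in examples:
        text = example.get("text")
        # Step 1: keep only valid texts (assumed here: non-empty strings).
        if not isinstance(text, str) or not text.strip():
            continue
        spans = []
        for span in example.get("spans", []):
            # Step 2: normalize labels (assumed here: trim and upper-case).
            label = span["label"].strip().upper()
            # Step 3: keep only the requested label.
            if label != keep_label:
                continue
            # Step 4: rename the label if a new name was given.
            spans.append({**span, "label": rename_to or label})
        cleaned.append({"text": text, "spans": spans})
    return cleaned
```

Keeping a single label per cleaned file matches the move to one model per entity type described in this release.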

Version 0.1.2

  • Change spaCy version from 2.x to 3.x, including scispaCy and models versions.

  • Change the training of NER models: use spaCy directly instead of Prodigy, use the default configuration from spaCy 3 instead of from Prodigy, use the binary format (.spacy) from spaCy 3 instead of the .jsonl format from Prodigy.

  • Remove Prodigy dependency.

Version 0.1.1

  • Change dvc version, upgrading to 2.0.

  • Remove NLTK dependencies.

  • Change the dedicated SBioBERT class is dropped; the SentTransformer interface is now used to support this model.

Version 0.1.0

  • Add installation of requirements.txt in the Dockerfile of the dvc pipelines to pin the versions of dependencies.

  • Add support for Python 3.9.

  • Add Blue Brain Search as a Zenodo record. This provides a unique DOI, a DOI for each published release, and automatic preservation outside GitHub.

  • Add the content of the DVC remote for Blue Brain Search v0.1.0 as a Zenodo record. This provides DOIs in the same way as for the code of Blue Brain Search above. This is also the first public release of the data and models of Blue Brain Search.

  • Remove support for Python 3.6.

  • Remove the external dependency sent2vec and the embedding models depending on it, i.e. BSV and Sent2VecModel.

  • Remove the embedding model Universal Sentence Encoder (USE) and its dependencies (tensorflow and tensorflow-hub).

  • Remove BBS_BBG_poc notebook (now hosted on https://github.com/BlueBrain/Search-Graph-Examples) and assets/ directory.

Version 0.0.10

Changes

  • Change bluesearch is the new name of the Python package, replacing the former bbsearch.

  • Change the code is now hosted on GitHub under BlueBrain/Search, eliminating the redundancy of the former BlueBrain/BlueBrainSearch.

  • Add in README the purpose of Blue Brain Search.

  • Add in README the common usage of the two widgets (search and mining).

  • Add in README a complete and step-by-step Getting Started.

  • Add type checking for third-party libraries (NumPy, Pandas, SQLAlchemy).

  • Add BioBERT NLI+STS CORD-19 v1 to DVC evaluation pipeline.

Version 0.0.9

December 11, 2020

Changes

  • Add saving and loading of the results from the literature search and mining widgets.

  • Add mining for more than 1,000 articles.

  • Add BioBERT NLI+STS CORD-19 v1 training scripts and data.

  • Add CORD-19 version 65 database, embeddings, and entities.

  • Add tests for all entry points.

  • Add security checks with bandit.

  • Fix NER false positive for abstract.

  • Fix refactoring issue in get_embedding_model.

  • Change the name of the bluesearch.entrypoints module and the names inside it.

  • Change how the NER entry points retrieve models: now DVC is used.

  • Change warnings when generating the documentation into errors.

  • Remove scibert from setup.py and requirements.txt.

Version 0.0.8

November 24, 2020

Changes

  • Add column is_bad in table sentences for quality filtering (too long, too short, LaTeX code).

  • Add embedding model BioBERT NLI+STS CORD-19 v1.

  • Change embedding_models.get_embedding_model() to support any model class and checkpoint path without having to modify the source code of BBS.

  • Fix bug in hyperlinks of SearchWidget. We now take the first URL if there are several, and add Google search if there is none.

  • Change widgets UIs with tabs to improve usability.
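
The hyperlink fix described above (take the first URL if there are several, fall back to a Google search if there is none) can be sketched as follows. This is an illustration under assumptions, not the SearchWidget code: the function name article_link, the ";" separator between URLs, and the use of the article title as the search query are all hypothetical.

```python
from typing import Optional
from urllib.parse import quote_plus


def article_link(url_field: Optional[str], title: str) -> str:
    """Hypothetical helper choosing the hyperlink shown for an article."""
    # Assumption: several URLs are stored in one string, separated by ";".
    urls = [u.strip() for u in (url_field or "").split(";") if u.strip()]
    if urls:
        # Several URLs available: take the first one.
        return urls[0]
    # No URL available: fall back to a Google search for the title.
    return "https://www.google.com/search?q=" + quote_plus(title)
```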

Version 0.0.7

November 16, 2020

Changes

  • Add parallelization of embedding computations.

  • Change “Saved Articles” summary in the Search Widget.

  • Fix undesired timeout of MySQL connection in the Search Server.

Version 0.0.6

November 3, 2020

Changes

  • Add inter-rater agreement with DVC.

  • Add Advanced Features section in the Search Widget.

  • Change mining schema logic.

  • Change code formatting by running black on everything.

Version 0.0.5

October 26, 2020

Changes

  • Change bluesearch.mining.eval.spacy2df can now work with NER pipelines including entity rulers.

Version 0.0.4

October 20, 2020

Changes

  • Add language detection with langdetect, which allows filtering out articles that are not in English or have no useful content.

  • Add information in the widgets about the CORD-19 version being used.

  • Add bluesearch.utils.JSONL for easy interaction with JSONL files.

  • Add bluesearch.entity.PatternCreator and other functionalities to perform rule-based named entity recognition.

  • Change module names.

  • Change in bluesearch.embedding_models, the SBERT class is replaced by the more general-purpose SentTransformer, which can wrap any object from sentence_transformers.SentenceTransformer.

  • Add bluesearch.embedding_models.SklearnVectorizer, a new class that can be used to wrap any sklearn vectorizer object (TfidfVectorizer, CountVectorizer, HashingVectorizer).
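
For readers unfamiliar with the JSONL format handled by bluesearch.utils.JSONL, the sketch below shows the idea: one JSON object per line. It uses only the standard library and does not reproduce the actual bluesearch.utils.JSONL interface; the function names dump_jsonl and load_jsonl are chosen for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, List


def dump_jsonl(records: Iterable[Dict[str, Any]], path: Path) -> None:
    """Write one JSON object per line (illustrative helper)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record
            # always occupies exactly one line of the file.
            f.write(json.dumps(record) + "\n")


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because every line is an independent JSON document, such files can be appended to and processed line by line without loading everything at once.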

Version 0.0.3

October 2, 2020

  • This is the first beta release of Blue Brain Search.

  • Previous releases were highly experimental and should be considered alpha releases.

Changes

  • Change CORD-19 database version, upgrading from v35 to v47.

  • Add a button to the Literature Search widget to let the user choose whether to retrieve the top N articles or the top N sentences.

  • Fix bug in database creation where auto-increment was triggered even if insertion failed.

  • Add automatic creation of a FULLTEXT INDEX on sentences.text when the table is first created, just after data insertion.

  • Add annotations for NER with DVC.

  • Add pipelines to train and evaluate NER models with DVC.

  • Add the Sent2VecModel class and an option in the Literature Search widget to select sent2vec for running the search.

  • Add Docker ecosystem with .env files and docker-compose.

  • Change search servers by merging RemoteSearcher and LocalSearcher into the new SearchEngine.