Changelog

This page contains the changelog for the released versions of Blue Brain Search.

Legend

Add denotes new features.
Fix denotes bug fixes.
Change denotes functionality changes.
Deprecate denotes deprecated features that will be removed in the future.
Remove denotes removed features.

Latest

Change the file extension used to read PubMed articles from unzipped .xml to .xml.gz in bbs_database topic-extract and bbs_database parse entrypoints
Add the possibility to read .xml and .meca file extensions in JATSXMLParser to parse bioRxiv and medRxiv articles
Add the --mesh-topic-db option to bbs_database topic-extract
Add the bbs_database parse-mesh-rdf command
Add the bluesearch.database.mesh module
Change how the UID of an article is computed: by hashing the identifiers if they exist, otherwise by hashing the article contents.
Add entrypoint bbs_database topic-filter.
Add the bluesearch.database.topic_rule.TopicRule class
Add the bluesearch.database.topic_info.TopicInfo class
Add the bluesearch.database.article.ArticleSource enum class
Add extraction of journal and article topics for arxiv papers through CLI command bbs_database topic-extract arxiv.
Add extraction of journal and article topics for pubmed papers through CLI command bbs_database topic-extract pubmed.
Add extraction of journal topics for pmc papers through CLI command bbs_database topic-extract pmc.
Change paper UID to also take into account the arXiv ID (when available).
Change UID generation to raise ValueError if all identifiers are None.
Add code to download arxiv papers from a given date.
Change the behaviour of the entrypoint bbs_database download when the specified --from-month predates a change in how the source structures its stored articles. Now print an error and exit.
Add code to download PMC papers from a given date.
Add entrypoint bbs_database download.
Add a run of the tox environment check-apidoc in CI
Add tox environments apidoc and check-apidoc
Add input type tei-xml for the bbs_database parse command.
Add option --dry-run for bbs_database parse to display files to parse without parsing them.
Add option --recursive for bbs_database parse to parse files recursively.
Add option --match-filename for bbs_database parse to parse only files with a name matching a given regular expression.
Change the CI job by splitting it into smaller jobs
Change the value of input_type for bbs_database parse from pmc-xml to jats-xml.
Change the name of PMCXMLParser to JATSXMLParser.
Add article parser for TEI XML files.
Add CLI subcommand bbs_database convert-pdf.
Add parsing of PDFs through a GROBID server.
Add default value None for optional fields of Article.
Add loading of metadata and abstracts from PubMed.
Fix parsing in PubMed metadata of authors with a <CollectiveName> instead of a <LastName>.
Add an ArticleParser for metadata and abstracts from PubMed.
Change the behaviour of bbs_database add when no article was loaded from the given path. Now, stop with a RuntimeWarning and don’t load the NLP model to get sentences (fail faster).
Change the behaviour of bbs_database add when no sentence was extracted from the given path. Now, stop with a RuntimeWarning.
Change serialization of processed articles from Pickle to JSON format.
Add command line entrypoints bbs_database init, bbs_database parse, and bbs_database add to initialize a literature database, parse, and integrate articles.
Add search for topics at journal and article levels in the topic module.
Add PMCXMLParser to parse PubMed articles in XML JATS format.
Fix DVC pipeline named sentence_embedding regarding missing deps elements and mixed models origin.
Fix the incorrect maximum input length to the transformer model used as backbone for the NER models.
Add BioBERT NLI+STS CORD-19 v1 building script as a DVC pipeline.
Fix the incorrect maximum input length to the transformer model used as backbone for the sentence embedding model BioBERT NLI+STS CORD-19 v1.
Add deterministic generation of paper UIDs based on paper identifiers.
Change relative imports into absolute ones.
Add the tables articles and sentences for bbs_database init and bbs_database add.
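
The UID entries above describe hashing the identifiers when they exist and falling back to the article contents otherwise. A minimal sketch of that idea follows. The name `compute_uid`, the choice of SHA-256, and the JSON encoding of the identifiers are assumptions made for illustration and are not taken from the bluesearch source.

```python
import hashlib
import json
from typing import Optional, Sequence


def compute_uid(identifiers: Sequence[Optional[str]], contents: str) -> str:
    """Return a deterministic UID (hypothetical helper, not the bluesearch API)."""
    if any(identifier is not None for identifier in identifiers):
        # Encode the identifiers as a JSON list so that their positions matter:
        # ("a", None) and (None, "a") must not produce the same UID.
        payload = "ids:" + json.dumps(list(identifiers))
    else:
        # No identifier is available, so fall back to the article contents.
        payload = "txt:" + contents
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this sketch, two articles sharing the same identifiers get the same UID regardless of their text, while articles without identifiers are told apart by their contents only.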

Version 0.2.0

July 1, 2021

Add metrics files resulting from DVC pipelines to git. This now allows using dvc metrics diff.
Change the dependencies required to run the code of data_and_models/ so that they are no longer installed by default. They now require pip install .[data_and_models].
Add, in the DVC ner pipelines, scripts to train and evaluate NER models with the huggingface/transformers package. A comparison with spaCy training is also possible.
Change reports format of Search Widget from PDF to HTML.
Remove tqdm, joblib, pdfkit dependencies.
Remove bluesearch.mining.eval.plot_ner_confusion_matrix function to drop joblib from install_requires.
Change requirements.txt by splitting it into three separate lists of dependencies: requirements.txt, requirements-dev.txt, requirements-data_and_models.txt.
Fix bugs (related to nested entities) in ner_report, ner_errors, ner_confusion_matrix functions from bluesearch.mining.eval submodule.
Add utility function _check_consistent_iob inside bluesearch.mining.eval.
Change the linting tools in tox.ini to upgraded versions
Change to Transformer-based spaCy pipelines for NER models instead of Tok2Vec-based scispaCy pipelines.
Change to one entity per NER model instead of several entities per model.
Change pipelines/ner/dvc.yaml to simplify and harmonize the definition of the pipeline for training NER models.
Add annotations/ner/analyze.py, a script to evaluate the data quality of annotations. It can generate: 1) a detailed report for individual files when used as a script and 2) a summary table for several files when used as a function.
Add pipelines/ner/clean.py, a script to clean annotations. It keeps only valid texts, normalizes labels, keeps only a given label, and then renames the label if necessary.
Remove ee_models_library.csv and change the logic to one model per entity type.
Add ArticleParser abstract class representing a generic interface for parsing articles.
Add CORD19ArticleParser to parse CORD-19 articles in JSON format.
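
The changelog does not say what the _check_consistent_iob utility listed above verifies. One common reading of IOB consistency is that every `I-X` tag must directly follow a `B-X` or `I-X` tag of the same entity type. The sketch below implements only that reading. The function name `is_consistent_iob` and its signature are hypothetical and may differ from the actual helper in bluesearch.mining.eval.

```python
from typing import Iterable


def is_consistent_iob(tags: Iterable[str]) -> bool:
    """Check one notion of IOB consistency (illustrative, not the bluesearch code)."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            # An inside tag must continue an entity of the same type.
            if previous not in (f"B-{entity}", f"I-{entity}"):
                return False
        elif tag != "O" and not tag.startswith("B-"):
            # Anything other than O, B-*, or I-* is not a valid IOB tag.
            return False
        previous = tag
    return True
```

For example, `["O", "B-GENE", "I-GENE"]` passes this check, while `["O", "I-GENE"]` does not because the inside tag has no preceding begin tag.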

Version 0.1.2

Change spaCy version from 2.x to 3.x, including scispaCy and models versions.
Change the training of NER models: use spaCy directly instead of Prodigy, use the default configuration from spaCy 3 instead of from Prodigy, use the binary format (.spacy) from spaCy 3 instead of the .jsonl format from Prodigy.
Remove Prodigy dependency.

Version 0.1.1

Change the DVC version to 2.0.
Remove NLTK dependencies.
Change the way the SBioBERT model is supported: the dedicated SBioBERT class is dropped and the SentTransformer interface is now used instead.

Version 0.1.0

Add installation of requirements.txt in the Dockerfile of the DVC pipelines to pin the versions of dependencies.
Add support for Python 3.9.
Add Blue Brain Search as a Zenodo record. This provides a unique DOI, a DOI for each published release, and automatic preservation outside GitHub.
Add the content of the DVC remote for Blue Brain Search v0.1.0 as a Zenodo record. This provides DOIs in the same way as for the code of Blue Brain Search above. This is also the first public release of the data and models of Blue Brain Search.
Remove support for Python 3.6.
Remove the external dependency sent2vec and the embedding models depending on it, i.e. BSV and Sent2VecModel.
Remove the embedding model Universal Sentence Encoder (USE) and its dependencies (tensorflow and tensorflow-hub).
Remove BBS_BBG_poc notebook (now hosted on https://github.com/BlueBrain/Search-Graph-Examples) and assets/ directory.

Version 0.0.10

Changes

Change the name of the Python package to bluesearch, replacing the former bbsearch.
Change the hosting of the code to GitHub under BlueBrain/Search, eliminating the redundancy of the former BlueBrain/BlueBrainSearch.
Add in README the purpose of Blue Brain Search.
Add in README the common usage of the two widgets (search and mining).
Add in README a complete and step-by-step Getting Started.
Add type checking for third-party libraries (NumPy, Pandas, SQLAlchemy).
Add BioBERT NLI+STS CORD-19 v1 to DVC evaluation pipeline.

Version 0.0.9

December 11, 2020

Changes

Add saving and loading of the results from the literature search and mining widgets.
Add mining for more than 1,000 articles.
Add BioBERT NLI+STS CORD-19 v1 training scripts and data.
Add CORD-19 version 65 database, embeddings, and entities.
Add tests for all entry points.
Add security checks with bandit.
Fix NER false positive for abstract.
Fix refactoring issue in get_embedding_model.
Change naming of and inside the bluesearch.entrypoints module.
Change how the NER entry points retrieve models: now DVC is used.
Change warnings when generating the documentation into errors.
Remove scibert from setup.py and requirements.txt.

Version 0.0.8

November 24, 2020

Changes

Add column is_bad in table sentences for quality filtering (too long, too short, LaTeX code).
Add embedding model BioBERT NLI+STS CORD-19 v1.
Change embedding_models.get_embedding_model() to support any model class and checkpoint path without having to modify the source code of BBS.
Fix bug in hyperlinks of SearchWidget. We now take the first URL if there are several, and add Google search if there is none.
Change widgets UIs with tabs to improve usability.
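
The is_bad column listed above flags sentences that are too long, too short, or contain LaTeX code. The sketch below shows one way such a flag could be computed. The thresholds `MIN_CHARS` and `MAX_CHARS`, the regular expression, and the name `is_bad_sentence` are all assumptions for illustration, since the changelog does not state the actual criteria.

```python
import re

# Hypothetical thresholds; the values used by Blue Brain Search are not stated here.
MIN_CHARS = 20
MAX_CHARS = 2000

# Heuristic: a backslash command followed by an opening brace, e.g. \frac{ or \begin{.
LATEX_PATTERN = re.compile(r"\\[a-zA-Z]+\{")


def is_bad_sentence(text: str) -> bool:
    """Return True if the sentence should be excluded by the quality filter."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS or len(stripped) > MAX_CHARS:
        return True
    return LATEX_PATTERN.search(stripped) is not None
```

Storing the result as a boolean column keeps the original sentences in the table, so the filter can be tightened or relaxed later without re-extracting the text.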

Version 0.0.7

November 16, 2020

Changes

Add parallelization of embedding computations.
Change “Saved Articles” summary in the Search Widget.
Fix undesired timeout of MySQL connection in the Search Server.

Version 0.0.6

November 3, 2020

Changes

Add inter-rater agreement with DVC.
Add Advanced Features section in the Search Widget.
Change mining schema logic.
Change code formatting - run black on everything.

Version 0.0.5

October 26, 2020

Changes

Change bluesearch.mining.eval.spacy2df to work with NER pipelines that include entity rulers.

Version 0.0.4

October 20, 2020

Changes

Add language detection with langdetect, to filter out articles that are not in English or have no useful content.
Add a notice in the widgets informing the user of the CORD-19 version being used.
Add bluesearch.utils.JSONL for easy interaction with JSONL files.
Add bluesearch.entity.PatternCreator and other functionalities to perform rule-based named entity recognition.
Change module names
Change in bluesearch.embedding_models, SBERT class is now replaced by a more general-purpose SentTransformer which can wrap any object from sentence_transformers.SentenceTransformer.
Add bluesearch.embedding_models.SklearnVectorizer, a new class that can wrap any sklearn vectorizer object (TfidfVectorizer, CountVectorizer, HashingVectorizer).
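
For readers unfamiliar with the JSONL format handled by the bluesearch.utils.JSONL entry above: a JSONL file stores one JSON document per line. The sketch below reads and writes such files with the standard library only. The function names `dump_jsonl` and `load_jsonl` are hypothetical and are not claimed to match the interface of bluesearch.utils.JSONL.

```python
import json
from pathlib import Path
from typing import Any, Iterable, List


def dump_jsonl(records: Iterable[Any], path: Path) -> None:
    """Write each record as one JSON document per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines, so each record stays on a single line.
            f.write(json.dumps(record) + "\n")


def load_jsonl(path: Path) -> List[Any]:
    """Read a JSONL file back into a list, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because every line is an independent JSON document, annotation files in this format can be appended to or processed line by line without loading the whole file.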

Version 0.0.3

October 2, 2020

This is the first beta release of Blue Brain Search.
Previous releases were highly experimental and should be considered alpha releases.

Changes

Change CORD-19 database version, upgrading from v35 to v47.
Add button to Literature Search widget to let user choose whether to retrieve top N articles or top N sentences.
Fix bug in database creation where auto-increment was triggered even if insertion failed.
Add automatic creation of a FULLTEXT INDEX on sentences.text when the table is first created, just after data insertion.
Add annotations for NER with DVC.
Add pipelines to train and evaluate NER models with DVC.
Add Sent2VecModel class and option in Literature Search widget to select sent2vec to run the search.
Add Docker ecosystem with .env files and docker-compose.
Change search servers by merging RemoteSearcher and LocalSearcher into the new SearchEngine.