Changelog¶
This page contains the changelogs for the released versions of Blue Brain Search.
Legend¶
Add denotes new features.
Fix denotes bug fixes.
Change denotes functionality changes.
Deprecate denotes deprecated features that will be removed in the future.
Remove denotes removed features.
Latest¶
Change the file extension used to read PubMed articles from unzipped .xml to .xml.gz in bbs_database topic-extract and bbs_database parse entrypoints.
Add possibility to read .xml and .meca file extensions in JATSXMLParser to parse bioRxiv and medRxiv articles.
Add the --mesh-topic-db option to bbs_database topic-extract.
Add the bbs_database parse-mesh-rdf command.
Add the bluesearch.database.mesh module.
Change the UID of an article is computed by hashing the identifiers if those exist, otherwise by hashing the article contents.
Add entrypoint bbs_database topic-filter.
Add the bluesearch.database.topic_rule.TopicRule class.
Add the bluesearch.database.topic_info.TopicInfo class.
Add the bluesearch.database.article.ArticleSource enum class.
Add extraction of journal and article topics for arxiv papers through CLI command bbs_database topic-extract arxiv.
Add extraction of journal and article topics for pubmed papers through CLI command bbs_database topic-extract pubmed.
Add extraction of journal topics for pmc papers through CLI command bbs_database topic-extract pmc.
Change Paper UID to take into account also the arxiv id (when available).
Change uid generation raises ValueError if all identifiers are None.
Add code to download arxiv papers from a given date.
Change the behaviour of the entrypoint bbs_database download when the specified --from-month is too old and the source changed its structure of storing articles meanwhile. Now print an error and exit.
Add code to download PMC papers from a given date.
Add entrypoint bbs_database download.
Add run the tox env check-apidoc in CI.
Add tox environments apidoc and check-apidoc.
Add input type tei-xml for the bbs_database parse command.
Add option --dry-run for bbs_database parse to display files to parse without parsing them.
Add option --recursive for bbs_database parse to parse files recursively.
Add option --match-filename for bbs_database parse to parse only files with a name matching a given regular expression.
Change split the CI job into smaller jobs.
Change for bbs_database parse the value for input_type from pmc-xml to jats-xml.
Change name for PMCXMLParser to JATSXMLParser.
Add article parser for TEI XML files.
Add CLI subcommand bbs_database convert-pdf.
Add parsing of PDFs through a GROBID server.
Add default value None for optional fields of Article.
Add loading of metadata and abstracts from PubMed.
Fix parsing in PubMed metadata of authors with a <CollectiveName> instead of a <LastName>.
Add an ArticleParser for metadata and abstracts from PubMed.
Change the behaviour of bbs_database add when no article was loaded from the given path. Now, stop with a RuntimeWarning and don’t load the NLP model to get sentences (fail faster).
Change the behaviour of bbs_database add when no sentence was extracted from the given path. Now, stop with a RuntimeWarning.
Change serialization of processed articles from Pickle to JSON format.
Add command line entrypoints bbs_database init, bbs_database parse, and bbs_database add to initialize a literature database, parse, and integrate articles.
Add research of topic at journal and article levels in topic module.
Add PMCXMLParser to parse PubMed articles in XML JATS format.
Fix DVC pipeline named sentence_embedding regarding missing deps elements and mixed models origin.
Fix the incorrect maximum input length to the transformer model used as backbone for the NER models.
Add BioBERT NLI+STS CORD-19 v1 building script as a DVC pipeline.
Fix the incorrect maximum input length to the transformer model used as backbone for the sentence embedding model BioBERT NLI+STS CORD-19 v1.
Add deterministic generation of paper UIDs based on paper identifiers.
Change relative imports into absolute ones.
Add the tables articles and sentences for bbs_database init and bbs_database add.
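The UID entries in this section (deterministic UIDs from paper identifiers, a ValueError when all identifiers are None, and a fallback to hashing the article contents) can be illustrated with a minimal sketch. The function names uid_from_identifiers and article_uid, the choice of SHA-256, and the way identifiers are serialized are all assumptions made for illustration; they are not taken from the bluesearch code base and the actual implementation may differ.

```python
import hashlib
from typing import Optional, Sequence


def uid_from_identifiers(identifiers: Sequence[Optional[str]]) -> str:
    """Hypothetical helper: hash a sequence of identifiers into a hex UID."""
    if all(identifier is None for identifier in identifiers):
        raise ValueError("Identifiers cannot all be None.")
    # repr() keeps None distinct from the string "None" and preserves order,
    # so the same identifiers always produce the same UID.
    data = repr(tuple(identifiers)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def article_uid(identifiers: Sequence[Optional[str]], contents: str) -> str:
    """Hypothetical helper: prefer identifiers, fall back to the contents."""
    try:
        return uid_from_identifiers(identifiers)
    except ValueError:
        # No identifier is available: hash the article contents instead.
        return hashlib.sha256(contents.encode("utf-8")).hexdigest()
```

With this sketch, two articles sharing the same identifiers get the same UID regardless of their text, while articles without any identifier are distinguished by their contents.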
Version 0.2.0¶
July 1, 2021
Add metrics file resulting from dvc pipelines to git. This now allows using dvc metrics diff.
Change dependencies required to run the code of data_and_models/ are not installed by default and now require pip install .[data_and_models].
Add in dvc, in ner pipelines, scripts allowing to train and evaluate NER thanks to the huggingface/transformers package. A comparison with spaCy training is also possible.
Change reports format of Search Widget from PDF to HTML.
Remove tqdm, joblib, pdfkit dependencies.
Remove bluesearch.mining.eval.plot_ner_confusion_matrix function to drop joblib from install_requires.
Change requirements.txt refactored into three separate lists of dependencies: requirements.txt, requirements-dev.txt, requirements-data_and_models.txt.
Fix bugs (related to nested entities) in ner_report, ner_errors, ner_confusion_matrix functions from bluesearch.mining.eval submodule.
Add utility function _check_consistent_iob inside bluesearch.mining.eval.
Change upgrade linting tools in tox.ini.
Change for Transformer-based spaCy pipelines for NER models instead of Tok2Vec-based scispaCy pipelines.
Change for one entity per model instead of several entities per NER model.
Change pipelines/ner/dvc.yaml to simplify and harmonize the definition of the pipeline for training NER models.
Add annotations/ner/analyze.py, a code to evaluate the data quality of annotations. It could generate: 1) a detailed report for individual files when used as a script and 2) a summary table for several files when used as a function.
Add pipelines/ner/clean.py, a script to clean annotations. It keeps only valid texts, normalizes labels, keeps only a given label, and then renames the label if necessary.
Remove ee_models_library.csv and change the logic for one model per entity type.
Add ArticleParser abstract class representing a generic interface for parsing articles.
Add CORD19ArticleParser to parse CORD-19 articles in JSON format.
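As an illustration of what an IOB consistency check such as the _check_consistent_iob utility listed above might verify, here is a minimal sketch. The function name is_consistent_iob, its signature, and the exact rule it applies are assumptions; the changelog does not describe the actual behaviour of the utility in bluesearch.mining.eval.

```python
from typing import Sequence


def is_consistent_iob(tags: Sequence[str]) -> bool:
    """Hypothetical check: every I-<label> must continue the same entity.

    A tag "I-X" is accepted only if the previous tag is "B-X" or "I-X".
    """
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            label = tag[2:]
            if previous not in (f"B-{label}", f"I-{label}"):
                return False
        previous = tag
    return True
```

Under this assumed rule, a sequence such as O, B-GENE, I-GENE is accepted, while an I-GENE that follows O or B-DISEASE is rejected.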
Version 0.1.2¶
Change spaCy version from 2.x to 3.x, including scispaCy and models versions.
Change the training of NER models: use spaCy directly instead of Prodigy, use the default configuration from spaCy 3 instead of from Prodigy, use the binary format (.spacy) from spaCy 3 instead of the .jsonl format from Prodigy.
Remove Prodigy dependency.
Version 0.1.1¶
Change Upgrade to dvc 2.0.
Remove NLTK dependencies.
Change Drop the dedicated SBioBERT class, we now use SentTransformer interface to support this model.
Version 0.1.0¶
Add in dvc pipelines, the Dockerfile now installs requirements.txt to fix the versions of dependencies.
Add support for Python 3.9.
Add Blue Brain Search as a Zenodo record. This provides a unique DOI, a DOI for each published release, and automatic preservation outside GitHub.
Add the content of the DVC remote for Blue Brain Search v0.1.0 as a Zenodo record. This provides DOIs as for the code of Blue Brain Search above. This is also the first public release of the data and models of Blue Brain Search.
Remove support for Python 3.6.
Remove the external dependency sent2vec and the embedding models depending on it, i.e. BSV and Sent2VecModel.
Remove the embedding model Universal Sentence Encoder (USE) and its dependencies (tensorflow and tensorflow-hub).
Remove BBS_BBG_poc notebook (now hosted on https://github.com/BlueBrain/Search-Graph-Examples) and assets/ directory.
Version 0.0.10¶
Changes¶
Change bluesearch is the new name of the Python package, replacing the former bbsearch.
Change The code is now hosted on GitHub under BlueBrain/Search, eliminating the redundancy of the former BlueBrain/BlueBrainSearch.
Add in README the purpose of Blue Brain Search.
Add in README the common usage of the two widgets (search and mining).
Add in README a complete and step-by-step Getting Started.
Add type checking for third-party libraries (NumPy, Pandas, SQLAlchemy).
Add BioBERT NLI+STS CORD-19 v1 to DVC evaluation pipeline.
Version 0.0.9¶
December 11, 2020
Changes¶
Add saving and loading of the results from the literature search and mining widgets.
Add mining for more than 1,000 articles.
Add BioBERT NLI+STS CORD-19 v1 training scripts and data.
Add CORD-19 version 65 database, embeddings, and entities.
Add tests for all entry points.
Add security checks with bandit.
Fix NER false positive for abstract.
Fix refactoring issue in get_embedding_model.
Change naming of and inside the bluesearch.entrypoints module.
Change how the NER entry points retrieve models: now DVC is used.
Change warnings when generating the documentation into errors.
Remove scibert from setup.py and requirements.txt.
Version 0.0.8¶
November 24, 2020
Changes¶
Add column is_bad in table sentences for quality filtering (too long, too short, LaTeX code).
Add embedding model BioBERT NLI+STS CORD-19 v1.
Change embedding_models.get_embedding_model() to support any model class and checkpoint path without having to modify the source code of BBS.
Fix bug in hyperlinks of SearchWidget. We now take the first URL if there are several, and add Google search if there is none.
Change widgets UIs with tabs to improve usability.
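The quality filter behind the is_bad column in this release (too long, too short, LaTeX code) could look roughly like the sketch below. The thresholds, the regular expression, and the function name are hypothetical and chosen only for illustration; the changelog does not state the actual criteria used by Blue Brain Search.

```python
import re

# Hypothetical thresholds and pattern, chosen only for illustration.
MIN_CHARS = 20
MAX_CHARS = 2000
# Matches a LaTeX command followed by "{" or inline math between "$" signs.
LATEX_PATTERN = re.compile(r"\\[a-zA-Z]+\{|\$[^$]+\$")


def is_bad(sentence: str) -> bool:
    """Flag sentences that are too short, too long, or contain LaTeX code."""
    text = sentence.strip()
    if len(text) < MIN_CHARS or len(text) > MAX_CHARS:
        return True
    return LATEX_PATTERN.search(text) is not None
```

Storing the result as a column, rather than dropping the sentences, keeps the raw data intact and lets each query decide whether to apply the filter.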
Version 0.0.7¶
November 16, 2020
Changes¶
Add parallelization of embedding computations.
Change “Saved Articles” summary in the Search Widget.
Fix undesired timeout of MySQL connection in the Search Server.
Version 0.0.6¶
November 3, 2020
Changes¶
Add inter-rater agreement with DVC.
Add Advanced Features section in the Search Widget.
Change mining schema logic.
Change code formatting - run black on everything.
Version 0.0.5¶
October 26, 2020
Changes¶
Change bluesearch.mining.eval.spacy2df can now work with NER pipelines including entity rulers.
Version 0.0.4¶
October 20, 2020
Changes¶
Add language detection with langdetect, allowing to filter out articles that are not in English or have no useful content.
Add widgets inform the user on the CORD-19 version being used.
Add bluesearch.utils.JSONL for easy interaction with JSONL files.
Add bluesearch.entity.PatternCreator and other functionalities to perform rule-based named entity recognition.
Change module names
Change in bluesearch.embedding_models, SBERT class is now replaced by a more general-purpose SentTransformer which can wrap any object from sentence_transformers.SentenceTransformer.
Add bluesearch.embedding_models.SklearnVectorizer is a new class that can be used to wrap any sklearn vectorizer object (TfidfVectorizer, CountVectorizer, HashingVectorizer).
Version 0.0.3¶
October 2, 2020
This is the first beta release of Blue Brain Search.
Previous releases were highly experimental and should be considered as being in alpha phase.
Changes¶
Change CORD19 database version, upgrading from v35 to v47.
Add button to Literature Search widget to let user choose whether to retrieve top N articles or top N sentences.
Fix bug in database creation where auto-increment was triggered even if insertion failed.
Add automatic creation of a FULLTEXT INDEX on sentences.text when the table is first created, just after data insertion.
Add annotations for NER with DVC.
Add pipelines to train and evaluate NER models with DVC.
Add Sent2VecModel class and option in Literature Search widget to select sent2vec to run the search.
Add Docker ecosystem with .env files and docker-compose.
Change search servers by merging RemoteSearcher and LocalSearcher into the new SearchEngine.