bluesearch.database.article module

Abstraction of scientific article data and related tools.

class Article(title: str, authors: Sequence[str], abstract: Sequence[str], section_paragraphs: Sequence[Tuple[str, str]], pubmed_id: Optional[str] = None, pmc_id: Optional[str] = None, arxiv_id: Optional[str] = None, doi: Optional[str] = None, uid: Optional[str] = None)

Bases: mashumaro.mixins.json.DataClassJSONMixin

Abstraction of a scientific article and its contents.

abstract: Sequence[str]
arxiv_id: Optional[str] = None
authors: Sequence[str]
doi: Optional[str] = None
classmethod from_dict(d)
iter_paragraphs(with_abstract: bool = False) → Generator[tuple[str, str], None, None]

Iterate over all paragraphs in the article.

Parameters

with_abstract (bool) – If True, the abstract paragraphs are yielded first, before the section paragraphs.

Yields
  • str – Section title of the section the paragraph is in.

  • str – The paragraph text.
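
The iteration order described above can be sketched as follows. This is a minimal illustration, not the library code: `MiniArticle` is a hypothetical stand-in for `Article`, and the `"Abstract"` section title attached to abstract paragraphs is an assumption that this page does not confirm.

```python
from dataclasses import dataclass
from typing import Generator, Sequence, Tuple


@dataclass
class MiniArticle:
    """Hypothetical, trimmed-down stand-in for Article."""

    title: str
    abstract: Sequence[str]
    section_paragraphs: Sequence[Tuple[str, str]]

    def iter_paragraphs(
        self, with_abstract: bool = False
    ) -> Generator[Tuple[str, str], None, None]:
        if with_abstract:
            # Assumption: abstract paragraphs are labelled "Abstract".
            for text in self.abstract:
                yield "Abstract", text
        # Body paragraphs follow in their stored order.
        yield from self.section_paragraphs


article = MiniArticle(
    title="A title",
    abstract=["Abstract paragraph."],
    section_paragraphs=[("Introduction", "First body paragraph.")],
)
paragraphs = list(article.iter_paragraphs(with_abstract=True))
```

Each yielded item is a `(section title, paragraph text)` pair, so callers can unpack it directly in a `for` loop.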

classmethod parse(parser: bluesearch.database.article.ArticleParser) → bluesearch.database.article.Article

Parse an article through a parser.

Parameters

parser – An article parser instance.
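
The relationship between a parser and the article built from it might look like the sketch below. All class names here (`MiniParser`, `DictParser`, `ParsedArticle`) are hypothetical and exist only for illustration; the real `ArticleParser` exposes more properties, and the real `Article.parse` may work differently.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List, Tuple


class MiniParser(ABC):
    """Hypothetical stand-in for ArticleParser with four abstract properties."""

    @property
    @abstractmethod
    def title(self) -> str: ...

    @property
    @abstractmethod
    def authors(self) -> Iterable[str]: ...

    @property
    @abstractmethod
    def abstract(self) -> Iterable[str]: ...

    @property
    @abstractmethod
    def paragraphs(self) -> Iterable[Tuple[str, str]]: ...


class DictParser(MiniParser):
    """Hypothetical concrete parser that reads from a plain dict."""

    def __init__(self, data: dict) -> None:
        self.data = data

    @property
    def title(self) -> str:
        return self.data["title"]

    @property
    def authors(self) -> Iterable[str]:
        yield from self.data["authors"]

    @property
    def abstract(self) -> Iterable[str]:
        return self.data["abstract"]

    @property
    def paragraphs(self) -> Iterable[Tuple[str, str]]:
        yield from self.data["paragraphs"]


@dataclass
class ParsedArticle:
    """Hypothetical stand-in for Article."""

    title: str
    authors: List[str]
    abstract: List[str]
    section_paragraphs: List[Tuple[str, str]]

    @classmethod
    def parse(cls, parser: MiniParser) -> "ParsedArticle":
        # Materialize lazy iterables so the article owns concrete lists.
        return cls(
            title=parser.title,
            authors=list(parser.authors),
            abstract=list(parser.abstract),
            section_paragraphs=list(parser.paragraphs),
        )


parser = DictParser(
    {
        "title": "A title",
        "authors": ["A. Author", "B. Author"],
        "abstract": ["Abstract paragraph."],
        "paragraphs": [("Introduction", "Body paragraph.")],
    }
)
article = ParsedArticle.parse(parser)
```

Keeping the parsers behind one abstract interface means the article class needs no knowledge of the input format.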

pmc_id: Optional[str] = None
pubmed_id: Optional[str] = None
section_paragraphs: Sequence[Tuple[str, str]]
title: str
to_dict()
uid: Optional[str] = None
class ArticleParser

Bases: abc.ABC

An abstract base class for article parsers.

abstract property abstract: Iterable[str]

Get a sequence of paragraphs in the article abstract.

Returns

The paragraphs of the article abstract.

Return type

iterable of str

property arxiv_id: str | None

Get arXiv ID.

Returns

arXiv ID if specified, otherwise None.

Return type

str or None

abstract property authors: Iterable[str]

Get all author names.

Returns

All authors.

Return type

iterable of str

property doi: str | None

Get DOI.

Returns

DOI if specified, otherwise None.

Return type

str or None

static get_uid_from_identifiers(identifiers: tuple[str | None, ...]) → str

Generate a deterministic UID from a tuple of paper identifiers.

Papers with the same values for the given identifiers get the same UID.

A missing identifier must be passed as None, which counts as a value in its own right. For example, the identifiers (a, None) and (a, b) yield two different UIDs.

Parameters

identifiers – Values of the identifiers.

Returns

A deterministic UID computed from the identifiers.

Return type

str

Raises

ValueError – If all identifiers are None.
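
The contract above (same identifiers give the same UID, None is a distinct value, all-None is an error) can be illustrated with a short sketch. The function name `uid_from_identifiers`, the use of SHA-256, and the `repr`-based serialization are all assumptions made for this example; the hash actually used by the library is not stated on this page.

```python
import hashlib
from typing import Optional, Tuple


def uid_from_identifiers(identifiers: Tuple[Optional[str], ...]) -> str:
    """Hypothetical re-implementation; hash and encoding are assumptions."""
    if all(value is None for value in identifiers):
        raise ValueError("At least one identifier must not be None.")
    # repr() keeps None distinct from any string, so (a, None) != (a, b).
    data = repr(identifiers).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


first = uid_from_identifiers(("10.1000/example", None))
second = uid_from_identifiers(("10.1000/example", "PMC0000001"))
```

Because the input is serialized as a whole tuple, the position of each identifier matters as well as its value.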

abstract property paragraphs: Iterable[tuple[str, str]]

Get all paragraphs and titles of sections they are part of.

Returns

For each paragraph a tuple with two strings is returned. The first is the section title, the second the paragraph content.

Return type

iterable of (str, str)

property pmc_id: str | None

Get PMC ID.

Returns

PMC ID if specified, otherwise None.

Return type

str or None

property pubmed_id: str | None

Get PubMed ID.

Returns

PubMed ID if specified, otherwise None.

Return type

str or None

abstract property title: str

Get the article title.

Returns

The article title.

Return type

str

property uid: str

Generate a deterministic UID for an article.

The UID is usually created by hashing the identifiers of the article. If no identifier is available, then the unique ID is computed by hashing the whole content of the article.

Returns

A deterministic UID.

Return type

str
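
The fallback from identifiers to content can be sketched as below. The helper `article_uid`, its parameters, and the SHA-256 hash are hypothetical choices for illustration; which parts of the content the library actually hashes is not specified here.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def article_uid(
    identifiers: Tuple[Optional[str], ...],
    title: str,
    paragraphs: Iterable[Tuple[str, str]],
) -> str:
    """Hypothetical helper: hash the identifiers, or the content if none is set."""
    if any(value is not None for value in identifiers):
        payload = repr(identifiers)
    else:
        # Fallback: no identifier is known, so hash the whole content.
        payload = repr((title, list(paragraphs)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


body_a = [("Introduction", "Text A.")]
body_b = [("Introduction", "Text B.")]
with_doi_a = article_uid(("10.1000/example",), "Title", body_a)
with_doi_b = article_uid(("10.1000/example",), "Title", body_b)
no_id_a = article_uid((None,), "Title", body_a)
no_id_b = article_uid((None,), "Title", body_b)
```

In this sketch two articles sharing a DOI receive the same UID regardless of content, while articles without identifiers are told apart by their text.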

class ArticleSource(value)

Bases: enum.Enum

The source of an article.

ARXIV = 'arxiv'
BIORXIV = 'biorxiv'
MEDRXIV = 'medrxiv'
PMC = 'pmc'
PUBMED = 'pubmed'
UNKNOWN = 'unknown'
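
As a quick illustration of how such an enumeration behaves, the snippet below re-declares the members locally (so that it runs without the package installed) and looks members up by value and by name using standard `enum.Enum` semantics.

```python
from enum import Enum


class ArticleSource(Enum):
    """Local copy of the members listed above, for a standalone example."""

    ARXIV = "arxiv"
    BIORXIV = "biorxiv"
    MEDRXIV = "medrxiv"
    PMC = "pmc"
    PUBMED = "pubmed"
    UNKNOWN = "unknown"


by_value = ArticleSource("pmc")      # lookup by value
by_name = ArticleSource["ARXIV"]     # lookup by member name
all_values = [member.value for member in ArticleSource]
```

Looking up a value that is not listed, such as `ArticleSource("scopus")`, raises `ValueError`.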
class CORD19ArticleParser(json_file: dict)

Bases: bluesearch.database.article.ArticleParser

Parser for CORD-19 JSON files.

Parameters

json_file – The contents of a JSON file from the CORD-19 database.

property abstract: list[str]

Get a sequence of paragraphs in the article abstract.

Returns

The paragraphs of the article abstract.

Return type

list of str

property authors: Generator[str, None, None]

Get all author names.

Yields

str – Every author.

property paragraphs: Generator[tuple[str, str], None, None]

Get all paragraphs and titles of sections they are part of.

Yields
  • str – The section title.

  • str – The paragraph content.
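
A rough sketch of how (section title, paragraph) pairs could be pulled from such a dictionary is shown below. The record is made up, and the key names (`body_text`, `section`, `text`) follow the publicly documented CORD-19 JSON layout as an assumption; the function `iter_body` is hypothetical and is not the parser's actual code.

```python
from typing import Generator, Tuple

# Made-up record; the key names are assumed from the public CORD-19 layout.
record = {
    "metadata": {"title": "A title"},
    "abstract": [{"section": "Abstract", "text": "Abstract text."}],
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Methods", "text": "Second paragraph."},
    ],
}


def iter_body(json_file: dict) -> Generator[Tuple[str, str], None, None]:
    """Hypothetical helper yielding (section title, paragraph text) pairs."""
    for entry in json_file.get("body_text", []):
        yield entry["section"], entry["text"]


body = list(iter_body(record))
```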

property pmc_id: str | None

Get PMC ID.

Returns

PMC ID if specified, otherwise None.

Return type

str or None

property title: str

Get the article title.

Returns

The article title.

Return type

str

class JATSXMLParser(xml_stream: IO)

Bases: bluesearch.database.article.ArticleParser

Parser for JATS XML files.

This parser can be used for articles from PubMed Central, bioRxiv, and medRxiv.

Parameters

xml_stream – The XML stream of the article.

property abstract: Generator[str, None, None]

Get a sequence of paragraphs in the article abstract.

Yields

str – The paragraphs of the article abstract.

property authors: Generator[str, None, None]

Get all author names.

Yields

str – Every author, in the format “Given_Name(s) Surname”.

property doi: str | None

Get DOI.

Returns

DOI if specified, otherwise None.

Return type

str or None

classmethod from_string(xml_string: str) → bluesearch.database.article.JATSXMLParser

Read an XML string and instantiate a JATSXMLParser.

Parameters

xml_string – Raw XML content of the article.

Returns

Parser containing the article content.

Return type

JATSXMLParser
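
Since the constructor takes a stream while this method takes a string, one plausible way to bridge the two is to wrap the string in an in-memory stream. The sketch below shows that idea with the standard library only; `root_from_string` is a hypothetical helper, and the real method may be implemented differently.

```python
import io
import xml.etree.ElementTree as ET


def root_from_string(xml_string: str) -> ET.Element:
    """Hypothetical helper: expose a string as a stream, then parse it."""
    with io.StringIO(xml_string) as stream:
        return ET.parse(stream).getroot()


root = root_from_string("<article><front><title>T</title></front></article>")
title = root.findtext("front/title")
```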

classmethod from_xml(path: str | Path) → JATSXMLParser

Read an XML file and instantiate a JATSXMLParser.

Parameters

path – Path to the article (with .xml extension).

Returns

Parser containing the article content.

Return type

JATSXMLParser

classmethod from_zip(path: str | Path) → JATSXMLParser

Read an XML file from a zipped .meca folder and instantiate a JATSXMLParser.

Parameters

path – Path to the article (with .meca extension).

Returns

Parser containing the article content.

Return type

JATSXMLParser
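
Reading an XML member out of a zip archive can be sketched with the standard `zipfile` module. The archive layout used here (the article XML stored under `content/`) is an assumption made for the example, and `first_xml_in_zip` is a hypothetical helper rather than the library's implementation.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def first_xml_in_zip(archive) -> ET.Element:
    """Hypothetical helper: parse the first content/*.xml member of a zip."""
    with zipfile.ZipFile(archive) as zf:
        # Assumption: the article XML lives under "content/" in the archive.
        names = [
            name
            for name in zf.namelist()
            if name.startswith("content/") and name.endswith(".xml")
        ]
        if not names:
            raise ValueError("No XML file found under content/.")
        with zf.open(names[0]) as stream:
            return ET.parse(stream).getroot()


# Build a tiny in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("manifest.xml", "<manifest/>")
    zf.writestr("content/123.xml", "<article><front/></article>")
buffer.seek(0)
root = first_xml_in_zip(buffer)
```

`zipfile.ZipFile` accepts either a path or a file-like object, so the same helper works on a `.meca` file on disk.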

get_ids() → dict[str, str]

Get all specified IDs of the paper.

Returns

ids – Dictionary whose keys are ID types and whose values are the corresponding ID values.

Return type

dict
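
In JATS, article identifiers are carried by `article-id` elements whose `pub-id-type` attribute names the ID type. The sketch below collects them into a dictionary of the shape described above; `collect_ids` is a hypothetical helper, the ID values are made up, and the exact lookup path used by the library is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Made-up JATS fragment for illustration.
JATS_SNIPPET = """
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmc">PMC0000001</article-id>
      <article-id pub-id-type="doi">10.1000/example</article-id>
    </article-meta>
  </front>
</article>
"""


def collect_ids(root: ET.Element) -> Dict[str, str]:
    """Hypothetical helper mapping each pub-id-type to its value."""
    ids: Dict[str, str] = {}
    for node in root.findall("./front/article-meta/article-id"):
        id_type = node.get("pub-id-type")
        if id_type and node.text:
            ids[id_type] = node.text.strip()
    return ids


ids = collect_ids(ET.fromstring(JATS_SNIPPET))
```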

property paragraphs: Generator[tuple[str, str], None, None]

Get all paragraphs and titles of sections they are part of.

Paragraphs can be parts of text body, or figure or table captions.

Yields
  • section (str) – The section title.

  • text (str) – The paragraph content.

parse_section(section: Element) → Generator[tuple[str, str], None, None]

Parse section children depending on the tag.

Parameters

section – The input XML element.

Yields

  • str – The section title.

  • str – A parsed string representation of the input XML element.

property pmc_id: str | None

Get PMC ID.

Returns

PMC ID if specified, otherwise None.

Return type

str or None

property pubmed_id: str | None

Get PubMed ID.

Returns

PubMed ID if specified, otherwise None.

Return type

str or None

property title: str

Get the article title.

Returns

The article title.

Return type

str

class PubMedXMLParser(data: Element | Path | str)

Bases: bluesearch.database.article.ArticleParser

Parser for PubMed abstracts.

property abstract: Iterable[str]

Get a sequence of paragraphs in the article abstract.

Returns

The paragraphs of the article abstract.

Return type

iterable of str

property authors: Iterable[str]

Get all author names.

Returns

All authors.

Return type

iterable of str

property doi: str | None

Get DOI.

Returns

DOI if specified, otherwise None.

Return type

str or None

property paragraphs: Iterable[tuple[str, str]]

Get all paragraphs and titles of sections they are part of.

Returns

For each paragraph a tuple with two strings is returned. The first is the section title, the second the paragraph content.

Return type

iterable of (str, str)

property pmc_id: str | None

Get PMC ID.

Returns

PMC ID if specified, otherwise None.

Return type

str or None

property pubmed_id: str | None

Get PubMed ID.

Returns

PubMed ID if specified, otherwise None.

Return type

str or None

property title: str

Get the article title.

Returns

The article title.

Return type

str

class TEIXMLParser(path: str | Path, is_arxiv: bool | None = False)

Bases: bluesearch.database.article.ArticleParser

Parser for TEI XML files.

Parameters
  • path – The path to a TEI XML file.

  • is_arxiv – Set to True if the TEI XML file was generated by parsing an arXiv PDF.

property abstract: Generator[str, None, None]

Get a sequence of paragraphs in the article abstract.

Yields

str – The paragraphs of the article abstract.

property arxiv_id: str | None

Get arXiv ID.

Returns

arXiv ID if specified, otherwise None.

Return type

str or None

property authors: Generator[str, None, None]

Get all author names.

Yields

str – Every author, in the format “Given_Name(s) Surname”.

property doi: str | None

Get DOI.

Returns

DOI if specified, otherwise None.

Return type

str or None

property paragraphs: Generator[tuple[str, str], None, None]

Get all paragraphs and titles of sections they are part of.

Paragraphs can be parts of text body, or figure or table captions.

Yields
  • section_title (str) – The section title.

  • text (str) – The paragraph content.

property tei_ids: dict

Extract all IDs of the TEI XML.

Returns

Dictionary containing all the IDs of the TEI XML content with the key being the ID type and the value being the ID value.

Return type

dict
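
TEI documents commonly record identifiers in `idno` elements with a `type` attribute. The sketch below builds a type-to-value dictionary from such elements. It is only an illustration: `collect_tei_ids` is a hypothetical helper, the fragment and its ID values are made up, and where exactly the library looks for `idno` elements is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict

TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"

# Made-up TEI fragment for illustration.
TEI_SNIPPET = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <biblStruct>
          <idno type="DOI">10.1000/example</idno>
          <idno type="arXiv">arXiv:2101.00001v1</idno>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
"""


def collect_tei_ids(root: ET.Element) -> Dict[str, str]:
    """Hypothetical helper mapping each idno type to its value."""
    ids: Dict[str, str] = {}
    # Elements in a default namespace are addressed as "{namespace}tag".
    for node in root.iter(f"{{{TEI_NAMESPACE}}}idno"):
        id_type = node.get("type")
        if id_type and node.text:
            ids[id_type] = node.text.strip()
    return ids


tei_ids = collect_tei_ids(ET.fromstring(TEI_SNIPPET))
```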

property title: str

Get the article title.

Returns

The article title.

Return type

str

get_arxiv_id(path: str | Path, with_prefix: bool = True) → str

Compute arXiv ID, including version, from file path.

Parameters
  • path – The file path to an arXiv article.

  • with_prefix – If True, the returned arXiv ID will include the prefix “arxiv:”.

Returns

The computed arXiv ID.

Return type

str

Raises

ValueError – If no valid arXiv ID could be inferred from the file path.

References

https://arxiv.org/help/arxiv_identifier
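
To make the behavior of get_arxiv_id concrete, here is a simplified sketch that infers an ID from the file name. It handles only new-style identifiers (YYMM.NNNN or YYMM.NNNNN with a version suffix) and assumes the ID is the file stem; the hypothetical `arxiv_id_from_path` is not the library function, which may also handle old-style identifiers and other path layouts.

```python
import re
from pathlib import Path
from typing import Union

# New-style identifiers: YYMM.NNNN or YYMM.NNNNN followed by a version suffix.
NEW_STYLE = re.compile(r"^(\d{4}\.\d{4,5}v\d+)$")


def arxiv_id_from_path(path: Union[str, Path], with_prefix: bool = True) -> str:
    """Hypothetical helper handling only new-style IDs taken from the file stem."""
    match = NEW_STYLE.match(Path(path).stem)
    if match is None:
        raise ValueError(f"Could not infer a valid arXiv ID from {path!s}.")
    arxiv_id = match.group(1)
    return f"arxiv:{arxiv_id}" if with_prefix else arxiv_id


prefixed = arxiv_id_from_path("papers/2101/2101.00001v2.pdf")
bare = arxiv_id_from_path("papers/2101/2101.00001v2.pdf", with_prefix=False)
```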