bluesearch.database.article module¶

Abstraction of scientific article data and related tools.

class Article(title: str, authors: Sequence[str], abstract: Sequence[str], section_paragraphs: Sequence[Tuple[str, str]], pubmed_id: Optional[str] = None, pmc_id: Optional[str] = None, arxiv_id: Optional[str] = None, doi: Optional[str] = None, uid: Optional[str] = None)[source]¶

Bases: mashumaro.mixins.json.DataClassJSONMixin

Abstraction of a scientific article and its contents.

abstract: Sequence[str]¶

arxiv_id: Optional[str] = None¶

authors: Sequence[str]¶

doi: Optional[str] = None¶

classmethod from_dict(d)¶

iter_paragraphs(with_abstract: bool = False) → Generator[tuple[str, str], None, None][source]¶

Iterate over all paragraphs in the article.

Parameters

with_abstract (bool) – If True, the abstract paragraphs are yielded first, before the section paragraphs.

Yields

str – Section title of the section the paragraph is in.
str – The paragraph text.
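The iteration order described above can be sketched with a small stand-in class. This is a minimal illustration, not the library's implementation: `ArticleSketch` is a hypothetical name that carries only a subset of the documented fields, and the section title "Abstract" used for abstract paragraphs is an assumption.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence, Tuple


@dataclass
class ArticleSketch:
    """Hypothetical stand-in with a subset of the documented Article fields."""

    title: str
    abstract: Sequence[str]
    section_paragraphs: Sequence[Tuple[str, str]]

    def iter_paragraphs(self, with_abstract: bool = False) -> Iterator[Tuple[str, str]]:
        # Assumption: abstract paragraphs are labelled with the title "Abstract".
        if with_abstract:
            for text in self.abstract:
                yield "Abstract", text
        # Body paragraphs follow in their stored order.
        yield from self.section_paragraphs


article = ArticleSketch(
    title="An example article",
    abstract=["Background.", "Results."],
    section_paragraphs=[
        ("Introduction", "First paragraph."),
        ("Methods", "Second paragraph."),
    ],
)
body_only = list(article.iter_paragraphs())
with_abstract = list(article.iter_paragraphs(with_abstract=True))
```

Each yielded item is a (section title, paragraph text) pair, so callers can unpack it directly in a for loop.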

classmethod parse(parser: bluesearch.database.article.ArticleParser) → bluesearch.database.article.Article[source]¶

Parse an article through a parser.

Parameters: parser – An article parser instance.

pmc_id: Optional[str] = None¶

pubmed_id: Optional[str] = None¶

section_paragraphs: Sequence[Tuple[str, str]]¶

title: str¶

to_dict()¶

uid: Optional[str] = None¶

class ArticleParser[source]¶

Bases: abc.ABC

An abstract base class for article parsers.

abstract property abstract: Iterable[str]¶

Get a sequence of paragraphs in the article abstract.

Returns: The paragraphs of the article abstract.
Return type: iterable of str

property arxiv_id: str | None¶

Get arXiv ID.

Returns: arXiv ID if specified, otherwise None.
Return type: str or None

abstract property authors: Iterable[str]¶

Get all author names.

Returns: All authors.
Return type: iterable of str

property doi: str | None¶

Get DOI.

Returns: DOI if specified, otherwise None.
Return type: str or None

static get_uid_from_identifiers(identifiers: tuple[str | None, ...]) → str[source]¶

Generate a deterministic UID from a tuple of paper identifiers.

Papers with the same values for the given identifiers get the same UID.

Missing identifiers should be passed as None, which counts as a value in its own right. As a result, the identifiers (a, None) and (a, b) produce two different UIDs.

Parameters: identifiers – Values of the identifiers.
Returns: A deterministic UID computed from the identifiers.
Return type: str
Raises: ValueError – If all identifiers are None.
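The contract above (same identifiers give the same UID, None counts as a value, all-None is rejected) can be sketched as follows. The hashing scheme is an assumption made for illustration: the function name `uid_from_identifiers_sketch`, the use of SHA-256, and the `repr`-based serialization are not taken from the library.

```python
import hashlib
from typing import Optional, Tuple


def uid_from_identifiers_sketch(identifiers: Tuple[Optional[str], ...]) -> str:
    """Hash a tuple of identifiers; the hashing scheme is an assumption."""
    if all(value is None for value in identifiers):
        raise ValueError("At least one identifier must be set.")
    # repr() keeps None distinct from any string, so both the position
    # and the absence of an identifier influence the digest.
    data = repr(identifiers).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


uid_a = uid_from_identifiers_sketch(("a", None))
uid_b = uid_from_identifiers_sketch(("a", "b"))
```

Here `uid_a` and `uid_b` differ because the second position holds None in one tuple and "b" in the other, which mirrors the (a, None) versus (a, b) case in the description.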

abstract property paragraphs: Iterable[tuple[str, str]]¶

Get all paragraphs and titles of sections they are part of.

Returns: For each paragraph a tuple with two strings is returned. The first is the section title, the second the paragraph content.
Return type: iterable of (str, str)

property pmc_id: str | None¶

Get PMC ID.

Returns: PMC ID if specified, otherwise None.
Return type: str or None

property pubmed_id: str | None¶

Get PubMed ID.

Returns: PubMed ID if specified, otherwise None.
Return type: str or None

abstract property title: str¶

Get the article title.

Returns: The article title.
Return type: str

property uid: str¶

Generate deterministic UID for an article.

The UID is usually created by hashing the identifiers of the article. If no identifier is available, then the unique ID is computed by hashing the whole content of the article.

Returns: A deterministic UID.
Return type: str

class ArticleSource(value)[source]¶

Bases: enum.Enum

The source of an article.

ARXIV = 'arxiv'¶

BIORXIV = 'biorxiv'¶

MEDRXIV = 'medrxiv'¶

PMC = 'pmc'¶

PUBMED = 'pubmed'¶

UNKNOWN = 'unknown'¶
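Because the class derives from enum.Enum, a member can be looked up from its string value. The sketch below copies the documented members into a stand-in enum so that it runs on its own. `ArticleSourceSketch` and the `to_source` helper are hypothetical names, and falling back to UNKNOWN for unrecognized strings is an assumed convention rather than documented behavior.

```python
from enum import Enum


class ArticleSourceSketch(Enum):
    """Stand-in that copies the members documented above."""

    ARXIV = "arxiv"
    BIORXIV = "biorxiv"
    MEDRXIV = "medrxiv"
    PMC = "pmc"
    PUBMED = "pubmed"
    UNKNOWN = "unknown"


def to_source(raw: str) -> ArticleSourceSketch:
    """Map a raw string to a member, using UNKNOWN when nothing matches."""
    try:
        # Calling the enum with a value performs a lookup by value.
        return ArticleSourceSketch(raw.lower())
    except ValueError:
        return ArticleSourceSketch.UNKNOWN


source = to_source("PMC")
fallback = to_source("scopus")
```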

class CORD19ArticleParser(json_file: dict)[source]¶

Bases: bluesearch.database.article.ArticleParser

Parser for CORD-19 JSON files.

Parameters: json_file – The contents of a JSON-file from the CORD-19 database.

property abstract: list[str]¶

Get a sequence of paragraphs in the article abstract.

Returns: The paragraphs of the article abstract.
Return type: list of str

property authors: Generator[str, None, None]¶

Get all author names.

Yields: str – Every author.

property paragraphs: Generator[tuple[str, str], None, None]¶

Get all paragraphs and titles of sections they are part of.

Yields

str – The section title.
str – The paragraph content.
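To illustrate the kind of (section title, paragraph) pairs this property yields, here is a minimal sketch that walks a dictionary loaded from a CORD-19 JSON file. It assumes the commonly published CORD-19 layout, in which `body_text` is a list of entries with `section` and `text` keys. The sample document and the helper `iter_body_paragraphs` are hypothetical, and the actual parser may also draw on other parts of the file.

```python
from typing import Iterator, Tuple

# Hypothetical, heavily trimmed document following the assumed CORD-19 layout.
cord19_doc = {
    "metadata": {"title": "An example title"},
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Methods", "text": "Second paragraph."},
    ],
}


def iter_body_paragraphs(doc: dict) -> Iterator[Tuple[str, str]]:
    """Yield (section title, paragraph text) pairs from the body text."""
    for entry in doc.get("body_text", []):
        # A missing section title is replaced by an empty string.
        yield entry.get("section", ""), entry["text"]


paragraphs = list(iter_body_paragraphs(cord19_doc))
```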

property pmc_id: str | None¶

Get PMC ID.

Returns: PMC ID if specified, otherwise None.
Return type: str or None

property title: str¶

Get the article title.

Returns: The article title.
Return type: str

class JATSXMLParser(xml_stream: IO)[source]¶

Bases: bluesearch.database.article.ArticleParser

Parser for JATS XML files.

This can be used for articles from PubMed Central, bioRxiv, and medRxiv.

Parameters: xml_stream – The XML stream of the article.

property abstract: Generator[str, None, None]¶

Get a sequence of paragraphs in the article abstract.

Yields: str – The paragraphs of the article abstract.

property authors: Generator[str, None, None]¶

Get all author names.

Yields: str – Every author, in the format “Given_Name(s) Surname”.

property doi: str | None¶

Get DOI.

Returns: DOI if specified, otherwise None.
Return type: str or None

classmethod from_string(xml_string: str) → bluesearch.database.article.JATSXMLParser[source]¶

Read an XML string and instantiate a JATSXMLParser.

Parameters: xml_string – Raw XML content of the article.
Returns: Parser containing the article content.
Return type: JATSXMLParser

classmethod from_xml(path: str | Path) → JATSXMLParser[source]¶

Read an XML file and instantiate a JATSXMLParser.

Parameters: path – Path to the article (with .xml extension).
Returns: Parser containing the article content.
Return type: JATSXMLParser

classmethod from_zip(path: str | Path) → JATSXMLParser[source]¶

Read the XML file from a zipped .meca folder and instantiate a JATSXMLParser.

Parameters: path – Path to the article (with .meca extension).
Returns: Parser containing the article content.
Return type: JATSXMLParser

get_ids() → dict[str, str][source]¶

Get all specified IDs of the paper.

Returns: ids – Dictionary whose keys are ID types and whose values are the corresponding ID values.
Return type: dict
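In JATS XML, article identifiers are usually stored as `article-id` elements whose `pub-id-type` attribute names the ID type. The sketch below shows one way to build such a type-to-value dictionary with the standard library. It is an illustration under that assumption, not the parser's actual code: `collect_article_ids` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import IO, Dict

# Hypothetical, minimal JATS fragment used only for this example.
JATS_SAMPLE = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmid">12345678</article-id>
      <article-id pub-id-type="pmc">7654321</article-id>
      <article-id pub-id-type="doi">10.1000/example</article-id>
    </article-meta>
  </front>
</article>"""


def collect_article_ids(xml_stream: IO) -> Dict[str, str]:
    """Map each pub-id-type attribute to the text of its article-id element."""
    root = ET.parse(xml_stream).getroot()
    ids: Dict[str, str] = {}
    for element in root.iterfind("./front/article-meta/article-id"):
        id_type = element.get("pub-id-type")
        # Skip elements that lack a type attribute or have no text.
        if id_type is not None and element.text:
            ids[id_type] = element.text.strip()
    return ids


ids = collect_article_ids(io.StringIO(JATS_SAMPLE))
```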

property paragraphs: Generator[tuple[str, str], None, None]¶

Get all paragraphs and titles of sections they are part of.

Paragraphs can be parts of text body, or figure or table captions.

Yields

section (str) – The section title.
text (str) – The paragraph content.

parse_section(section: Element) → Generator[tuple[str, str], None, None][source]¶

Parse section children depending on the tag.

Parameters

section – The input XML element.

Yields

str – The section title.
str – A parsed string representation of the input XML element.

property pmc_id: str | None¶

Get PMC ID.

Returns: PMC ID if specified, otherwise None.
Return type: str or None

property pubmed_id: str | None¶

Get PubMed ID.

Returns: PubMed ID if specified, otherwise None.
Return type: str or None

property title: str¶

Get the article title.

Returns: The article title.
Return type: str

class PubMedXMLParser(data: Element | Path | str)[source]¶

Bases: bluesearch.database.article.ArticleParser

Parser for PubMed abstracts.

property abstract: Iterable[str]¶

Get a sequence of paragraphs in the article abstract.

Returns: The paragraphs of the article abstract.
Return type: iterable of str

property authors: Iterable[str]¶

Get all author names.

Returns: All authors.
Return type: iterable of str

property doi: str | None¶

Get DOI.

Returns: DOI if specified, otherwise None.
Return type: str or None

property paragraphs: Iterable[tuple[str, str]]¶

Get all paragraphs and titles of sections they are part of.

Returns: For each paragraph a tuple with two strings is returned. The first is the section title, the second the paragraph content.
Return type: iterable of (str, str)

property pmc_id: str | None¶

Get PMC ID.

Returns: PMC ID if specified, otherwise None.
Return type: str or None

property pubmed_id: str | None¶

Get PubMed ID.

Returns: PubMed ID if specified, otherwise None.
Return type: str or None

property title: str¶

Get the article title.

Returns: The article title.
Return type: str

class TEIXMLParser(path: str | Path, is_arxiv: bool | None = False)[source]¶

Bases: bluesearch.database.article.ArticleParser

Parser for TEI XML files.

Parameters

path – The path to a TEI XML file.
is_arxiv – Set to True if the TEI XML file was generated by parsing an arXiv PDF.

property abstract: Generator[str, None, None]¶

Get a sequence of paragraphs in the article abstract.

Yields: str – The paragraphs of the article abstract.

property arxiv_id: str | None¶

Get arXiv ID.

Returns: arXiv ID if specified, otherwise None.
Return type: str or None

property authors: Generator[str, None, None]¶

Get all author names.

Yields: str – Every author, in the format “Given_Name(s) Surname”.

property doi: str | None¶

Get DOI.

Returns: DOI if specified, otherwise None.
Return type: str or None

property paragraphs: Generator[tuple[str, str], None, None]¶

Get all paragraphs and titles of sections they are part of.

Paragraphs can be parts of text body, or figure or table captions.

Yields

section_title (str) – The section title.
text (str) – The paragraph content.

property tei_ids: dict¶

Extract all IDs of the TEI XML.

Returns: Dictionary containing all the IDs of the TEI XML content with the key being the ID type and the value being the ID value.
Return type: dict

property title: str¶

Get the article title.

Returns: The article title.
Return type: str

get_arxiv_id(path: str | Path, with_prefix: bool = True) → str[source]¶

Compute arXiv ID, including version, from file path.

Parameters

path – The file path to an arXiv article.
with_prefix – If True, the returned arXiv ID will include the prefix “arxiv:”.

Returns

The computed arXiv ID.

Return type

str

Raises

ValueError – If no valid arXiv ID could be inferred from the file path.

References

https://arxiv.org/help/arxiv_identifier
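As a rough illustration of inferring an arXiv ID from a file path, the sketch below handles only new-style identifiers (YYMM.NNNN or YYMM.NNNNN followed by a version such as v2) that appear as the file name. The function name `arxiv_id_from_path_sketch` and the regular expression are assumptions. The library function may accept other path layouts and old-style identifiers.

```python
import re
from pathlib import Path
from typing import Union

# New-style identifier with a version, optionally followed by a file extension.
NEW_STYLE_NAME = re.compile(r"(\d{4}\.\d{4,5}v\d+)(\.\w+)?")


def arxiv_id_from_path_sketch(path: Union[str, Path], with_prefix: bool = True) -> str:
    """Infer a versioned, new-style arXiv ID from the file name of a path."""
    match = NEW_STYLE_NAME.fullmatch(Path(path).name)
    if match is None:
        raise ValueError(f"No valid arXiv ID could be inferred from {path!r}.")
    arxiv_id = match.group(1)
    return f"arxiv:{arxiv_id}" if with_prefix else arxiv_id


prefixed = arxiv_id_from_path_sketch("papers/2101.01234v2.pdf")
bare = arxiv_id_from_path_sketch("papers/2101.01234v2.pdf", with_prefix=False)
```

A path whose file name does not match the pattern, such as "notes.txt", raises ValueError, in line with the documented behavior.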