bluesearch.database.article module
Abstraction of scientific article data and related tools.
- class Article(title: str, authors: Sequence[str], abstract: Sequence[str], section_paragraphs: Sequence[Tuple[str, str]], pubmed_id: Optional[str] = None, pmc_id: Optional[str] = None, arxiv_id: Optional[str] = None, doi: Optional[str] = None, uid: Optional[str] = None)
Bases: mashumaro.mixins.json.DataClassJSONMixin
Abstraction of a scientific article and its contents.
- abstract: Sequence[str]
- arxiv_id: Optional[str] = None
- authors: Sequence[str]
- doi: Optional[str] = None
- classmethod from_dict(d)
- iter_paragraphs(with_abstract: bool = False) → Generator[tuple[str, str], None, None]
Iterate over all paragraphs in the article.
- Parameters
with_abstract (bool) – If True, the abstract paragraphs are yielded first, before the section paragraphs.
- Yields
str – Section title of the section the paragraph is in.
str – The paragraph text.
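The iteration order described above can be illustrated with a minimal sketch. `ArticleSketch` is a hypothetical stand-in, not the library class, and labelling abstract paragraphs with the section title "Abstract" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence, Tuple


@dataclass
class ArticleSketch:
    """Hypothetical stand-in with the same paragraph fields as Article."""

    title: str
    abstract: Sequence[str]
    section_paragraphs: Sequence[Tuple[str, str]]

    def iter_paragraphs(self, with_abstract: bool = False) -> Iterator[Tuple[str, str]]:
        # Assumption: abstract paragraphs get the section title "Abstract".
        if with_abstract:
            for text in self.abstract:
                yield "Abstract", text
        # Section paragraphs are already (section title, text) pairs.
        yield from self.section_paragraphs


article = ArticleSketch(
    title="A title",
    abstract=["First abstract paragraph."],
    section_paragraphs=[("Introduction", "Some text."), ("Methods", "More text.")],
)
paragraphs = list(article.iter_paragraphs(with_abstract=True))
```

With `with_abstract=False` only the two section paragraphs are produced.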
- classmethod parse(parser: bluesearch.database.article.ArticleParser) → bluesearch.database.article.Article
Parse an article through a parser.
- Parameters
parser – An article parser instance.
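The relationship between a parser and the resulting article can be sketched as follows. `ParserSketch`, `DictParser`, and `ArticleSketch` are hypothetical stand-ins written for this example; the sketch only assumes that `parse` reads the parser's properties and stores them on the article.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


class ParserSketch(ABC):
    """Hypothetical stand-in for an abstract parser interface."""

    @property
    @abstractmethod
    def title(self) -> str:
        """Article title."""

    @property
    @abstractmethod
    def authors(self) -> Iterable[str]:
        """Author names."""

    @property
    @abstractmethod
    def paragraphs(self) -> Iterable[Tuple[str, str]]:
        """(section title, paragraph text) pairs."""


class DictParser(ParserSketch):
    """Hypothetical concrete parser that reads from a plain dict."""

    def __init__(self, data: dict) -> None:
        self.data = data

    @property
    def title(self) -> str:
        return self.data["title"]

    @property
    def authors(self) -> Iterable[str]:
        return list(self.data["authors"])

    @property
    def paragraphs(self) -> Iterable[Tuple[str, str]]:
        # A property may be a generator, as several parsers in this module are.
        for section, texts in self.data["sections"].items():
            for text in texts:
                yield section, text


@dataclass
class ArticleSketch:
    title: str
    authors: Sequence[str]
    section_paragraphs: Sequence[Tuple[str, str]]

    @classmethod
    def parse(cls, parser: ParserSketch) -> "ArticleSketch":
        # Materialize iterables so the article no longer depends on the parser.
        return cls(parser.title, tuple(parser.authors), tuple(parser.paragraphs))


parser = DictParser(
    {"title": "T", "authors": ["A. Author"], "sections": {"Intro": ["p1", "p2"]}}
)
article = ArticleSketch.parse(parser)
```

Materializing generators into tuples inside `parse` is a design choice of this sketch: it lets the article be iterated more than once.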
- pmc_id: Optional[str] = None
- pubmed_id: Optional[str] = None
- section_paragraphs: Sequence[Tuple[str, str]]
- title: str
- to_dict()
- uid: Optional[str] = None
- class ArticleParser
Bases: abc.ABC
An abstract base class for article parsers.
- abstract property abstract: Iterable[str]
Get a sequence of paragraphs in the article abstract.
- Returns
The paragraphs of the article abstract.
- Return type
iterable of str
- property arxiv_id: str | None
Get arXiv ID.
- Returns
arXiv ID if specified, otherwise None.
- Return type
str or None
- abstract property authors: Iterable[str]
Get all author names.
- Returns
All authors.
- Return type
iterable of str
- property doi: str | None
Get DOI.
- Returns
DOI if specified, otherwise None.
- Return type
str or None
- static get_uid_from_identifiers(identifiers: tuple[str | None, ...]) → str
Generate a deterministic UID from a tuple of paper identifiers.
Papers with the same values for the given identifiers get the same UID.
A missing identifier should be passed as None, which counts as a value in its own right. Consequently, the identifiers (a, None) and (a, b) produce two different UIDs.
- Parameters
identifiers – Values of the identifiers.
- Returns
A deterministic UID computed from the identifiers.
- Return type
str
- Raises
ValueError – If all identifiers are None.
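The behaviour documented above (same identifiers give the same UID, None counts as a value, all-None input is rejected) can be sketched as below. The function name `uid_from_identifiers` and the choice of MD5 over the tuple's `repr` are assumptions for illustration; the hashing scheme actually used by the library is not stated on this page.

```python
from __future__ import annotations

import hashlib


def uid_from_identifiers(identifiers: tuple[str | None, ...]) -> str:
    """Hypothetical sketch of a deterministic UID over identifiers."""
    if all(value is None for value in identifiers):
        raise ValueError(f"All identifiers are None: {identifiers!r}")
    # repr() keeps the position of each None, so ("a", None) and ("a", "b")
    # serialize differently and therefore hash differently.
    data = repr(tuple(identifiers)).encode("utf-8")
    # Assumption: MD5 is used only as a stable fingerprint, not for security.
    return hashlib.md5(data).hexdigest()


uid_a_none = uid_from_identifiers(("a", None))
uid_a_b = uid_from_identifiers(("a", "b"))
```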
- abstract property paragraphs: Iterable[tuple[str, str]]
Get all paragraphs and titles of sections they are part of.
- Returns
For each paragraph a tuple with two strings is returned. The first is the section title, the second the paragraph content.
- Return type
iterable of (str, str)
- property pmc_id: str | None
Get PMC ID.
- Returns
PMC ID if specified, otherwise None.
- Return type
str or None
- property pubmed_id: str | None
Get Pubmed ID.
- Returns
Pubmed ID if specified, otherwise None.
- Return type
str or None
- abstract property title: str
Get the article title.
- Returns
The article title.
- Return type
str
- property uid: str
Generate deterministic UID for an article.
The UID is usually created by hashing the identifiers of the article. If no identifier is available, then the unique ID is computed by hashing the whole content of the article.
- Returns
A deterministic UID.
- Return type
str
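The fallback described above (hash the identifiers when at least one exists, otherwise hash the content) might look like the following sketch. The function `article_uid`, its parameters, and the use of MD5 over `repr` are all assumptions introduced for this example.

```python
import hashlib


def article_uid(identifiers, title, abstract, paragraphs):
    """Hypothetical sketch: identifier-based UID with a content-based fallback."""
    if any(value is not None for value in identifiers):
        # At least one identifier is known: the UID depends on identifiers only.
        data = repr(tuple(identifiers))
    else:
        # No identifier at all: fall back to the whole content of the article.
        data = repr((title, tuple(abstract), tuple(paragraphs)))
    return hashlib.md5(data.encode("utf-8")).hexdigest()


# Same DOI, different titles: the UID follows the identifier.
with_doi_1 = article_uid((None, "10.1000/xyz"), "Title A", ["abs"], [("Intro", "p")])
with_doi_2 = article_uid((None, "10.1000/xyz"), "Title B", ["abs"], [("Intro", "p")])

# No identifiers: the UID follows the content.
no_ids_1 = article_uid((None, None), "Title A", ["abs"], [("Intro", "p")])
no_ids_2 = article_uid((None, None), "Title B", ["abs"], [("Intro", "p")])
```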
- class ArticleSource(value)
Bases: enum.Enum
The source of an article.
- ARXIV = 'arxiv'
- BIORXIV = 'biorxiv'
- MEDRXIV = 'medrxiv'
- PMC = 'pmc'
- PUBMED = 'pubmed'
- UNKNOWN = 'unknown'
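Because the members are plain string values, a source can be looked up by value using standard `enum.Enum` behaviour. The sketch below redefines the enum locally so that it runs on its own; the helper `source_from_string` and its fallback to UNKNOWN are hypothetical and not part of the documented API.

```python
from enum import Enum


class ArticleSourceSketch(Enum):
    """Local copy of the documented members, for illustration only."""

    ARXIV = "arxiv"
    BIORXIV = "biorxiv"
    MEDRXIV = "medrxiv"
    PMC = "pmc"
    PUBMED = "pubmed"
    UNKNOWN = "unknown"


def source_from_string(name: str) -> ArticleSourceSketch:
    """Hypothetical helper: map a string to a member, defaulting to UNKNOWN."""
    try:
        # Enum lookup by value raises ValueError for unrecognized values.
        return ArticleSourceSketch(name.lower())
    except ValueError:
        return ArticleSourceSketch.UNKNOWN


pmc = source_from_string("PMC")
other = source_from_string("some-journal")
```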
- class CORD19ArticleParser(json_file: dict)
Bases: bluesearch.database.article.ArticleParser
Parser for CORD-19 JSON files.
- Parameters
json_file – The contents of a JSON file from the CORD-19 database.
- property abstract: list[str]
Get a sequence of paragraphs in the article abstract.
- Returns
The paragraphs of the article abstract.
- Return type
list of str
- property authors: Generator[str, None, None]
Get all author names.
- Yields
str – Every author.
- property paragraphs: Generator[tuple[str, str], None, None]
Get all paragraphs and titles of sections they are part of.
- Yields
str – The section title.
str – The paragraph content.
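As an illustration of what such a parser works with, the sketch below walks a dictionary shaped like a CORD-19 JSON file (`metadata.authors` with `first`/`middle`/`last` fields, and `body_text` entries with `section` and `text`). The helper names are hypothetical, and the sketch covers body text only; how the library treats other parts of the file is not stated here.

```python
from typing import Iterator, Tuple


def iter_cord19_authors(json_file: dict) -> Iterator[str]:
    """Hypothetical helper: yield author names as 'First Middle Last'."""
    for author in json_file["metadata"]["authors"]:
        parts = [author.get("first", ""), *author.get("middle", []), author.get("last", "")]
        # Skip empty name parts so no double spaces appear.
        yield " ".join(part for part in parts if part)


def iter_cord19_paragraphs(json_file: dict) -> Iterator[Tuple[str, str]]:
    """Hypothetical helper: yield (section title, paragraph text) pairs."""
    for entry in json_file.get("body_text", []):
        yield entry["section"], entry["text"]


sample = {
    "metadata": {
        "title": "A title",
        "authors": [{"first": "Ada", "middle": ["M"], "last": "Lovelace"}],
    },
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Methods", "text": "Second paragraph."},
    ],
}
authors = list(iter_cord19_authors(sample))
paragraphs = list(iter_cord19_paragraphs(sample))
```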
- property pmc_id: str | None
Get PMC ID.
- Returns
PMC ID if specified, otherwise None.
- Return type
str or None
- property title: str
Get the article title.
- Returns
The article title.
- Return type
str
- class JATSXMLParser(xml_stream: IO)
Bases: bluesearch.database.article.ArticleParser
Parser for JATS XML files.
This can be used for articles from PubMed Central, bioRxiv, and medRxiv.
- Parameters
xml_stream – The xml stream of the article.
- property abstract: Generator[str, None, None]
Get a sequence of paragraphs in the article abstract.
- Yields
str – The paragraphs of the article abstract.
- property authors: Generator[str, None, None]
Get all author names.
- Yields
str – Every author, in the format “Given_Name(s) Surname”.
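The "Given_Name(s) Surname" format can be illustrated with the standard library's ElementTree on a small JATS fragment. The function `iter_authors` and the sample document are hypothetical; the sketch assumes authors appear as `contrib` elements with `contrib-type="author"` and a `name` child, as in the JATS tag set.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

JATS_SAMPLE = """<article><front><article-meta><contrib-group>
<contrib contrib-type="author"><name>
  <surname>Curie</surname><given-names>Marie</given-names>
</name></contrib>
<contrib contrib-type="author"><name>
  <surname>Curie</surname><given-names>Pierre</given-names>
</name></contrib>
<contrib contrib-type="editor"><name>
  <surname>Doe</surname><given-names>Jane</given-names>
</name></contrib>
</contrib-group></article-meta></front></article>"""


def iter_authors(root: ET.Element) -> Iterator[str]:
    """Hypothetical helper: yield authors as 'Given_Name(s) Surname'."""
    # Only contributors marked as authors are kept; editors are skipped.
    for name in root.findall(".//contrib[@contrib-type='author']/name"):
        given = (name.findtext("given-names") or "").strip()
        surname = (name.findtext("surname") or "").strip()
        full_name = f"{given} {surname}".strip()
        if full_name:
            yield full_name


authors = list(iter_authors(ET.fromstring(JATS_SAMPLE)))
```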
- property doi: str | None
Get DOI.
- Returns
DOI if specified, otherwise None.
- Return type
str or None
- classmethod from_string(xml_string: str) → bluesearch.database.article.JATSXMLParser
Read an XML string and instantiate a JATSXMLParser.
- Parameters
xml_string – Raw content of the article.
- Returns
Parser containing the article content.
- Return type
JATSXMLParser
- classmethod from_xml(path: str | Path) → JATSXMLParser
Read an XML file and instantiate a JATSXMLParser.
- Parameters
path – Path to the article (with .xml extension).
- Returns
Parser containing the article content.
- Return type
JATSXMLParser
- classmethod from_zip(path: str | Path) → JATSXMLParser
Read the XML file from a zipped .meca folder and instantiate a JATSXMLParser.
- Parameters
path – Path to the article (with .meca extension).
- Returns
Parser containing the article content.
- Return type
JATSXMLParser
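Reading the article XML out of a zipped archive can be sketched with the standard `zipfile` module. The helper `read_meca_xml` is hypothetical, and the layout it expects (exactly one `.xml` file under a `content/` folder) is an assumption about the archive, not something this page specifies.

```python
import io
import zipfile


def read_meca_xml(archive) -> str:
    """Hypothetical helper: return the article XML stored in a zipped archive.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        # Assumption: the article XML is the only .xml file under content/.
        candidates = [
            name
            for name in zf.namelist()
            if name.startswith("content/") and name.endswith(".xml")
        ]
        if len(candidates) != 1:
            raise ValueError(f"Expected one XML file in content/, found {len(candidates)}.")
        return zf.read(candidates[0]).decode("utf-8")


# Build a tiny in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("manifest.xml", "<manifest/>")
    zf.writestr("content/paper.xml", "<article/>")
buffer.seek(0)
xml_text = read_meca_xml(buffer)
```

The returned string could then be passed on to a constructor such as `from_string`.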
- get_ids() → dict[str, str]
Get all specified IDs of the paper.
- Returns
ids – Dictionary whose keys are ID types and whose values are the ID values.
- Return type
dict
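A dictionary of this shape can be built from the `article-id` elements of a JATS document, which carry the ID type in their `pub-id-type` attribute. The function `collect_ids` and the sample fragment are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET
from typing import Dict

JATS_IDS_SAMPLE = """<article><front><article-meta>
<article-id pub-id-type="pmid">12345</article-id>
<article-id pub-id-type="pmc">PMC67890</article-id>
<article-id pub-id-type="doi">10.1000/xyz</article-id>
</article-meta></front></article>"""


def collect_ids(root: ET.Element) -> Dict[str, str]:
    """Hypothetical helper: map each ID type to its ID value."""
    ids: Dict[str, str] = {}
    for element in root.findall(".//article-meta/article-id"):
        id_type = element.get("pub-id-type")
        # Ignore entries without a type or without text.
        if id_type and element.text:
            ids[id_type] = element.text.strip()
    return ids


ids = collect_ids(ET.fromstring(JATS_IDS_SAMPLE))
```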
- property paragraphs: Generator[tuple[str, str], None, None]
Get all paragraphs and titles of sections they are part of.
Paragraphs can be parts of text body, or figure or table captions.
- Yields
section (str) – The section title.
text (str) – The paragraph content.
- parse_section(section: Element) → Generator[tuple[str, str], None, None]
Parse section children depending on the tag.
- Parameters
section – The input XML element.
- Yields
str – The section title.
str – A parsed string representation of the input XML element.
- property pmc_id: str | None
Get PMC ID.
- Returns
PMC ID if specified, otherwise None.
- Return type
str or None
- property pubmed_id: str | None
Get Pubmed ID.
- Returns
Pubmed ID if specified, otherwise None.
- Return type
str or None
- property title: str
Get the article title.
- Returns
The article title.
- Return type
str
- class PubMedXMLParser(data: Element | Path | str)
Bases: bluesearch.database.article.ArticleParser
Parser for PubMed abstracts.
- property abstract: Iterable[str]
Get a sequence of paragraphs in the article abstract.
- Returns
The paragraphs of the article abstract.
- Return type
iterable of str
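For orientation, the sketch below pulls abstract paragraphs out of a PubMed record with ElementTree. The function `abstract_paragraphs` and the sample record are hypothetical; the sketch assumes the usual PubMed layout in which each paragraph is an `AbstractText` element under `MedlineCitation/Article/Abstract`.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

PUBMED_SAMPLE = """<PubmedArticle><MedlineCitation>
<PMID>12345</PMID>
<Article>
  <ArticleTitle>A title</ArticleTitle>
  <Abstract>
    <AbstractText Label="BACKGROUND">Some <i>background</i> text.</AbstractText>
    <AbstractText Label="RESULTS">Some results.</AbstractText>
  </Abstract>
</Article>
</MedlineCitation></PubmedArticle>"""


def abstract_paragraphs(article: ET.Element) -> Iterator[str]:
    """Hypothetical helper: yield one string per AbstractText element."""
    for element in article.findall("./MedlineCitation/Article/Abstract/AbstractText"):
        # itertext() also collects text nested inside inline markup such as <i>.
        text = "".join(element.itertext()).strip()
        if text:
            yield text


paragraphs = list(abstract_paragraphs(ET.fromstring(PUBMED_SAMPLE)))
```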
- property authors: Iterable[str]
Get all author names.
- Returns
All authors.
- Return type
iterable of str
- property doi: str | None
Get DOI.
- Returns
DOI if specified, otherwise None.
- Return type
str or None
- property paragraphs: Iterable[tuple[str, str]]
Get all paragraphs and titles of sections they are part of.
- Returns
For each paragraph a tuple with two strings is returned. The first is the section title, the second the paragraph content.
- Return type
iterable of (str, str)
- property pmc_id: str | None
Get PMC ID.
- Returns
PMC ID if specified, otherwise None.
- Return type
str or None
- property pubmed_id: str | None
Get Pubmed ID.
- Returns
Pubmed ID if specified, otherwise None.
- Return type
str or None
- property title: str
Get the article title.
- Returns
The article title.
- Return type
str
- class TEIXMLParser(path: str | Path, is_arxiv: bool | None = False)
Bases: bluesearch.database.article.ArticleParser
Parser for TEI XML files.
- Parameters
path – The path to a TEI XML file.
is_arxiv – Set to True if the TEI XML file was generated by parsing an arXiv PDF.
- property abstract: Generator[str, None, None]
Get a sequence of paragraphs in the article abstract.
- Yields
str – The paragraphs of the article abstract.
- property arxiv_id: str | None
Get arXiv ID.
- Returns
arXiv ID if specified, otherwise None.
- Return type
str or None
- property authors: Generator[str, None, None]
Get all author names.
- Yields
str – Every author, in the format “Given_Name(s) Surname”.
- property doi: str | None
Get DOI.
- Returns
DOI if specified, otherwise None.
- Return type
str or None
- property paragraphs: Generator[tuple[str, str], None, None]
Get all paragraphs and titles of sections they are part of.
Paragraphs can be parts of text body, or figure or table captions.
- Yields
section_title (str) – The section title.
text (str) – The paragraph content.
- property tei_ids: dict
Extract all IDs of the TEI XML.
- Returns
Dictionary containing all the IDs of the TEI XML content with the key being the ID type and the value being the ID value.
- Return type
dict
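A dictionary of ID type to ID value can be sketched from the `idno` elements of a TEI header. The function `collect_tei_ids`, the sample document, and the exact element path are assumptions for illustration; note that TEI elements live in the `http://www.tei-c.org/ns/1.0` namespace, so lookups need a namespace map.

```python
import xml.etree.ElementTree as ET
from typing import Dict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><fileDesc><sourceDesc><biblStruct>
  <idno type="DOI">10.1000/xyz</idno>
  <idno type="arXiv">arXiv:2101.00001v1</idno>
</biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>"""


def collect_tei_ids(root: ET.Element) -> Dict[str, str]:
    """Hypothetical helper: map each idno type to its value."""
    # Assumption: the article's own IDs sit directly under the header biblStruct.
    path = "./tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:idno"
    ids: Dict[str, str] = {}
    for idno in root.findall(path, TEI_NS):
        id_type = idno.get("type")
        if id_type and idno.text:
            ids[id_type] = idno.text.strip()
    return ids


tei_ids = collect_tei_ids(ET.fromstring(TEI_SAMPLE))
```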
- property title: str
Get the article title.
- Returns
The article title.
- Return type
str
- get_arxiv_id(path: str | Path, with_prefix: bool = True) → str
Compute arXiv ID, including version, from file path.
- Parameters
path – The file path to an arXiv article.
with_prefix – If True, the returned arXiv ID will include the prefix “arxiv:”.
- Returns
The computed arXiv ID.
- Return type
str
- Raises
ValueError – If no valid arXiv ID could be inferred from the file path.
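Inferring an ID from a file path can be sketched as follows. The function `arxiv_id_from_path` is hypothetical and deliberately narrow: it assumes the file name (minus its extension) is a new-style arXiv identifier with a version, such as `2101.00001v1`, and it does not handle old-style identifiers.

```python
import re
from pathlib import Path

# New-style arXiv identifier with version: YYMM.NNNN or YYMM.NNNNN, then vN.
NEW_STYLE_ID = re.compile(r"\d{4}\.\d{4,5}v\d+")


def arxiv_id_from_path(path, with_prefix: bool = True) -> str:
    """Hypothetical helper: derive an arXiv ID from a file name."""
    # Path.stem drops only the last suffix, so "2101.00001v1.xml" -> "2101.00001v1".
    # A file without any extension would be truncated by stem and then rejected.
    stem = Path(path).stem
    if not NEW_STYLE_ID.fullmatch(stem):
        raise ValueError(f"No valid arXiv ID could be inferred from {path!r}.")
    return f"arxiv:{stem}" if with_prefix else stem


with_prefix = arxiv_id_from_path("data/2101.00001v1.xml")
without_prefix = arxiv_id_from_path("data/2101.00001v1.xml", with_prefix=False)
```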
References