bluesearch.database.topic module

Utilities for journal and article topics.

extract_article_topics_for_pubmed_article(xml_article: Element) → list[str] | None

Extract article topics of a PubMed article.

Parameters

xml_article – Parsed XML element of the article whose article topics are to be extracted.

Returns

article_topics – Article topics extracted for the given article.

Return type

list[str] | None
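
As a rough illustration of this kind of extraction, the sketch below assumes that article topics are the MeSH descriptors found under MedlineCitation/MeshHeadingList in the standard PubMed XML layout. The helper name mesh_descriptors and the sample XML are hypothetical, and the actual function may collect different or additional fields.

```python
from __future__ import annotations

from xml.etree.ElementTree import Element, fromstring


def mesh_descriptors(xml_article: Element) -> list[str] | None:
    """Return the MeSH descriptor names of one article, or None if it has no list."""
    # Assumption: topics are stored as MeSH headings in the PubMed XML layout.
    heading_list = xml_article.find("./MedlineCitation/MeshHeadingList")
    if heading_list is None:
        return None
    return [
        descriptor.text
        for descriptor in heading_list.findall("./MeshHeading/DescriptorName")
        if descriptor.text
    ]


# Hypothetical, heavily shortened PubMed record used only for illustration.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>1</PMID>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      <MeshHeading><DescriptorName>Neurons</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""

print(mesh_descriptors(fromstring(SAMPLE)))  # ['Humans', 'Neurons']
```

Returning None rather than an empty list mirrors the documented return type, which distinguishes "no topic information" from a list of topics.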

extract_article_topics_from_medrxiv_article(path: pathlib.Path | str) → tuple[str, str]

Extract topic of a medRxiv/bioRxiv article.

The .meca file should always have a fixed structure. Namely, it contains a folder named content, which should hold a single .xml file with the text and the metadata of the article.

Parameters

path – Path to a .meca file (which is simply a zip archive) with a fixed structure.

Returns

  • topic (str) – The subject area of the article.

  • journal (str) – The journal the article was published in. Should be either “medRxiv” or “bioRxiv”.

Raises

ValueError – If the appropriate XML file is not found, or if the journal or topic is missing.
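
The archive layout described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper names read_meca_xml and topic_and_journal are hypothetical, and the JATS-style element paths used to locate the topic and the journal are assumptions that may differ from the paths the real function uses.

```python
from __future__ import annotations

import io
import pathlib
import zipfile
from xml.etree import ElementTree


def read_meca_xml(meca: str | pathlib.Path | io.BytesIO) -> ElementTree.Element:
    """Return the root of the single XML file stored under content/ in a .meca archive."""
    with zipfile.ZipFile(meca) as archive:
        xml_names = [
            name
            for name in archive.namelist()
            if name.startswith("content/") and name.endswith(".xml")
        ]
        if len(xml_names) != 1:
            raise ValueError(f"Expected exactly one XML file, found {len(xml_names)}")
        with archive.open(xml_names[0]) as handle:
            return ElementTree.parse(handle).getroot()


def topic_and_journal(root: ElementTree.Element) -> tuple[str, str]:
    """Return (topic, journal); the element paths below are assumed, not confirmed."""
    topic = root.findtext("./front/article-meta/article-categories/subj-group/subject")
    journal = root.findtext("./front/journal-meta/journal-title-group/journal-title")
    if not topic or not journal:
        raise ValueError("Journal or topic is missing")
    return topic, journal


# Hypothetical article XML, reduced to the two fields of interest.
SAMPLE_XML = """<article>
  <front>
    <journal-meta>
      <journal-title-group><journal-title>bioRxiv</journal-title></journal-title-group>
    </journal-meta>
    <article-meta>
      <article-categories>
        <subj-group><subject>Neuroscience</subject></subj-group>
      </article-categories>
    </article-meta>
  </front>
</article>"""

# Build a tiny in-memory archive with the expected layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content/article.xml", SAMPLE_XML)
    archive.writestr("manifest.xml", "<manifest/>")

print(topic_and_journal(read_meca_xml(buffer)))  # ('Neuroscience', 'bioRxiv')
```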

extract_journal_topics_for_pubmed_article(xml_article: Element) → list[str] | None

Extract journal topics of a PubMed article.

Parameters

xml_article – Parsed XML element of the article whose journal topics are to be extracted.

Returns

journal_topics – Journal topics extracted for the given article.

Return type

list[str] | None

extract_pubmed_id_from_pmc_file(path: str | pathlib.Path) → str | None

Retrieve the PubMed ID from a PMC XML file.

Parameters

path – Path to the PMC XML file.

Returns

pubmed_id – PubMed ID of the given article, or None if the file does not contain one.

Return type

str | None
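
A lookup of this kind can be sketched with the standard library. PMC files use JATS XML, where article identifiers are article-id elements distinguished by their pub-id-type attribute; the helper name pubmed_id_from_pmc_root and the sample XML are hypothetical, and the real function works from a file path rather than a parsed root.

```python
from __future__ import annotations

from xml.etree import ElementTree


def pubmed_id_from_pmc_root(root: ElementTree.Element) -> str | None:
    """Return the PubMed ID stored in a JATS article, or None if it is absent."""
    node = root.find(".//article-id[@pub-id-type='pmid']")
    return None if node is None else node.text


# Hypothetical PMC article reduced to its identifiers.
SAMPLE_PMC = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmc">7000001</article-id>
      <article-id pub-id-type="pmid">12345678</article-id>
    </article-meta>
  </front>
</article>"""

print(pubmed_id_from_pmc_root(ElementTree.fromstring(SAMPLE_PMC)))  # 12345678
```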

get_topics_for_arxiv_articles(arxiv_paths: Iterable[pathlib.Path | str], batch_size: int = 400) → dict[pathlib.Path, list[str]]

Extract journal topics of one or more arXiv articles.

Parameters
  • arxiv_paths – Full paths to the arXiv articles to consider.

  • batch_size – Metadata are retrieved using the arXiv API [1] in batches of size batch_size. Large batch sizes may create long request URLs that cause the arXiv API to fail.

Returns

article_topics – Maps each of the paths to a list of the corresponding arXiv article topics. See [2] for an explanation of the arXiv topic taxonomy.

Return type

dict[pathlib.Path, list[str]]

Raises

ValueError – If the arXiv API does not return the expected number of metadata entries.

References

[1] https://arxiv.org/help/api/user-manual

[2] https://arxiv.org/category_taxonomy
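
The effect of batch_size on request URLs can be illustrated with a small sketch. It only builds query URLs for the arXiv API (using its id_list and max_results parameters) and sends nothing. The helper names batched and query_urls and the sample identifiers are hypothetical, and how the module derives arXiv identifiers from file paths is not shown here.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def query_urls(arxiv_ids: Iterable[str], batch_size: int = 400) -> Iterator[str]:
    """Yield one API URL per batch; smaller batches give shorter URLs."""
    for batch in batched(arxiv_ids, batch_size):
        params = {"id_list": ",".join(batch), "max_results": len(batch)}
        yield f"{ARXIV_API}?{urlencode(params)}"


# Hypothetical identifiers, for illustration only.
ids = [f"2101.{i:05d}" for i in range(1, 6)]
urls = list(query_urls(ids, batch_size=2))
print(len(urls))  # 3 requests: two full batches and one partial batch
```

Because each URL carries every identifier of its batch, the URL length grows linearly with batch_size, which is why very large batches can make requests fail.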

get_topics_for_pmc_article(pmc_path: pathlib.Path | str) → list[str] | None

Extract journal topics of a PMC article.

Parameters

pmc_path – Path to the PMC article to consider.

Returns

journal_topics – Journal topics for the given article.

Return type

list[str] | None

request_mesh_from_nlm_ta(nlm_ta: str) → list[dict] | None

Retrieve Medical Subject Headings (MeSH) from a journal’s NLM Title Abbreviation.

Parameters

nlm_ta – NLM Title Abbreviation of the journal.

Returns

meshs – List containing all MeSH entries of the journal.

Return type

list[dict] | None

References

https://www.ncbi.nlm.nih.gov/books/NBK3799/#catalog.Title_Abbreviation_ta

request_mesh_from_pubmed_id(pubmed_ids: Iterable[str]) → dict

Retrieve Medical Subject Headings from PubMed IDs.

Parameters

pubmed_ids – List of PubMed IDs.

Returns

pubmed_to_meshs – Dictionary mapping each PubMed ID to its list of Medical Subject Headings.

Return type

dict

References

https://dataguide.nlm.nih.gov/eutilities/utilities.html#efetch
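
The shape of such a request and of the resulting dictionary can be sketched as follows. The example builds an EFetch URL for the pubmed database and parses a response offline; no request is sent. The helper names efetch_url and meshs_by_pubmed_id and the sample response are hypothetical, and representing each heading as a plain descriptor string is an assumption: the real function may return richer entries per heading.

```python
from __future__ import annotations

from typing import Iterable
from urllib.parse import urlencode
from xml.etree import ElementTree

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pubmed_ids: Iterable[str]) -> str:
    """Build an EFetch URL requesting the XML records of several PubMed IDs."""
    params = {"db": "pubmed", "id": ",".join(pubmed_ids), "retmode": "xml"}
    return f"{EFETCH}?{urlencode(params)}"


def meshs_by_pubmed_id(response_xml: str) -> dict[str, list[str]]:
    """Map each PubMed ID in an EFetch response to its MeSH descriptor names."""
    root = ElementTree.fromstring(response_xml)
    result: dict[str, list[str]] = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("./MedlineCitation/PMID")
        if pmid is None:
            continue
        descriptors = article.findall(
            "./MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"
        )
        result[pmid] = [d.text for d in descriptors if d.text]
    return result


# Hypothetical, heavily shortened response used only for illustration.
SAMPLE_RESPONSE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>222</PMID></MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

print(efetch_url(["111", "222"]))
print(meshs_by_pubmed_id(SAMPLE_RESPONSE))  # {'111': ['Humans'], '222': []}
```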