bluesearch.database.topic module

Utils for journal/articles topics.

extract_article_topics_for_pubmed_article(xml_article: Element) → list[str] | None

Extract article topics of a PubMed article.

Parameters: xml_article – XML parse of an article for which to extract journal and article topics.
Returns: article_topics – Article topics extracted for the given article.
Return type: list[str] | None

extract_article_topics_from_medrxiv_article(path: pathlib.Path | str) → tuple[str, str]

Extract topic of a medRxiv/bioRxiv article.

The .meca file should always have a fixed structure: it contains a folder named content, which should hold a single .xml file with the text and the metadata of the article.

Parameters

path – Path to a .meca file (which is simply a zip archive) with a fixed structure.

Returns

topic (str) – The subject area of the article.
journal (str) – The journal the article was published in. Should be either “medRxiv” or “bioRxiv”.

Raises

ValueError – If the expected XML file is not found, or if the journal or topic is missing.
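The archive layout described above can be illustrated with a minimal sketch. It is not the module's implementation: the helper names read_meca_topic and make_demo_meca are hypothetical, and the element names subject and journal-title are assumed from the usual JATS layout rather than taken from this documentation.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def read_meca_topic(meca_file):
    """Return (topic, journal) from a .meca archive (path or file-like object).

    Hypothetical helper; the element names below are assumptions.
    """
    with zipfile.ZipFile(meca_file) as archive:
        # Keep only the XML files located inside the "content" folder.
        xml_names = [
            name
            for name in archive.namelist()
            if name.startswith("content/") and name.endswith(".xml")
        ]
        if len(xml_names) != 1:
            raise ValueError("Expected exactly one XML file in content/")
        root = ET.fromstring(archive.read(xml_names[0]))

    # Assumed JATS element names for the subject area and the journal.
    topic = root.findtext(".//subject")
    journal = root.findtext(".//journal-title")
    if topic is None or journal is None:
        raise ValueError("Journal or topic is missing")
    return topic, journal


def make_demo_meca():
    """Build a tiny in-memory archive with made-up content for illustration."""
    xml = (
        "<article><front>"
        "<journal-meta><journal-title>bioRxiv</journal-title></journal-meta>"
        "<article-meta><subj-group><subject>Neuroscience</subject></subj-group>"
        "</article-meta></front></article>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content/article.xml", xml)
    buffer.seek(0)
    return buffer


print(read_meca_topic(make_demo_meca()))
```

Raising ValueError for both the missing-file and missing-field cases mirrors the behaviour documented under Raises.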

extract_journal_topics_for_pubmed_article(xml_article: Element) → list[str] | None

Extract journal topics of a PubMed article.

Parameters: xml_article – XML parse of an article for which to extract journal and article topics.
Returns: journal_topics – Journal topics extracted for the given article.
Return type: list[str] | None

extract_pubmed_id_from_pmc_file(path: str | pathlib.Path) → str | None

Retrieve Pubmed ID from PMC XML file.

Parameters: path – Path to PMC XML.
Returns: pubmed_id – Pubmed ID of the given article.
Return type: str | None
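As an illustration of the lookup, the sketch below searches a JATS-style document for an article-id element whose pub-id-type is pmid. The helper name find_pmid is hypothetical, it takes XML text rather than a path, and the assumption that the ID is stored in that element comes from the JATS convention, not from this documentation.

```python
import xml.etree.ElementTree as ET


def find_pmid(xml_text):
    """Return the Pubmed ID found in the XML text, or None if absent.

    Hypothetical helper; assumes the JATS <article-id pub-id-type="pmid"> tag.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//article-id[@pub-id-type='pmid']")
    return None if node is None else node.text


# Made-up sample document for illustration only.
sample = (
    "<article><front><article-meta>"
    '<article-id pub-id-type="pmc">1</article-id>'
    '<article-id pub-id-type="pmid">12345</article-id>'
    "</article-meta></front></article>"
)
print(find_pmid(sample))
```

Returning None when the element is missing matches the str | None return annotation in the signature.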

get_topics_for_arxiv_articles(arxiv_paths: Iterable[pathlib.Path | str], batch_size: int = 400) → dict[pathlib.Path, list[str]]

Extract journal topics of one or more arXiv articles.

Parameters

arxiv_paths – Full paths to the arXiv articles to consider.
batch_size – Metadata are retrieved from the arXiv API [1] in batches of size batch_size. Large batch sizes may produce long request URLs that cause the arXiv API to fail.

Returns

article_topics – Maps each of the paths to a list of corresponding arXiv article topics. See [2] for an explanation of arXiv topics taxonomy.

Return type

dict[pathlib.Path, list[str]]

Raises

ValueError – If the arXiv API does not return the expected number of metadata entries.

References

[1] https://arxiv.org/help/api/user-manual
[2] https://arxiv.org/category_taxonomy
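To show why batch_size bounds the URL length, here is a minimal sketch of splitting arXiv IDs into batches and building one query URL per batch. The helper names batched and build_query_urls are hypothetical, the sample IDs are made up, and how the real function derives IDs from file paths is not covered here; the id_list and max_results parameters are those described in the arXiv API user manual [1].

```python
from itertools import islice
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def build_query_urls(arxiv_ids, batch_size=400):
    """Return one request URL per batch of IDs (hypothetical helper)."""
    urls = []
    for batch in batched(arxiv_ids, batch_size):
        # Each ID adds to the URL length, so smaller batches mean shorter URLs.
        params = {"id_list": ",".join(batch), "max_results": len(batch)}
        urls.append(f"{ARXIV_API}?{urlencode(params, safe=',')}")
    return urls


for url in build_query_urls(["2101.00001", "2101.00002", "2101.00003"], batch_size=2):
    print(url)
```

Comparing the number of entries returned for each batch against len(batch) is one way to detect the mismatch that the documented ValueError reports.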

get_topics_for_pmc_article(pmc_path: pathlib.Path | str) → list[str] | None

Extract journal topics of a PMC article.

Parameters: pmc_path – Path to the PMC article to consider.
Returns: journal_topics – Journal topics for the given article.
Return type: list[str] | None

request_mesh_from_nlm_ta(nlm_ta: str) → list[dict] | None

Retrieve Medical Subject Headings from a journal’s NLM Title Abbreviation.

Parameters: nlm_ta – NLM Title Abbreviation of the journal.
Returns: meshs – List containing all Medical Subject Headings of the journal.
Return type: list[dict] | None

References

https://www.ncbi.nlm.nih.gov/books/NBK3799/#catalog.Title_Abbreviation_ta

request_mesh_from_pubmed_id(pubmed_ids: Iterable[str]) → dict

Retrieve Medical Subject Headings from Pubmed IDs.

Parameters: pubmed_ids – List of Pubmed IDs.
Returns: pubmed_to_meshs – Dictionary with Pubmed IDs as keys and the corresponding lists of Medical Subject Headings as values.
Return type: dict

References

https://dataguide.nlm.nih.gov/eutilities/utilities.html#efetch
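The mapping from Pubmed IDs to headings can be sketched as follows, assuming the efetch endpoint is queried with db=pubmed and retmode=xml and that the response follows the usual PubmedArticle / MedlineCitation / MeshHeadingList layout. The helper names build_efetch_url and parse_mesh are hypothetical, the sample XML is made up, and the real function may return richer values than plain descriptor names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pubmed_ids):
    """Return an efetch URL requesting XML records for the given IDs."""
    params = {"db": "pubmed", "id": ",".join(pubmed_ids), "retmode": "xml"}
    return f"{EFETCH}?{urlencode(params, safe=',')}"


def parse_mesh(xml_text):
    """Map each Pubmed ID in the response to its MeSH descriptor names."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        descriptors = article.iterfind(
            "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"
        )
        result[pmid] = [descriptor.text for descriptor in descriptors]
    return result


# Made-up response fragment for illustration only.
sample_response = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    "<PMID>12345</PMID><MeshHeadingList>"
    "<MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>"
    "<MeshHeading><DescriptorName>Neurons</DescriptorName></MeshHeading>"
    "</MeshHeadingList></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)
print(build_efetch_url(["12345", "67890"]))
print(parse_mesh(sample_response))
```

Keeping URL construction separate from parsing lets the parsing step be exercised without network access.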