bluesearch.database.download module
Facilities to download articles from different sources.
- download_articles(url_list: list[str], output_dir: Path) -> None
Download articles.
- Parameters
url_list – List of URLs to query.
output_dir – Output directory in which to save the downloads. It must already exist.
- download_gcs_blob(blob: google.cloud.storage.blob.Blob, out_dir: pathlib.Path, *, flatten: bool = False) -> None
Download a Google Cloud Storage blob.
- Parameters
blob – The blob to download.
out_dir – The output directory.
flatten – If false (default), the directory structure encoded in the blob name is recreated inside the output directory; otherwise the downloaded file is placed directly into the output directory. For example, if the blob name is “my_files/subdir/file.bin” and flatten is true, the file is downloaded to “<out_dir>/file.bin”; otherwise it is placed into “<out_dir>/my_files/subdir/file.bin”.
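The effect of flatten on the destination path can be sketched without touching Google Cloud Storage. The helper below is hypothetical (resolve_target_path is not part of bluesearch) and only mirrors the path rule described above; it assumes blob names use "/" as the separator.

```python
from pathlib import Path, PurePosixPath


def resolve_target_path(blob_name: str, out_dir: Path, *, flatten: bool = False) -> Path:
    """Return the local path a blob would be written to (hypothetical helper)."""
    # Assumption: blob names are "/"-separated, independent of the local OS.
    blob_path = PurePosixPath(blob_name)
    if flatten:
        # Keep only the final component and drop the encoded directories.
        return out_dir / blob_path.name
    # Recreate the directory structure encoded in the blob name.
    return out_dir.joinpath(*blob_path.parts)


if __name__ == "__main__":
    out_dir = Path("downloads")
    print(resolve_target_path("my_files/subdir/file.bin", out_dir, flatten=True))
    print(resolve_target_path("my_files/subdir/file.bin", out_dir))
```

With flatten left at its default, the parent directories of the target may need to be created before writing; whether download_gcs_blob does this itself is not stated here.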
- download_s3_articles(bucket: ServiceResource, url_dict: dict[str, list[str]], output_dir: Path) -> None
Download articles from AWS S3.
- Parameters
bucket – AWS bucket.
url_dict – Keys represent different months. Values represent lists of the actual .meca files.
output_dir – Output directory in which to save the downloads. It is created automatically if it does not exist.
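As a rough illustration of how a month-keyed url_dict might be consumed, the sketch below walks such a dictionary and creates the output directory on demand. Everything in it is an assumption made for illustration: download_grouped and fetch are hypothetical names, the "2021-01" key format and the per-month subdirectory layout are invented, and no AWS call is made.

```python
from __future__ import annotations

from pathlib import Path, PurePosixPath
from typing import Callable


def download_grouped(
    url_dict: dict[str, list[str]],
    output_dir: Path,
    fetch: Callable[[str], bytes],
) -> list[Path]:
    """Save every file listed in url_dict under output_dir (hypothetical sketch)."""
    # Create the output directory if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for month, keys in url_dict.items():
        # Assumption: one subdirectory per month key.
        month_dir = output_dir / month
        month_dir.mkdir(exist_ok=True)
        for key in keys:
            # Name the local file after the last component of the remote key.
            target = month_dir / PurePosixPath(key).name
            target.write_bytes(fetch(key))
            written.append(target)
    return written


if __name__ == "__main__":
    import tempfile

    # Made-up keys; the real key layout in the bucket is not documented here.
    example = {"2021-01": ["2021-01/a.meca", "2021-01/b.meca"]}
    with tempfile.TemporaryDirectory() as tmp:
        paths = download_grouped(example, Path(tmp) / "new_dir", lambda key: b"")
        print([p.name for p in paths])
```

Passing the fetch step in as a callable keeps the sketch free of network access; in real use that step would be a request against the S3 bucket.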
- generate_pmc_urls(component: str, start_date: datetime, end_date: datetime | None = None) -> list[str]
Generate the list of all PMC incremental files to download.
- Parameters
component ({"author_manuscript", "oa_comm", "oa_noncomm"}) – Part of the PMC to download.
start_date – Starting date to download the incremental files.
end_date – Ending date. If None, today is considered as the ending date.
- Returns
List of all the requests to make to PMC.
- Return type
list of str
- Raises
ValueError – If the chosen component does not exist on PMC.
- get_daterange_list(start_date: datetime, end_date: datetime | None = None, delta: str = 'day') -> list[datetime]
Retrieve list of datetimes between a start date and an end date (both inclusive).
If delta=day then we discard hours, minutes, seconds and milliseconds. If delta=month then we discard days, hours, minutes, seconds and milliseconds.
- Parameters
start_date – Starting date (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.
delta ({"day", "month"}) – Time difference between two consecutive dates.
- Returns
List of all dates between the start date and the end date (both inclusive), spaced by delta.
- Return type
list of datetime
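The truncation behaviour described above can be sketched as follows. This is a standalone approximation, not the library's implementation: daterange_sketch is a hypothetical name, and resetting the day to 1 for delta="month" is an assumption about what "discarding days" means.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def daterange_sketch(
    start_date: datetime,
    end_date: datetime | None = None,
    delta: str = "day",
) -> list[datetime]:
    """Return all dates from start_date to end_date, both inclusive (sketch)."""
    if end_date is None:
        end_date = datetime.today()

    if delta == "day":
        # Discard hours, minutes, seconds and sub-second precision.
        current = start_date.replace(hour=0, minute=0, second=0, microsecond=0)
        last = end_date.replace(hour=0, minute=0, second=0, microsecond=0)
        dates = []
        while current <= last:
            dates.append(current)
            current += timedelta(days=1)
        return dates

    if delta == "month":
        # Assumption: "discarding days" means snapping to the first of the month.
        current = start_date.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        last = end_date.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        dates = []
        while current <= last:
            dates.append(current)
            # Step to the first day of the next month, rolling over the year.
            year = current.year + current.month // 12
            month = current.month % 12 + 1
            current = current.replace(year=year, month=month)
        return dates

    raise ValueError(f"Unknown delta: {delta!r}")


if __name__ == "__main__":
    for d in daterange_sketch(datetime(2020, 11, 15), datetime(2021, 2, 3), delta="month"):
        print(d.date())
```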
- get_gcs_urls(bucket: Bucket, start_date: datetime, end_date: datetime | None = None) -> dict[str, list[Blob]]
Get Google Cloud Storage URLs.
- Parameters
bucket – GCS bucket.
start_date – Starting date to download the incremental files (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.
- Returns
Keys represent different months. Values are lists of blobs corresponding to the actual PDF files.
- Return type
dict[str, list[Blob]]
- get_pubmed_urls(start_date: datetime, end_date: datetime | None = None) -> list[str]
Fetch from the Internet the list of all PubMed incremental files to download.
- Parameters
start_date – Starting date to download the incremental files (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.
- Returns
List of all the PubMed URLs.
- Return type
list of str
- get_s3_urls(bucket: ServiceResource, start_date: datetime, end_date: datetime | None = None) -> dict[str, list[str]]
Get S3 URLs.
This function sends a request to the AWS server, which incurs a charge.
- Parameters
bucket – AWS bucket.
start_date – Starting date to download the incremental files (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.
- Returns
Keys represent different months. Values represent lists of the actual .meca files.
- Return type
dict[str, list[str]]