bluesearch.database.download module

Facilities to download articles from different sources.

download_articles(url_list: list[str], output_dir: Path) → None

Download articles.

Parameters

url_list – List of URLs to query.
output_dir – Directory where the downloaded files are saved. It must already exist.

download_gcs_blob(blob: google.cloud.storage.blob.Blob, out_dir: pathlib.Path, *, flatten: bool = False) → None

Download a Google Cloud Storage blob.

Parameters

blob – The blob to download.
out_dir – The output directory.
flatten – If false (default), the directory structure encoded in the blob name is recreated under out_dir; otherwise the downloaded file is placed directly into out_dir. For example, if the blob name is “my_files/subdir/file.bin” and flatten is true, the file is downloaded to “<out_dir>/file.bin”; otherwise it is placed in “<out_dir>/my_files/subdir/file.bin”.
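The effect of flatten on the destination path can be sketched with pathlib alone. This is a minimal illustration of the rule described above, not the module's implementation; the helper name target_path is hypothetical.

```python
from pathlib import Path


def target_path(blob_name: str, out_dir: Path, *, flatten: bool = False) -> Path:
    """Return where a blob would be written (hypothetical helper, illustration only)."""
    if flatten:
        # Keep only the last component of the blob name.
        return out_dir / Path(blob_name).name
    # Recreate the directory structure encoded in the blob name.
    return out_dir / blob_name


out_dir = Path("downloads")
print(target_path("my_files/subdir/file.bin", out_dir, flatten=True))
print(target_path("my_files/subdir/file.bin", out_dir))
```

With flatten set to true, blobs that share a file name but live in different subdirectories map to the same destination, so the default is the safer choice when names may collide.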

download_s3_articles(bucket: ServiceResource, url_dict: dict[str, list[str]], output_dir: Path) → None

Download articles from AWS S3.

Parameters

bucket – AWS bucket.
url_dict – Keys represent different months. Values represent lists of the actual .meca files.
output_dir – Directory where the downloaded files are saved. It is created automatically if it does not exist.
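The shape of url_dict and the "create the output directory if missing" behaviour can be sketched without touching AWS. Everything below is an assumption made for illustration: FakeBucket is a stand-in that only mimics a download_file(key, filename) call, the month labels and keys are invented, and the one-subdirectory-per-month layout is a guess rather than the documented behaviour.

```python
import tempfile
from pathlib import Path


class FakeBucket:
    """Stand-in for a bucket object (hypothetical); records keys, writes empty files."""

    def __init__(self):
        self.requested = []

    def download_file(self, key, filename):
        self.requested.append(key)
        Path(filename).write_bytes(b"")


def download_sketch(bucket, url_dict, output_dir):
    """Download every key in url_dict, grouping files by month (assumed layout)."""
    for month, keys in url_dict.items():
        month_dir = output_dir / month
        # Create the directory tree on demand, as the output_dir description states.
        month_dir.mkdir(parents=True, exist_ok=True)
        for key in keys:
            bucket.download_file(key, str(month_dir / Path(key).name))


# Invented month labels and .meca keys, for illustration only.
example = {"2021-12": ["a/one.meca"], "2022-01": ["b/two.meca", "b/three.meca"]}
with tempfile.TemporaryDirectory() as tmp:
    bucket = FakeBucket()
    download_sketch(bucket, example, Path(tmp) / "articles")
    print(sorted(p.name for p in (Path(tmp) / "articles").rglob("*.meca")))
```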

generate_pmc_urls(component: str, start_date: datetime, end_date: datetime | None = None) → list[str]

Generate the list of all PMC incremental files to download.

Parameters

component ({"author_manuscript", "oa_comm", "oa_noncomm"}) – Part of the PMC to download.
start_date – Starting date to download the incremental files.
end_date – Ending date. If None, today is considered as the ending date.

Returns

List of all the URLs to request from PMC.

Return type

list of str

Raises

ValueError – If the chosen component does not exist on PMC.
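The contract described above (validate the component, then produce one request per day in the date range) can be sketched as follows. This is not the module's code: the URL template points at a placeholder host and is purely hypothetical, and the one-file-per-day granularity is an assumption.

```python
from datetime import datetime, timedelta

VALID_COMPONENTS = {"author_manuscript", "oa_comm", "oa_noncomm"}
# Hypothetical URL pattern for illustration only; not the real PMC layout.
URL_TEMPLATE = "https://example.org/pmc/{component}/incr.{date}.tar.gz"


def generate_urls_sketch(component, start_date, end_date=None):
    """Build one placeholder URL per day between start_date and end_date."""
    if component not in VALID_COMPONENTS:
        raise ValueError(f"Unknown PMC component: {component!r}")
    if end_date is None:
        # Mirror the documented default: today is the ending date.
        end_date = datetime.today()
    urls = []
    day = start_date
    while day.date() <= end_date.date():
        urls.append(
            URL_TEMPLATE.format(component=component, date=day.strftime("%Y-%m-%d"))
        )
        day += timedelta(days=1)
    return urls


for url in generate_urls_sketch("oa_comm", datetime(2022, 1, 30), datetime(2022, 2, 1)):
    print(url)
```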

get_daterange_list(start_date: datetime, end_date: datetime | None = None, delta: str = 'day') → list[datetime]

Retrieve list of datetimes between a start date and an end date (both inclusive).

If delta is "day", hours, minutes, seconds and milliseconds are discarded. If delta is "month", days, hours, minutes, seconds and milliseconds are discarded.

Parameters

start_date – Starting date (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.
delta ({"day", "month"}) – Time difference between two consecutive dates.

Returns

List of all dates between the start date and the end date (both inclusive), spaced by delta.

Return type

list of datetime
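The truncation and inclusive-range semantics can be illustrated with the standard library. The function below is a sketch under stated assumptions, not the module's implementation: the name daterange_sketch is hypothetical, and "discarding days" for delta="month" is interpreted here as snapping each date to the first day of its month.

```python
from datetime import datetime, timedelta


def daterange_sketch(start_date, end_date=None, delta="day"):
    """Return truncated datetimes from start_date to end_date, both inclusive."""
    if end_date is None:
        end_date = datetime.today()
    midnight = {"hour": 0, "minute": 0, "second": 0, "microsecond": 0}
    if delta == "day":
        current = start_date.replace(**midnight)
        last = end_date.replace(**midnight)
    elif delta == "month":
        # Assumption: discarding days means snapping to the first of the month.
        current = start_date.replace(day=1, **midnight)
        last = end_date.replace(day=1, **midnight)
    else:
        raise ValueError(f"Unsupported delta: {delta!r}")
    result = []
    while current <= last:
        result.append(current)
        if delta == "day":
            current += timedelta(days=1)
        else:
            # Move to the first day of the next month, rolling over the year.
            year = current.year + current.month // 12
            month = current.month % 12 + 1
            current = current.replace(year=year, month=month)
    return result


print(daterange_sketch(datetime(2021, 12, 30, 15, 30), datetime(2022, 1, 2, 1, 0)))
print(daterange_sketch(datetime(2021, 11, 15), datetime(2022, 2, 3), delta="month"))
```

Because both endpoints are truncated before the comparison, an end date earlier in the day than the start time (as in the first call) is still included.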

get_gcs_urls(bucket: Bucket, start_date: datetime, end_date: datetime | None = None) → dict[str, list[Blob]]

Get Google Cloud Storage URLs.

Parameters

bucket – GCS bucket.
start_date – Starting date to download the incremental files (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.

Returns

Keys represent different months. Values are lists of blobs corresponding to the actual PDF files.

Return type

dict[str, list[Blob]]

get_pubmed_urls(start_date: datetime, end_date: datetime | None = None) → list[str]

Retrieve from the Internet the list of all PubMed incremental files to download.

Parameters

start_date – Starting date to download the incremental files (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.

Returns

List of all the PubMed URLs.

Return type

list of str

get_s3_urls(bucket: ServiceResource, start_date: datetime, end_date: datetime | None = None) → dict[str, list[str]]

Get S3 URLs.

This function sends a real request to the AWS server, which incurs a charge.

Parameters

bucket – AWS bucket.
start_date – Starting date to download the incremental files (inclusive).
end_date – Ending date (inclusive). If None, today is considered as the ending date.

Returns

Keys represent different months. Values represent lists of the actual .meca files.

Return type

dict[str, list[str]]