bluesearch.database.download module

Facilities to download articles from different sources.

download_articles(url_list: list[str], output_dir: Path) -> None

Download the articles at the given URLs into the output directory.

Parameters
  • url_list – List of URLs to query.

  • output_dir – Output directory in which to save the downloads. It is assumed to already exist.

download_gcs_blob(blob: google.cloud.storage.blob.Blob, out_dir: pathlib.Path, *, flatten: bool = False) -> None

Download a Google Cloud Storage blob.

Parameters
  • blob – The blob to download.

  • out_dir – The output directory.

  • flatten – If false (default), the directory structure encoded in the blob name is recreated under the output directory; otherwise the downloaded file is placed directly into the output directory. For example, if the blob name is “my_files/subdir/file.bin” and flatten is true, the file is downloaded to “<out_dir>/file.bin”; otherwise it is placed into “<out_dir>/my_files/subdir/file.bin”.
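
The effect of flatten on the destination path can be illustrated with a minimal sketch. The helper name destination_path is hypothetical and is not part of bluesearch; it only computes the path described above and does not download anything.

```python
from pathlib import Path, PurePosixPath


def destination_path(blob_name: str, out_dir: Path, *, flatten: bool = False) -> Path:
    """Return where a blob named `blob_name` would be saved under `out_dir`.

    Hypothetical helper for illustration only; it mirrors the documented
    behaviour of the `flatten` flag and performs no network access.
    """
    # Blob names use "/" as the separator regardless of the local OS.
    parts = PurePosixPath(blob_name).parts
    if flatten:
        # Keep only the file name and drop the encoded directories.
        return out_dir / parts[-1]
    # Recreate the directory structure encoded in the blob name.
    return out_dir.joinpath(*parts)


nested = destination_path("my_files/subdir/file.bin", Path("out"))
flat = destination_path("my_files/subdir/file.bin", Path("out"), flatten=True)
```

Here nested points to out/my_files/subdir/file.bin and flat points to out/file.bin, matching the example in the parameter description.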

download_s3_articles(bucket: ServiceResource, url_dict: dict[str, list[str]], output_dir: Path) -> None

Download articles from AWS S3.

Parameters
  • bucket – AWS bucket.

  • url_dict – Keys represent different months. Values represent lists of the actual .meca files.

  • output_dir – Output directory to save the download. It will be automatically created in case it does not exist.
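
The shape of url_dict and the automatic creation of output_dir can be sketched as follows. This is an illustration under stated assumptions, not the bluesearch implementation: the function plan_downloads, the month key format, the example object keys, and the choice to save each file under its base name are all hypothetical.

```python
from pathlib import Path, PurePosixPath


def plan_downloads(url_dict: dict[str, list[str]], output_dir: Path) -> list[tuple[str, Path]]:
    """Pair every object key in `url_dict` with a local target path.

    Hypothetical sketch: it creates `output_dir` if needed and decides where
    each .meca file would go, but it does not contact AWS.
    """
    # Create the output directory (and any missing parents) if it does not exist.
    output_dir.mkdir(parents=True, exist_ok=True)

    targets = []
    for month, keys in url_dict.items():
        for key in keys:
            # Assumption: each file is saved under its base name only.
            targets.append((key, output_dir / PurePosixPath(key).name))
    return targets
```

A call such as plan_downloads({"2021-01": ["some/prefix/one.meca"]}, Path("downloads")) would create the downloads directory and return a single pair mapping the key to downloads/one.meca.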

generate_pmc_urls(component: str, start_date: datetime, end_date: datetime | None = None) -> list[str]

Generate the list of all PMC incremental files to download.

Parameters
  • component ({"author_manuscript", "oa_comm", "oa_noncomm"}) – Part of the PMC to download.

  • start_date – Starting date to download the incremental files.

  • end_date – Ending date. If None, today is considered as the ending date.

Returns

List of all the requests to make on PMC.

Return type

list of str

Raises

ValueError – If the chosen component does not exist on PMC.

get_daterange_list(start_date: datetime, end_date: datetime | None = None, delta: str = 'day') -> list[datetime]

Retrieve list of datetimes between a start date and an end date (both inclusive).

If delta=day then we discard hours, minutes, seconds and milliseconds. If delta=month then we discard days, hours, minutes, seconds and milliseconds.

Parameters
  • start_date – Starting date (inclusive).

  • end_date – Ending date (inclusive). If None, today is considered as the ending date.

  • delta ({"day", "month"}) – Time difference between two consecutive dates.

Returns

List of all dates (days or months, depending on delta) between the start date and the end date, both included.

Return type

list of datetime
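
The truncation and inclusive-range behaviour described above can be sketched as follows. This is a minimal illustration, not the bluesearch implementation: the name daterange_sketch is hypothetical, naive datetimes are assumed, and for delta="month" each returned value is assumed to be set to the first day of the month.

```python
from datetime import datetime, timedelta
from typing import Optional


def daterange_sketch(
    start_date: datetime,
    end_date: Optional[datetime] = None,
    delta: str = "day",
) -> list:
    """Return the truncated datetimes from start to end, both inclusive."""
    if end_date is None:
        end_date = datetime.today()

    if delta == "day":
        # Discard hours, minutes, seconds and sub-second parts.
        current = datetime(start_date.year, start_date.month, start_date.day)
        last = datetime(end_date.year, end_date.month, end_date.day)

        def step(d: datetime) -> datetime:
            return d + timedelta(days=1)

    elif delta == "month":
        # Discard the day as well (assumption: represented as day 1).
        current = datetime(start_date.year, start_date.month, 1)
        last = datetime(end_date.year, end_date.month, 1)

        def step(d: datetime) -> datetime:
            # December rolls over to January of the next year.
            return datetime(d.year + d.month // 12, d.month % 12 + 1, 1)

    else:
        raise ValueError(f"Unknown delta: {delta!r}")

    result = []
    while current <= last:
        result.append(current)
        current = step(current)
    return result


days = daterange_sketch(datetime(2021, 1, 30, 15, 45), datetime(2021, 2, 2, 1, 0))
months = daterange_sketch(datetime(2020, 11, 15), datetime(2021, 2, 1), delta="month")
```

In this sketch, days holds four entries (30 January to 2 February 2021, each at midnight) and months holds four entries (November 2020 to February 2021).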

get_gcs_urls(bucket: Bucket, start_date: datetime, end_date: datetime | None = None) -> dict[str, list[Blob]]

Get Google Cloud Storage URLs.

Parameters
  • bucket – GCS bucket.

  • start_date – Starting date to download the incremental files (inclusive).

  • end_date – Ending date (inclusive). If None, today is considered as the ending date.

Returns

Keys represent different months. Values are lists of blobs corresponding to actual PDF files.

Return type

dict[str, list[Blob]]

get_pubmed_urls(start_date: datetime, end_date: datetime | None = None) -> list[str]

Retrieve from the Internet the list of all PubMed incremental files to download.

Parameters
  • start_date – Starting date to download the incremental files (inclusive).

  • end_date – Ending date (inclusive). If None, today is considered as the ending date.

Returns

List of all the PubMed URLs.

Return type

list of str

get_s3_urls(bucket: ServiceResource, start_date: datetime, end_date: datetime | None = None) -> dict[str, list[str]]

Get S3 URLs.

This function sends an actual request to the AWS server, which incurs a charge.

Parameters
  • bucket – AWS bucket.

  • start_date – Starting date to download the incremental files (inclusive).

  • end_date – Ending date (inclusive). If None, today is considered as the ending date.

Returns

Keys represent different months. Values represent lists of the actual .meca files.

Return type

dict[str, list[str]]