Downloads the full hathifile catalog, which holds basic metadata for the more than 17 million digitized volumes in the Hathi Trust digital library. It can be used together with workset_builder and rsync to select a suitable sample of Hathi Trust Extracted Features files and metadata for further analysis. Warning: the full catalog is a 1 GB file. A new version is published every month; if the latest version has already been downloaded, the function returns its file name without downloading it again.

download_hathifile(
  url = "https://www.hathitrust.org/hathifiles",
  dir = getOption("hathiTools.hathifile.dir"),
  full_catalog = TRUE
)

Arguments

url

The URL of the Hathi Trust hathifiles page. Defaults to https://www.hathitrust.org/hathifiles.

dir

The directory in which to save the downloaded hathifile. Defaults to getOption("hathiTools.hathifile.dir"), which is set to ./raw-hathifiles when the package is loaded. If the directory does not already exist, it is created when you call the function.

full_catalog

Whether to download the full catalog (more than 17 million records) or only the latest update file. A new update file is published every day, and a new version of the full catalog every month. Defaults to TRUE (download the full catalog).
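
As a rough illustration of how the dir and full_catalog arguments fit together, the sketch below sets the hathiTools.hathifile.dir option and then requests only the latest update file. The directory under tempdir() and the run_download switch are assumptions made for this example, not part of the package; the download itself is guarded so that nothing is fetched unless you opt in.

```r
# Hypothetical location; any writable directory would do.
hathi_dir <- file.path(tempdir(), "raw-hathifiles")
options(hathiTools.hathifile.dir = hathi_dir)

# The default value of `dir` is read from this option.
getOption("hathiTools.hathifile.dir")

# Illustrative switch (not part of the package): set to TRUE to hit the network.
run_download <- FALSE
update_file <- NULL

if (run_download && requireNamespace("hathiTools", quietly = TRUE)) {
  # Fetch only the latest update file rather than the full catalog.
  update_file <- hathiTools::download_hathifile(full_catalog = FALSE)
}
```

Setting the option once per session means later calls can omit dir entirely.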

Value

The file name of the downloaded hathifile.
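
The "return the file name without downloading again" behaviour described above can be sketched in a few lines of base R. This is not the package's implementation: fetch_once, fake_fetch, and the example.org URL are hypothetical names introduced only to show the general pattern of creating the target directory, skipping the download when the file exists, and returning the path either way.

```r
# Hypothetical helper: download `url` into `dir` unless the file is already there.
# `fetch` is injectable so the sketch can be tried without network access.
fetch_once <- function(url, dir,
                       fetch = function(url, destfile) {
                         utils::download.file(url, destfile, mode = "wb")
                       }) {
  # Create the target directory on first use.
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dest <- file.path(dir, basename(url))
  # Only download when no local copy exists yet.
  if (!file.exists(dest)) fetch(url, dest)
  dest
}

# Stand-in downloader that writes a placeholder file and counts its calls.
calls <- 0
fake_fetch <- function(url, destfile) {
  calls <<- calls + 1
  writeLines("placeholder", destfile)
}

demo_dir <- tempfile("raw-hathifiles-")                      # fresh, unique directory
demo_url <- "https://example.org/hathi_full_example.txt.gz"  # hypothetical URL

first  <- fetch_once(demo_url, demo_dir, fetch = fake_fetch)
second <- fetch_once(demo_url, demo_dir, fetch = fake_fetch)
```

Both calls return the same path, and the stand-in downloader runs only on the first call.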