This function loads a previously downloaded hathifile into memory, or downloads the latest one if it cannot find one. It also converts the column us_gov_doc_flag to a logical value (TRUE or FALSE) and, when fix_date = TRUE, replaces 9999 values in rights_date_used with NA.

load_raw_hathifile(
  filename = NULL,
  dir = getOption("hathiTools.hathifile.dir"),
  hathifile_date = getOption("hathiTools.hathifile"),
  cols,
  fix_date = TRUE
)

Arguments

filename

The name of the downloaded hathifile. If NULL, it will attempt to guess it from getOption("hathiTools.hathifile.dir") and getOption("hathiTools.hathifile"); if it can't find it or the file doesn't exist, it will attempt to download it to the directory in getOption("hathiTools.hathifile.dir") using download_hathifile.
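The lookup described above can be pictured with a small sketch. This is not the package's implementation: resolve_hathifile is a hypothetical helper, and both the hathi_full_<date>.txt.gz file name pattern and the way filename is joined to dir are assumptions made for illustration. It only reports whether a download would be needed; it does not call download_hathifile.

```r
# Hypothetical helper (not part of hathiTools): build a candidate path
# and report whether the file still has to be downloaded.
resolve_hathifile <- function(filename = NULL, dir, hathifile_date) {
  if (is.null(filename)) {
    # Assumed naming pattern, used here only for illustration.
    filename <- paste0("hathi_full_", hathifile_date, ".txt.gz")
  }
  path <- file.path(dir, filename)
  list(path = path, needs_download = !file.exists(path))
}

# Toy setup: a temporary directory containing one fake hathifile.
toy_dir <- file.path(tempdir(), "hathifile_sketch")
dir.create(toy_dir, showWarnings = FALSE)
file.create(file.path(toy_dir, "hathi_full_20230101.txt.gz"))

found <- resolve_hathifile(dir = toy_dir, hathifile_date = "20230101")
missing <- resolve_hathifile(dir = toy_dir, hathifile_date = "20230201")
```

Here found$needs_download is FALSE because the file exists, while missing$needs_download is TRUE, which is the case in which the real function would fall back to downloading.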

dir

The directory where the raw hathifile is to be found.

hathifile_date

The date of the hathifile (a new one is released every month). This defaults to getOption("hathiTools.hathifile"), which is the date of the last downloaded hathifile.

cols

Optionally, a set of columns to load. Since the file is so large, one can reduce memory use by selecting only certain columns. These can be any of the following: htid (required), access, rights, ht_bib_key, description, source, source_bib_num, oclc_num, isbn, issn, lccn, title, imprint, rights_reason_code, rights_timestamp, us_gov_doc_flag, rights_date_used, pub_place, lang, bib_fmt, collection_code, content_provider_code, responsible_entity_code, digitization_agent_code, access_profile_code, author. If cols = "REDUCED", the function loads a reduced set of columns: htid, ht_bib_key, description, source, title, imprint, rights_date_used, us_gov_doc_flag, lang, bib_fmt, and author.
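To illustrate why selecting columns saves memory, here is a minimal sketch in base R. It does not show how load_raw_hathifile reads the file; it assumes only that a hathifile is tab-separated with no header row, and uses a toy three-column file in place of the real one. Columns that are not wanted are skipped at read time, so they are never held in memory.

```r
# Toy stand-in for a hathifile: tab-separated, no header row.
# The three column names are a subset chosen for illustration.
toy_cols <- c("htid", "access", "title")
toy_path <- tempfile(fileext = ".txt")
writeLines(
  c("mdp.001\tallow\tFirst title",
    "mdp.002\tdeny\tSecond title"),
  toy_path
)

# Keep htid and title; a colClasses entry of "NULL" tells read.delim
# to skip that column entirely while parsing.
wanted <- c("htid", "title")
col_classes <- ifelse(toy_cols %in% wanted, "character", "NULL")

toy <- read.delim(
  toy_path,
  header = FALSE,
  col.names = toy_cols,
  colClasses = col_classes,
  quote = "",
  stringsAsFactors = FALSE
)
```

With the real function, the equivalent request would be passing a character vector such as c("htid", "title") as cols.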

fix_date

If TRUE (the default), replaces 9999 values in rights_date_used with NA.
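The two clean-up steps mentioned on this page (the logical conversion of us_gov_doc_flag and the 9999 replacement) can be sketched on a toy data frame. This is an illustration in base R, not the package's code, and it assumes the raw flag is coded as 0/1.

```r
# Toy data frame standing in for a few hathifile rows.
toy <- data.frame(
  htid = c("mdp.001", "mdp.002", "mdp.003"),
  us_gov_doc_flag = c(0L, 1L, 0L),       # assumed raw coding: 0 or 1
  rights_date_used = c(1923L, 9999L, 1887L)
)

# Convert the flag to TRUE/FALSE.
toy$us_gov_doc_flag <- as.logical(toy$us_gov_doc_flag)

# Treat the placeholder year 9999 as missing.
toy$rights_date_used[toy$rights_date_used == 9999L] <- NA_integer_
```

After this step, summaries such as range(toy$rights_date_used, na.rm = TRUE) are no longer distorted by the 9999 placeholder.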

Value

A very large tibble, with over 17 million records, loaded into memory. The tibble package does some lazy loading to minimize resource use, but fully loaded this data frame takes over 5 GB in memory.