This function loads a previously downloaded hathifile into memory (or downloads the latest one if it can't find it). It also turns the column us_gov_doc_flag into a logical value (TRUE or FALSE) and eliminates 9999 values for rights_date_used (sets them to NA).
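A minimal sketch of typical use. The function name load_raw_hathifile() is an assumption here (this section does not state the name of the function it documents); everything else follows the description above.

library(hathiTools)

# Load the full hathifile into memory (or download the latest one
# if no previously downloaded file can be found). This is slow and
# memory-hungry; see the Value section below.
hf <- load_raw_hathifile()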
The name of the downloaded hathifile. If NULL, it will attempt to guess it from getOption("hathiTools.hathifile.dir") and getOption("hathiTools.hathifile"); if it can't find it or the file doesn't exist, it will attempt to download it to the directory in getOption("hathiTools.hathifile.dir") using download_hathifile().
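For example, one can set the two options and let the function locate (or download) the file itself; a sketch, again assuming the function is called load_raw_hathifile(), with a hypothetical directory and date:

# Hypothetical values; use the directory and date of your own download.
options(hathiTools.hathifile.dir = "~/hathifiles",
        hathiTools.hathifile = "20230901")

# With no explicit filename, the function guesses it from the options above.
hf <- load_raw_hathifile()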
The directory where the raw hathifile is to be found.
The date of the hathifile (a new one is released every month). This defaults to getOption("hathiTools.hathifile"), which is just the date of the last downloaded hathifile.
If desired, a set of columns to load. Since the file is so large, one can reduce memory use by selecting only certain columns, as illustrated in the sketch after this list. These can be any of the following: htid (required), access, rights, ht_bib_key, description, source, source_bib_num, oclc_num, isbn, issn, lccn, title, imprint, rights_reason_code, rights_timestamp, us_gov_doc_flag, rights_date_used, pub_place, lang, bib_fmt, collection_code, content_provider_code, responsible_entity_code, digitization_agent_code, access_profile_code, and author. If cols = "REDUCED", the function loads a reduced set of columns: htid, ht_bib_key, description, source, title, imprint, rights_date_used, us_gov_doc_flag, lang, bib_fmt, and author.
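An illustration of column selection; the cols argument name comes from the text above, while load_raw_hathifile() remains an assumed function name:

# Load only a handful of columns to keep memory use down.
hf_small <- load_raw_hathifile(cols = c("htid", "title", "rights_date_used"))

# Or use the built-in reduced column set.
hf_reduced <- load_raw_hathifile(cols = "REDUCED")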
Fixes 9999 values in rights_date_used by changing them to NA. Default is TRUE.
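If the raw 9999 sentinel values are needed, this fix can be turned off; a sketch in which fix_date is an assumed argument name (this section describes the behavior but does not state the name):

# Keep the 9999 sentinel values instead of converting them to NA.
# "fix_date" is an assumed argument name for the flag described above.
hf_raw_dates <- load_raw_hathifile(cols = "REDUCED", fix_date = FALSE)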
A very large tibble, with over 17 million records, loaded into memory. The tibble package does some lazy loading to minimize resource use, but fully loaded this data frame takes over 5 GB of memory.
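Given the size of the result, it often pays to subset it immediately after loading; a sketch using dplyr, with load_raw_hathifile() still an assumed function name:

library(dplyr)

# Keep only English-language volumes with a usable rights date.
hf_eng <- load_raw_hathifile(cols = "REDUCED") |>
  filter(lang == "eng", !is.na(rights_date_used))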