R/cache_tools.R
cache_htids.Rd
This function takes a set of Hathi Trust IDs (whose Extracted Features files have usually already been downloaded via rsync_from_hathi) and caches the downloaded JSON files to another format (e.g., csv, rds, or parquet) alongside them. A typical workflow with this package involves selecting an appropriate set of Hathi Trust IDs (via workset_builder), downloading their Extracted Features files to your local machine (via rsync_from_hathi), caching these slow-to-load JSON Extracted Features files to a faster-loading format using cache_htids, and then using read_cached_htids to read them into a single data frame or arrow Dataset for further work.
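A minimal sketch of that workflow (the search term is purely illustrative):

workset <- workset_builder("liberty")   # select a set of Hathi Trust IDs
rsync_from_hathi(workset)               # download their JSON EF files
cache_htids(workset)                    # cache them to a faster-loading format
vols <- read_cached_htids(workset)      # read into a single data frame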
htids: A character vector of Hathi Trust ids, a workset created with workset_builder, or a data frame with a column named "htid" containing the Hathi Trust ids that require caching. If the JSON Extracted Features files for these htids have not been downloaded to dir via rsync_from_hathi or get_hathi_counts, nothing will be cached (unless attempt_rsync is TRUE).
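A sketch of two of these input forms, using the htids from the examples below:

cache_htids(c("mdp.39015008706338", "mdp.39015058109706"))  # character vector
cache_htids(data.frame(htid = c("mdp.39015008706338",
                                "mdp.39015058109706")))     # data frame with an "htid" column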
dir: The directory where the downloaded Extracted Features files are to be found. Defaults to getOption("hathiTools.ef.dir"), which is just "hathi-ef" on load.
cache_type: Type of information cached. The default is c("ef", "meta", "pagemeta"), which refers to the extracted features, the volume metadata, and the page metadata. Omitting one of these types caches (or finds) only the rest (e.g., cache_type = "ef" caches only the EF files, not their associated volume or page metadata).
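For example, to cache the EF files and volume metadata while skipping the page metadata:

cache_htids(htids, cache_type = c("ef", "meta"))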
cache_format: File format of the cache for Extracted Features files. Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on load. Allowed cache formats are: compressed csv (the default), "none" (no local caching; only the downloaded JSON files are kept), "rds", "feather" and "parquet" (suitable for use with arrow; requires the arrow package to be installed), or "text2vec.csv" (a csv suitable for use with the text2vec package).
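For example, to cache to parquet for later use with arrow (assuming the arrow package is installed):

cache_htids(htids, cache_format = "parquet")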
keep_json: Whether to keep the downloaded JSON files. Default is TRUE; if FALSE, only the local cached files (e.g., the csv files) are kept and the associated JSON files are deleted.
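For example, to save disk space by discarding the JSON files once they are cached:

cache_htids(htids, keep_json = FALSE)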
attempt_rsync: If TRUE, and some JSON EF files are not found in dir, the function will call rsync_from_hathi to attempt to download these first.
attempt_parallel: Default is FALSE. If TRUE, the function will attempt to use the furrr package to cache files in parallel. You will need to call future::plan() beforehand to determine the specific parallel strategy to be used; plan(multisession) usually works fine.
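A minimal parallel-caching sketch:

library(future)
plan(multisession)  # choose a parallel strategy before caching
cache_htids(htids, attempt_parallel = TRUE)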
A tibble with the paths of the cached files and an indicator of whether each htid was successfully cached.
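Since the result includes an exists column (visible in the examples below), it can be used to check which htids still lack a cached file; a quick sketch:

res <- cache_htids(htids)
res[!res$exists, ]  # htids whose cache files were not created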
# \donttest{
htids <- c("mdp.39015008706338", "mdp.39015058109706")
dir <- tempdir()
# Caches nothing (nothing has been downloaded to `dir`):
cache_htids(htids, dir = dir, cache_type = "ef")
#> 2 HTIDs cannot be cached, since their JSON EF files have not been downloaded or do not exist in the Hathi Trust rsync server.
#> Try using rsync_from_hathi(htids) to download them.
#> All existing JSON files already cached to required formats.
#> # A tibble: 0 × 5
#> # … with 5 variables: htid <chr>, local_loc <glue>, cache_format <chr>,
#> # cache_type <chr>, exists <lgl>
# Tries to rsync first, then caches
cache_htids(htids, dir = dir, cache_type = "ef", attempt_rsync = TRUE)
#> Attempting to rsync 2 Hathi Trust IDs before caching
#> Preparing to cache 2 EF files to /tmp/RtmpdUz0R0 (../..)
#> Now caching EF file for mdp.39015008706338
#> Now caching EF file for mdp.39015058109706
#> # A tibble: 2 × 5
#> htid local_loc cache…¹ cache…² exists
#> <chr> <glue> <chr> <chr> <lgl>
#> 1 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… csv.gz ef TRUE
#> 2 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… csv.gz ef TRUE
#> # … with abbreviated variable names ¹cache_format, ²cache_type
# }