R/cache_tools.R
cache_htids.Rd
This function takes a set of Hathi Trust IDs (whose Extracted Features files have usually already been downloaded via rsync_from_hathi) and caches the downloaded JSON files to another format (e.g., csv, rds, or parquet) alongside them. A typical workflow with this package involves selecting an appropriate set of Hathi Trust IDs (via workset_builder), downloading their Extracted Features files to your local machine (via rsync_from_hathi), caching these slow-to-load JSON Extracted Features files to a faster-loading format using cache_htids, and then using read_cached_htids to read them into a single data frame or arrow Dataset for further work.
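A minimal sketch of that workflow (the search term is purely illustrative):

workset <- workset_builder("liberty")   # select a set of Hathi Trust IDs
rsync_from_hathi(workset)               # download their JSON EF files
cache_htids(workset)                    # cache them to a faster-loading format
vols <- read_cached_htids(workset)      # read into a single data frame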
htids: A character vector of Hathi Trust ids, a workset created with workset_builder, or a data frame with a column named "htid" containing the Hathi Trust ids that require caching. If the JSON Extracted Features files for these htids have not been downloaded to dir via rsync_from_hathi or get_hathi_counts, nothing will be cached (unless attempt_rsync is TRUE).
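A sketch of two of these input forms, using the htids from the examples below:

cache_htids(c("mdp.39015008706338", "mdp.39015058109706"))  # character vector
cache_htids(data.frame(htid = c("mdp.39015008706338",
                                "mdp.39015058109706")))     # data frame with an "htid" column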
dir: The directory where the downloaded Extracted Features files are to be found. Defaults to getOption("hathiTools.ef.dir"), which is just "hathi-ef" on load.
cache_type: Type of information cached. The default is c("ef", "meta", "pagemeta"), which refers to the extracted features, the volume metadata, and the page metadata. Omitting one of these types caches (or finds) only the rest (e.g., cache_type = "ef" caches only the EF files, not their associated volume or page metadata).
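For example, to cache the EF files and volume metadata while skipping the page metadata:

cache_htids(htids, cache_type = c("ef", "meta"))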
cache_format: File format of the cache for Extracted Features files. Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on load. Allowed cache formats are: compressed csv (the default), "none" (no local caching; only the downloaded JSON files are kept), "rds", "feather" and "parquet" (suitable for use with arrow; requires the arrow package to be installed), or "text2vec.csv" (a csv suitable for use with the text2vec package).
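For example, to cache to parquet for later use with arrow (assuming the arrow package is installed):

cache_htids(htids, cache_format = "parquet")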
keep_json: Whether to keep the downloaded JSON files. Default is TRUE; if FALSE, only the local cached files (e.g., the csv files) are kept and the associated JSON files are deleted.
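For example, to save disk space by discarding the JSON files once they are cached:

cache_htids(htids, keep_json = FALSE)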
attempt_rsync: If TRUE, and some JSON EF files are not found in dir, the function will call rsync_from_hathi to attempt to download these first.
attempt_parallel: Default is FALSE. If TRUE, the function will attempt to use the furrr package to cache files in parallel. You will need to call future::plan() beforehand to determine the specific parallel strategy to be used; plan(multisession) usually works fine.
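A minimal parallel-caching sketch:

library(future)
plan(multisession)  # choose a parallel strategy before caching
cache_htids(htids, attempt_parallel = TRUE)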
A tibble with the paths of the cached files and an indicator of whether each htid was successfully cached.
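Since the result includes an exists column (visible in the examples below), it can be used to check which htids still lack a cached file; a quick sketch:

res <- cache_htids(htids)
res[!res$exists, ]  # htids whose cache files were not created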
# \donttest{
htids <- c("mdp.39015008706338", "mdp.39015058109706")
dir <- tempdir()
# Caches nothing (nothing has been downloaded to `dir`):
cache_htids(htids, dir = dir, cache_type = "ef")
#> 2 HTIDs cannot be cached, since their JSON EF files have not been downloaded or do not exist in the Hathi Trust rsync server.
#> Try using rsync_from_hathi(htids) to download them.
#> All existing JSON files already cached to required formats.
#> # A tibble: 0 × 5
#> # … with 5 variables: htid <chr>, local_loc <glue>, cache_format <chr>,
#> # cache_type <chr>, exists <lgl>
# Tries to rsync first, then caches
cache_htids(htids, dir = dir, cache_type = "ef", attempt_rsync = TRUE)
#> Attempting to rsync 2 Hathi Trust IDs before caching
#> Preparing to cache 2 EF files to /tmp/RtmpdUz0R0 (../..)
#> Now caching EF file for mdp.39015008706338
#> Now caching EF file for mdp.39015058109706
#> # A tibble: 2 × 5
#> htid local_loc cache…¹ cache…² exists
#> <chr> <glue> <chr> <chr> <lgl>
#> 1 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… csv.gz ef TRUE
#> 2 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… csv.gz ef TRUE
#> # … with abbreviated variable names ¹cache_format, ²cache_type
# }