R/cache_tools.R
find_cached_htids.Rd
Finds cached Extracted Features files for a set of HT ids
A character vector of Hathi Trust ids, a workset created with
workset_builder, or a data frame with a column named "htid" containing
the Hathi Trust ids that require caching. If the JSON Extracted Features
files for these htids have not been downloaded via rsync_from_hathi or
get_hathi_counts to dir, nothing will be cached (unless attempt_rsync is
TRUE).
The directory where the downloaded Extracted Features files are to
be found. Defaults to getOption("hathiTools.ef.dir"), which is just
"hathi-ef" on load.
Type of information cached. The default is c("ef", "meta",
"pagemeta"), which refers to the extracted features, the volume metadata,
and the page metadata. Omitting one of these caches or finds only the rest
(e.g., cache_type = "ef" caches only the EF files, not their associated
metadata or page metadata).
File format of cache for Extracted Features files.
Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on
load. Allowed cache formats are: "csv.gz" (compressed csv, the default),
"none" (no local caching of JSON download; only JSON file kept), "rds",
"feather" and "parquet" (suitable for use with arrow; needs the arrow
package installed), or "text2vec.csv" (a csv suitable for use with the
package text2vec).
Whether to return only file paths to files that actually
exist. Default is TRUE. Use FALSE to find whether some files still need
to be cached.
A tibble with the paths of the cached files and an indicator of whether each htid has an existing cached file.
# \donttest{
htids <- c("mdp.39015008706338", "mdp.39015058109706")
dir <- tempdir()
# Finds nothing (nothing has been downloaded or cached to `dir`):
find_cached_htids(htids, cache_format = c("none", "csv"), dir = dir)
#> # A tibble: 0 × 5
#> # … with 5 variables: htid <chr>, local_loc <glue>, cache_format <chr>,
#> # cache_type <chr>, exists <lgl>
cache_htids(htids, dir = dir, cache_type = "ef", attempt_rsync = TRUE)
#> Attempting to rsync 2 Hathi Trust IDs before caching
#> Preparing to cache 2 EF files to /tmp/RtmpdUz0R0 (../..)
#> Now caching EF file for mdp.39015008706338
#> Now caching EF file for mdp.39015058109706
#> # A tibble: 2 × 5
#> htid local_loc cache…¹ cache…² exists
#> <chr> <glue> <chr> <chr> <lgl>
#> 1 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… csv.gz ef TRUE
#> 2 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… csv.gz ef TRUE
#> # … with abbreviated variable names ¹cache_format, ²cache_type
# Finds the cached files and their JSON ef files
find_cached_htids(htids, cache_format = c("none", "csv"), dir = dir)
#> # A tibble: 4 × 5
#> htid local_loc cache…¹ cache…² exists
#> <chr> <glue> <chr> <chr> <lgl>
#> 1 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… json.b… none TRUE
#> 2 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… json.b… none TRUE
#> 3 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… csv.gz ef TRUE
#> 4 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… csv.gz ef TRUE
#> # … with abbreviated variable names ¹cache_format, ²cache_type
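# A possible follow-up, sketched here and not part of the original example.
# It assumes the argument described above ("Whether to return only file
# paths to files that actually exist") is named `existing_only`, and it
# reuses `htids` and `dir` from the lines above. With existing_only = FALSE,
# paths for missing files should be returned as well, so rows where
# `exists` is FALSE are the ones that still need caching. Only
# cache_type = "ef" was cached above, so the "meta" and "pagemeta" rows are
# expected to appear here.
status <- find_cached_htids(htids, dir = dir, existing_only = FALSE)
status[!status$exists, ]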
# }