Finds cached Extracted Features files for a set of HT ids

find_cached_htids(
  htids,
  dir = getOption("hathiTools.ef.dir"),
  cache_type = c("ef", "meta", "pagemeta"),
  cache_format = getOption("hathiTools.cacheformat"),
  existing_only = TRUE
)

Arguments

htids

A character vector of Hathi Trust ids, a workset created with workset_builder, or a data frame with a column named "htid" containing the Hathi Trust ids to look up. If the JSON Extracted Features files for these htids have not been downloaded to dir via rsync_from_hathi or get_hathi_counts, nothing will be found. This function has no attempt_rsync argument; to download and cache missing files, use cache_htids with attempt_rsync = TRUE.

dir

The directory where the downloaded Extracted Features files are to be found. Defaults to getOption("hathiTools.ef.dir"), which is "hathi-ef" on load.

cache_type

Type of information cached. The default is c("ef", "meta", "pagemeta"), which refers to the extracted features, the volume metadata, and the page metadata. Omitting any of these restricts caching or finding to the remaining types (e.g., cache_type = "ef" caches or finds only the EF files, not their associated metadata or page metadata).

cache_format

File format of the cache for Extracted Features files. Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on load. Allowed cache formats are: "csv.gz" (compressed csv, the default), "none" (no local caching of the JSON download; only the JSON file is kept), "rds", "feather" and "parquet" (suitable for use with arrow; needs the arrow package installed), or "text2vec.csv" (a csv suitable for use with the package text2vec).

existing_only

Whether to return only file paths to files that actually exist. Default is TRUE. Use FALSE to find whether some files still need to be cached.

Value

A tibble with the paths of the cached files and an indicator of whether each htid has an existing cached file.

Examples

# \donttest{
htids <- c("mdp.39015008706338", "mdp.39015058109706")
dir <- tempdir()

# Finds nothing (nothing has been downloaded or cached to `dir`):

find_cached_htids(htids, cache_format = c("none", "csv"), dir = dir)
#> # A tibble: 0 × 5
#> # … with 5 variables: htid <chr>, local_loc <glue>, cache_format <chr>,
#> #   cache_type <chr>, exists <lgl>

cache_htids(htids, dir = dir, cache_type = "ef", attempt_rsync = TRUE)
#> Attempting to rsync 2 Hathi Trust IDs before caching
#> Preparing to cache 2 EF files to /tmp/RtmpdUz0R0 (../..) 
#> Now caching EF file for mdp.39015008706338
#> Now caching EF file for mdp.39015058109706
#> # A tibble: 2 × 5
#>   htid               local_loc                            cache…¹ cache…² exists
#>   <chr>              <glue>                               <chr>   <chr>   <lgl> 
#> 1 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… csv.gz  ef      TRUE  
#> 2 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… csv.gz  ef      TRUE  
#> # … with abbreviated variable names ¹​cache_format, ²​cache_type

# Finds the cached files and their JSON EF files:

find_cached_htids(htids, cache_format = c("none", "csv"), dir = dir)
#> # A tibble: 4 × 5
#>   htid               local_loc                            cache…¹ cache…² exists
#>   <chr>              <glue>                               <chr>   <chr>   <lgl> 
#> 1 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… json.b… none    TRUE  
#> 2 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… json.b… none    TRUE  
#> 3 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… csv.gz  ef      TRUE  
#> 4 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… csv.gz  ef      TRUE  
#> # … with abbreviated variable names ¹​cache_format, ²​cache_type
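
# A hedged sketch (not part of the original examples): list the htids that
# still need caching. It reuses `htids` and `dir` from above and assumes that,
# with existing_only = FALSE, rows for files not yet on disk are returned with
# exists = FALSE. `status` and `needs_caching` are illustrative names.

status <- find_cached_htids(htids, dir = dir, cache_type = "ef",
                            existing_only = FALSE)

# Keep only the ids whose cache file is reported as missing.
needs_caching <- unique(status$htid[!status$exists])
needs_caching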
# }