Takes a set of Hathi Trust IDs and reads their extracted features and associated (page- and volume-level) metadata into memory or into an arrow Dataset. A typical workflow with this package involves selecting an appropriate set of Hathi Trust IDs (via workset_builder), downloading their Extracted Features files to your local machine (via rsync_from_hathi), caching these slow-to-load JSON Extracted Features files to a faster-loading format using cache_htids, and then using read_cached_htids to read them into a single data frame or arrow Dataset for further work.
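A minimal sketch of that workflow might look like the following (the search term and directory are illustrative, and each function has additional arguments documented on its own help page):

# Build a workset, download its EF files, cache them, then read them back
ids <- workset_builder("democracy")
rsync_from_hathi(ids, dir = "hathi-ef")
cache_htids(ids, dir = "hathi-ef")
efs <- read_cached_htids(ids, dir = "hathi-ef")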

read_cached_htids(
  htids,
  dir = getOption("hathiTools.ef.dir"),
  cache_type = c("ef", "meta", "pagemeta"),
  cache_format = getOption("hathiTools.cacheformat"),
  nest_char_count = FALSE
)

Arguments

htids

A character vector of Hathi Trust ids, a workset created with workset_builder, or a data frame with a column named "htid" containing the Hathi Trust ids to be read. If the Extracted Features files for these htids have not been downloaded to dir (via rsync_from_hathi or get_hathi_counts) and cached (via cache_htids), nothing will be read.
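For illustration, any of these forms of htids should work, assuming the corresponding files have already been cached (the ids below are the ones used in the Examples; the workset query is hypothetical):

# A plain character vector of Hathi Trust ids
read_cached_htids(c("mdp.39015008706338", "mdp.39015058109706"))

# A workset created with workset_builder
ws <- workset_builder("liberty")
read_cached_htids(ws)

# A data frame with an "htid" column
df <- data.frame(htid = c("mdp.39015008706338", "mdp.39015058109706"))
read_cached_htids(df)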

dir

The directory where the downloaded Extracted Features files are to be found. Defaults to getOption("hathiTools.ef.dir"), which is just "hathi-ef" on load.
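If your Extracted Features files live elsewhere, you can either pass dir explicitly or set the option once per session (the path below is illustrative):

options(hathiTools.ef.dir = "~/hathi-ef")
efs <- read_cached_htids(htids)  # now looks in ~/hathi-ef by default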

cache_type

Type of information cached. The default is c("ef", "meta", "pagemeta"), which refers to the extracted features, the volume metadata, and the page metadata. Omitting one of these reads only the rest (e.g., cache_type = "ef" reads only the EF files, not their associated volume or page metadata).
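For example, to read the extracted features together with the volume metadata but not the page metadata (assuming both caches exist):

efs_meta <- read_cached_htids(htids, cache_type = c("ef", "meta"))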

cache_format

File format of the cache for Extracted Features files. Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on load. Allowed cache formats are: compressed csv (the default), "none" (no local caching; only the downloaded JSON file is kept), "rds", "feather" and "parquet" (suitable for use with arrow; requires the arrow package to be installed), or "text2vec.csv" (a csv suitable for use with the package text2vec).
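For example, if the files were cached as parquet, you would read them back from that format (a sketch; the arrow package must be installed for the "feather" and "parquet" formats):

efs_parquet <- read_cached_htids(htids, cache_format = "parquet")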

nest_char_count

Whether to turn the sectionBeginCharCount and sectionEndCharCount columns in the page metadata into a nested tibble column. The default is FALSE, in which case the counts of characters at the beginning and end of each line are left as a JSON-formatted string (which can in turn be transformed into a tibble manually).
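For example, a sketch of reading the page metadata with nested character counts (assuming the "pagemeta" cache exists):

pagemeta <- read_cached_htids(htids, cache_type = c("ef", "pagemeta"),
                              nest_char_count = TRUE)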

Value

A tibble with the extracted features, plus the desired (volume-level or page-level) metadata, or an arrow Dataset.

Examples

# \donttest{
htids <- c("mdp.39015008706338", "mdp.39015058109706")
dir <- tempdir()

# Download and cache files first:

cache_htids(htids, dir = dir, cache_type = "ef", attempt_rsync = TRUE)
#> 2 HTIDs have already been cached to csv.gz format.
#> All existing JSON files already cached to required formats.
#> # A tibble: 2 × 5
#>   htid               local_loc                            cache…¹ cache…² exists
#>   <chr>              <glue>                               <chr>   <chr>   <lgl> 
#> 1 mdp.39015008706338 /tmp/RtmpdUz0R0/mdp/31003/mdp.39015… csv.gz  ef      TRUE  
#> 2 mdp.39015058109706 /tmp/RtmpdUz0R0/mdp/31500/mdp.39015… csv.gz  ef      TRUE  
#> # … with abbreviated variable names ¹​cache_format, ²​cache_type

# Now read them into memory:

efs <- read_cached_htids(htids, dir = dir)
efs
#> # A tibble: 430,772 × 6
#>    htid               token POS   count section  page
#>    <chr>              <chr> <chr> <int> <chr>   <int>
#>  1 mdp.39015008706338 E     UNK       3 body        2
#>  2 mdp.39015008706338 s     UNK       1 body        2
#>  3 mdp.39015008706338 .     UNK       1 body        2
#>  4 mdp.39015008706338 N     UNK       2 body        2
#>  5 mdp.39015008706338 IllE  UNK       1 body        2
#>  6 mdp.39015008706338 ::    UNK       1 body        2
#>  7 mdp.39015008706338 |     UNK       3 body        2
#>  8 mdp.39015008706338 -'    UNK       1 body        2
#>  9 mdp.39015008706338 -     UNK       6 body        2
#> 10 mdp.39015008706338 #     UNK       1 body        2
#> # … with 430,762 more rows

# }
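Since the result is an ordinary tibble, you can continue with standard tools; for example, a quick token count per volume (a sketch using dplyr, not part of the package's own examples):

library(dplyr)
efs |>
  group_by(htid) |>
  summarise(total_tokens = sum(count))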