Given a single Hathi Trust ID, this function returns a tibble with per-page word counts and part-of-speech information for that item, and caches the result in the directory given by getOption("hathiTools.ef.dir") (by default "./hathi-ef"). If the file has not already been cached, the function first attempts to download it directly from the Hathi Trust server. The function uses code written by Ben Schmidt for his Hathidy package (https://github.com/HumanitiesDataAnalysis/hathidy).

get_hathi_counts(
  htid,
  dir = getOption("hathiTools.ef.dir"),
  cache_format = getOption("hathiTools.cacheformat")
)

Arguments

htid

The Hathi Trust ID of the item whose Extracted Features file is to be loaded into memory. If the file has not been downloaded yet, the function tries to download it first.

dir

The directory where the downloaded Extracted Features files are to be found. Defaults to getOption("hathiTools.ef.dir"), which is "hathi-ef" on load.

cache_format

File format of the cache for Extracted Features files. Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on load. Allowed values are "csv.gz" (compressed csv, the default), "none" (no local cache is written; only the downloaded JSON file is kept), "rds", "feather" and "parquet" (suitable for use with arrow; need the arrow package installed), and "text2vec.csv" (a csv suitable for use with the text2vec package).
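
Because both defaults are read with getOption() when the function is called, they can be changed once per session instead of being passed on every call. The following is a minimal sketch in base R. The option names and the list of allowed formats are taken from the argument descriptions above, while the temporary directory and the choice of "rds" are only illustrative assumptions.

```r
# Cache formats listed in the cache_format description above.
allowed_formats <- c("csv.gz", "none", "rds", "feather", "parquet", "text2vec.csv")

# Set both options at once; options() returns the previous values invisibly,
# so they can be restored later.
old <- options(
  hathiTools.ef.dir = file.path(tempdir(), "hathi-ef"),  # illustrative location
  hathiTools.cacheformat = "rds"                         # illustrative choice
)

# Read the values back the same way the default arguments do.
ef_dir <- getOption("hathiTools.ef.dir")
fmt <- getOption("hathiTools.cacheformat")

# Guard against typos in the format name.
stopifnot(fmt %in% allowed_formats)

# Restore the previous settings when done.
options(old)
```

While the options are set, a call such as get_hathi_counts(htid) with no further arguments should use them, since the defaults in the usage above are evaluated at call time.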

Value

A tibble with the extracted features, with the columns htid, token, POS, count, section and page (see the example output below).
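
As a rough illustration of how a result with these columns might be summarised, the sketch below computes the total token count per page and drops punctuation rows. It uses base R and a small hand-made data frame as a stand-in for the returned tibble: the column names and the htid follow the example output on this page, but the token values and counts are invented for illustration and are not real Extracted Features data.

```r
# Hand-made stand-in with the same six columns as the returned tibble.
# Token values and counts are invented for illustration only.
counts <- data.frame(
  htid = "aeu.ark:/13960/t3qv43c3w",
  token = c("the", "of", ".", "the", "people"),
  POS = c("DT", "IN", ".", "DT", "NNS"),
  count = c(3L, 2L, 4L, 5L, 1L),
  section = "body",
  page = c(1L, 1L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Total tokens per page: sum the count column within each page.
page_totals <- aggregate(count ~ page, data = counts, FUN = sum)
page_totals

# Keep only rows whose part-of-speech tag is not the sentence-final
# punctuation tag ".".
words_only <- counts[counts$POS != ".", ]
words_only
```

The same operations could be written with dplyr (group_by() and summarise()) on the actual tibble; base R is used here only to keep the sketch free of extra dependencies.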

Author

Ben Schmidt

Examples

# \donttest{
# Download the 1863 version of "Democracy in America" by Tocqueville and get
# its extracted features

tmp <- tempdir()

get_hathi_counts("aeu.ark:/13960/t3qv43c3w", dir = tmp)
#> Now caching EF file for aeu.ark:/13960/t3qv43c3w
#> # A tibble: 137,182 × 6
#>    htid                     token  POS   count section  page
#>    <chr>                    <chr>  <chr> <int> <chr>   <int>
#>  1 aeu.ark:/13960/t3qv43c3w "iJi"  NN        1 body        1
#>  2 aeu.ark:/13960/t3qv43c3w ".8"   CD        1 body        1
#>  3 aeu.ark:/13960/t3qv43c3w "11.6" CD        1 body        1
#>  4 aeu.ark:/13960/t3qv43c3w "."    .         2 body        1
#>  5 aeu.ark:/13960/t3qv43c3w "33"   CD        1 body        1
#>  6 aeu.ark:/13960/t3qv43c3w "U"    NNP       1 body        1
#>  7 aeu.ark:/13960/t3qv43c3w "TEST" NN        1 body        1
#>  8 aeu.ark:/13960/t3qv43c3w "\\"   SYM       2 body        1
#>  9 aeu.ark:/13960/t3qv43c3w "MT-3" NN        1 body        1
#> 10 aeu.ark:/13960/t3qv43c3w "J2"   NN        1 body        1
#> # … with 137,172 more rows

# }