R/hathi-ef-tools.R
get_hathi_counts.Rd
Given a single Hathi Trust ID, this function returns a
tibble with its per-page word counts and part-of-speech
information, and caches the results in the getOption("hathiTools.ef.dir")
directory (by default "./hathi-ef"). If the file has not been cached already,
the function first attempts to download it directly from the Hathi Trust server. This
function uses code authored by Ben Schmidt, from his Hathidy package
(https://github.com/HumanitiesDataAnalysis/hathidy).
The Hathi Trust ID of the item whose extracted features files are to be loaded into memory. If the item has not been downloaded yet, the function will try to download it first.
The directory where the downloaded extracted features files are to
be found. Defaults to getOption("hathiTools.ef.dir"), which is just
"hathi-ef" on load.
File format of the cache for Extracted Features files.
Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on
load. Allowed cache types are: compressed csv ("csv.gz", the default), "none" (no
local caching of the JSON download; only the JSON file is kept), "rds", "feather" and
"parquet" (suitable for use with arrow; these need the arrow package
installed), or "text2vec.csv" (a csv suitable for use with the package
text2vec).
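Because both defaults are read with getOption(), they can presumably be changed once per session instead of being passed on every call. The following is a minimal sketch using only base R; the temporary directory and the choice of "rds" are illustrative, and it assumes the function looks the options up at call time, as the defaults described above suggest.

```r
# Set the two options named in this page for the current session.
# options() returns the previous values, which are kept for restoring later.
old <- options(
  hathiTools.ef.dir      = file.path(tempdir(), "hathi-ef"),  # illustrative path
  hathiTools.cacheformat = "rds"                              # one of the allowed types
)

# Inspect the values now in effect.
getOption("hathiTools.ef.dir")
getOption("hathiTools.cacheformat")

# Restore whatever was set before.
options(old)
```

Saving and restoring the previous values keeps the change from leaking into other code that runs in the same session.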
A tibble with the extracted features.
# \donttest{
# Download the 1863 version of "Democracy in America" by Tocqueville and get
# its extracted features
tmp <- tempdir()
get_hathi_counts("aeu.ark:/13960/t3qv43c3w", dir = tmp)
#> Now caching EF file for aeu.ark:/13960/t3qv43c3w
#> # A tibble: 137,182 × 6
#> htid token POS count section page
#> <chr> <chr> <chr> <int> <chr> <int>
#> 1 aeu.ark:/13960/t3qv43c3w "iJi" NN 1 body 1
#> 2 aeu.ark:/13960/t3qv43c3w ".8" CD 1 body 1
#> 3 aeu.ark:/13960/t3qv43c3w "11.6" CD 1 body 1
#> 4 aeu.ark:/13960/t3qv43c3w "." . 2 body 1
#> 5 aeu.ark:/13960/t3qv43c3w "33" CD 1 body 1
#> 6 aeu.ark:/13960/t3qv43c3w "U" NNP 1 body 1
#> 7 aeu.ark:/13960/t3qv43c3w "TEST" NN 1 body 1
#> 8 aeu.ark:/13960/t3qv43c3w "\\" SYM 2 body 1
#> 9 aeu.ark:/13960/t3qv43c3w "MT-3" NN 1 body 1
#> 10 aeu.ark:/13960/t3qv43c3w "J2" NN 1 body 1
#> # … with 137,172 more rows
# }
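The returned table has one row per token, part-of-speech tag, section and page, so per-page totals come from summing the count column. The sketch below uses base R on a small hand-made data frame that has the same columns as the output above; the rows are invented for illustration and are not taken from the actual volume.

```r
# Toy data with the same columns as the output of get_hathi_counts().
# The token values and counts here are made up for illustration only.
counts <- data.frame(
  htid    = "aeu.ark:/13960/t3qv43c3w",
  token   = c("the", "of", "democracy", "the", "America"),
  POS     = c("DT", "IN", "NN", "DT", "NNP"),
  count   = c(3L, 2L, 1L, 4L, 1L),
  section = "body",
  page    = c(1L, 1L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Total number of tokens on each page: sum `count` within each `page`.
page_totals <- aggregate(count ~ page, data = counts, FUN = sum)
page_totals

# Total number of tokens for each part-of-speech tag across all pages.
pos_totals <- aggregate(count ~ POS, data = counts, FUN = sum)
pos_totals
```

The same aggregation should work on the real tibble, since a tibble is also a data frame; users of dplyr may prefer a group_by() and summarise() pipeline instead.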