R/hathi-ef-tools.R
get_hathi_page_meta.Rd
Given a single Hathi Trust ID, this function returns a tibble with its page-level metadata. If the HT EF file corresponding to this ID has not already been downloaded, the function first attempts to download it directly from the Hathi Trust server. This function uses code authored by Ben Schmidt, from his Hathidy package (https://github.com/HumanitiesDataAnalysis/hathidy).
The Hathi Trust id of the item whose metadata is to be read.
The directory where the JSON file for the extracted features is saved. Defaults to getOption("hathiTools.ef.dir"), which is just "./hathi-ef/" on load. If the file does not exist, this function will first attempt to download it.
Format of metadata. The default is "rds"; can also be "csv.gz", "csv", and "feather" (requires the arrow package).
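Because the dir default is read from the hathiTools.ef.dir option, the cache location can presumably be set once per session instead of on every call. The following is a minimal sketch using only base R's options(); the path is an arbitrary illustration, and hathiTools itself is not loaded here.

```r
# Point the option at a session-specific folder (illustrative path only).
ef_dir <- file.path(tempdir(), "hathi-ef")
old_opts <- options(hathiTools.ef.dir = ef_dir)

# Any function whose `dir` argument defaults to
# getOption("hathiTools.ef.dir") would now pick up this value.
current_dir <- getOption("hathiTools.ef.dir")

# Restore the previous setting when done.
options(old_opts)
```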
A tibble with the page-level metadata for the corresponding Hathi Trust ID. The page-level metadata contains the following fields (taken from https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329):
The sequence number of the page in the volume. Corresponds to the digital object, so that the first scan in the volume is "00000001", which may be the cover, a title page, or something else.
A hash of the page content used to compute the features for the page. Volumes in HathiTrust may be updated to improve scan or OCR quality or correct an issue, which would cause the text data to change, and, if features are reprocessed, a new hash would result.
The most probable language of the text on the page. Determined algorithmically, and specified by language codes. Will be NA if no language detected, or if the language was not recognized by the algorithm.
The total number of tokens detected on the page.
The total number of lines of text detected on the page.
The total number of empty lines on the page.
The total number of sentences detected on the page.
The section of the page.
The total number of tokens detected in the section of the page.
The total number of lines detected in the section of the page.
The total number of empty lines detected in the section of the page.
The total number of sentences detected in the section of the page.
The length of the longest alphabetical sequence of capital characters starting a line. Only available for the "body" section.
A JSON-formatted character column with the first non-whitespace characters detected on lines in the section.
A JSON-formatted character column with the last non-whitespace characters detected on lines in the section.
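The two JSON-formatted columns need to be parsed before their counts can be used. The sketch below assumes each cell holds a flat JSON object mapping a character to the number of lines that begin (or end) with it; the example string is made up for illustration, and the jsonlite package is an assumption, not something this function is documented to require.

```r
library(jsonlite)

# Hypothetical cell value, standing in for one entry of sectionBeginCharCount.
cell <- '{"T": 3, "a": 1}'

# fromJSON() turns a flat JSON object into a named list; unlist() then
# gives a named vector of counts keyed by character.
counts <- unlist(fromJSON(cell))

# The character that most often starts a line in this (made-up) section.
most_common <- names(counts)[which.max(counts)]
```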
Note that if you want to extract the page-level metadata of more than one Hathi Trust ID at a time, it may be best to first download the JSON files for these HTIDs using rsync_from_hathi and then run this function.
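Since this function takes a single ID, reading several volumes means calling it once per ID and stacking the results. The helper below is a hypothetical sketch, not part of hathiTools. It assumes that rsync_from_hathi accepts a character vector of HTIDs plus a dir argument, and that every per-volume result has the same columns.

```r
# Hypothetical helper: read page-level metadata for several HTIDs and stack
# the per-volume results into one data frame. `read_one` defaults to
# get_hathi_page_meta but can be swapped out, e.g. for testing offline.
read_many_page_meta <- function(htids, dir,
                                read_one = hathiTools::get_hathi_page_meta) {
  pieces <- lapply(htids, function(id) read_one(id, dir = dir))
  do.call(rbind, pieces)
}

# Intended use (needs hathiTools, rsync, and network access; not run here):
# htids <- c("mdp.39015001796443", "<another HTID>")
# rsync_from_hathi(htids, dir = tmp)  # assumed call shape: (htids, dir)
# all_meta <- read_many_page_meta(htids, dir = tmp)
```

Downloading everything in one rsync call first means the per-ID loop only reads local files.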
# \donttest{
# Download the 1862 version of "Democracy in America" by Tocqueville and get
# its page-level metadata
tmp <- tempdir()
get_hathi_page_meta("mdp.39015001796443", dir = tmp)
#> # A tibble: 1,023 × 17
#>    htid       page seq   version token…¹ lineC…² empty…³ sente…⁴ calcu…⁵ secti…⁶
#>    <chr>     <int> <chr> <chr>     <int>   <int>   <int>   <int> <chr>     <int>
#>  1 mdp.3901…     4 0000… 6d5f66…       1       2       1      NA NA            1
#>  2 mdp.3901…     7 0000… 29c67b…      74      15       0      NA NA           74
#>  3 mdp.3901…     8 0000… 4e3280…      49       8       0       2 en           49
#>  4 mdp.3901…     9 0000… 0d4614…     189      24       0      10 en          189
#>  5 mdp.3901…    10 0000… 6ad989…     274      32       0       8 en            7
#>  6 mdp.3901…    10 0000… 6ad989…     274      32       0       8 en          267
#>  7 mdp.3901…    11 0000… 67d1c6…     287      33       0      10 en            6
#>  8 mdp.3901…    11 0000… 67d1c6…     287      33       0      10 en          281
#>  9 mdp.3901…    12 0000… 8863d4…      80       9       0       4 en            7
#> 10 mdp.3901…    12 0000… 8863d4…      80       9       0       4 en           73
#> # … with 1,013 more rows, 7 more variables: sectionLineCount <int>,
#> # sectionEmptyLineCount <int>, sectionSentenceCount <int>,
#> # sectionCapAlphaSeq <int>, sectionBeginCharCount <chr>,
#> # sectionEndCharCount <chr>, section <chr>, and abbreviated variable names
#> # ¹tokenCount, ²lineCount, ³emptyLineCount, ⁴sentenceCount,
#> # ⁵calculatedLanguage, ⁶sectionTokenCount
# }
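In the output above, a page with more than one section appears on several rows: the page-level counts repeat while the section-level counts differ. The sketch below rebuilds rows 5 to 10 of that output by hand, keeping only three columns, and checks with base R that the section token counts add up to the page token count. It illustrates the row layout and is not a function provided by the package.

```r
# Rows 5-10 of the printed output, reduced to three columns.
meta <- data.frame(
  page              = c(10L, 10L, 11L, 11L, 12L, 12L),
  tokenCount        = c(274L, 274L, 287L, 287L, 80L, 80L),
  sectionTokenCount = c(7L, 267L, 6L, 281L, 7L, 73L)
)

# Sum the section-level counts within each page (tokenCount is constant
# within a page, so grouping on it just carries it along).
per_page <- aggregate(sectionTokenCount ~ page + tokenCount,
                      data = meta, FUN = sum)
per_page <- per_page[order(per_page$page), ]

# TRUE where the sections account for every token on the page.
per_page$consistent <- per_page$sectionTokenCount == per_page$tokenCount
```

When only page-level fields are needed, keeping one row per page (for example with unique() on the page-level columns) avoids double counting.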