Reads the page-level metadata of a single Hathi Trust Extracted Features file

Given a single Hathi Trust ID, this function returns a tibble with its page-level metadata. If the HT EF file corresponding to this ID has not already been downloaded, the function first attempts to download it directly from the Hathi Trust server. This function uses code authored by Ben Schmidt, from his Hathidy package (https://github.com/HumanitiesDataAnalysis/hathidy).

get_hathi_page_meta(
  htid,
  dir = getOption("hathiTools.ef.dir"),
  cache_format = getOption("hathiTools.cacheformat")
)

Arguments

htid: The Hathi Trust id of the item whose metadata is to be read.
dir: The directory where the Extracted Features JSON file is saved. Defaults to getOption("hathiTools.ef.dir"), which is "./hathi-ef/" on load. If the file does not exist there, this function first attempts to download it.
cache_format: Format of the cached metadata. The default is "rds"; can also be "csv.gz", "csv", or "feather" (requires the arrow package).
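
Because both defaults are read from options, they can be set once per session instead of being passed on every call. A minimal sketch using only base R, assuming the options are set after hathiTools has been loaded; the temporary directory and the choice of "csv.gz" are illustrative:

```r
# Point hathiTools at a session-specific directory and choose a cache format.
# The option names are the ones used in the argument defaults above.
old <- options(
  hathiTools.ef.dir = file.path(tempdir(), "hathi-ef"),
  hathiTools.cacheformat = "csv.gz"
)

# Later calls that rely on the defaults would pick these values up.
getOption("hathiTools.ef.dir")
getOption("hathiTools.cacheformat")
```

Calling options(old) afterwards restores the previous settings.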

Value

A tibble with the page-level metadata for the corresponding Hathi Trust ID. The page-level metadata contains the following fields (taken from https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329):

seq: The sequence number of the page in the volume. Corresponds to the digital object, so that the first scan in the volume is "00000001", which may be the cover, a title page, or something else.
version: A hash of the page content used to compute the features for the page. Volumes in HathiTrust may be updated to improve scan or OCR quality or to correct an issue, which changes the text data; if the features are then reprocessed, a new hash results.
calculatedLanguage: The most probable language of the text on the page. Determined algorithmically and specified by a language code. Will be NA if no language was detected, or if the language was not recognized by the algorithm.
tokenCount: The total number of tokens detected on the page.
lineCount: The total number of lines of text detected on the page.
emptyLineCount: The total number of empty lines on the page.
sentenceCount: The total number of sentences detected on the page.
section: The section of the page ("header", "body", or "footer"). A page with text in more than one section appears as one row per section.
sectionTokenCount: The total number of tokens detected in the section of the page.
sectionLineCount: The total number of lines detected in the section of the page.
sectionEmptyLineCount: The total number of empty lines detected in the section of the page.
sectionSentenceCount: The total number of sentences detected in the section of the page.
sectionCapAlphaSeq: The length of the longest alphabetical sequence of capital characters starting a line. Only available for the "body" section.
sectionBeginCharCount: A JSON-formatted character column with counts of the first non-whitespace characters detected on lines in the section.
sectionEndCharCount: A JSON-formatted character column with counts of the last non-whitespace characters detected on lines in the section.
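
Page-level fields such as tokenCount are repeated on every section row of a page, while the section* fields vary by section. The base R sketch below illustrates the difference on a toy data frame built from the token counts for pages 10 to 12 in the example output on this page; the section labels are assumed for illustration, since the printout does not show that column.

```r
# Toy rows copied from the example output (pages 10 to 12). The section
# labels are an assumption; the printed tibble hides the `section` column.
meta <- data.frame(
  page = c(10L, 10L, 11L, 11L, 12L, 12L),
  tokenCount = c(274L, 274L, 287L, 287L, 80L, 80L),
  sectionTokenCount = c(7L, 267L, 6L, 281L, 7L, 73L),
  section = rep(c("header", "body"), times = 3)
)

# One row per page: keep only page-level columns and de-duplicate.
pages <- unique(meta[, c("page", "tokenCount")])

# Sum the section-level counts per page and compare with the page totals.
section_totals <- aggregate(sectionTokenCount ~ page, data = meta, FUN = sum)
check <- merge(pages, section_totals, by = "page")
check$matches <- check$tokenCount == check$sectionTokenCount
check
```

Summing a page-level column directly over all rows would double count pages that have more than one section, so de-duplicate by page first when computing volume totals.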

Details

Note that if you want to extract the page-level metadata of more than one Hathi Trust ID at a time, it may be best to first download the JSON files for these HTIDs using rsync_from_hathi and then run this function on each of them.
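
The workflow described above could be sketched as follows. The helper name read_many_page_meta is hypothetical (it is not part of hathiTools), the call rsync_from_hathi(htids, dir = ...) is an assumed signature, and the sketch assumes rsync is installed and that every volume returns the same columns.

```r
# Hypothetical helper: fetch the EF files for several HTIDs in one rsync
# call, then read the page-level metadata of each volume from `dir`.
read_many_page_meta <- function(htids, dir = tempdir()) {
  # Assumed signature: rsync_from_hathi(htids, dir = ...); requires rsync.
  hathiTools::rsync_from_hathi(htids, dir = dir)
  # Each call should now find its JSON file in `dir` instead of downloading.
  metas <- lapply(htids, hathiTools::get_hathi_page_meta, dir = dir)
  # Stack the per-volume tibbles row-wise (assumes identical columns).
  do.call(rbind, metas)
}

# Usage (needs network access, rsync, and the hathiTools package):
# all_meta <- read_many_page_meta(c("mdp.39015001796443"))
```

The htid column of the combined result identifies which volume each row came from.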

Author

Ben Schmidt

Xavier Marquez

Examples

# Download the 1862 version of "Democracy in America" by Tocqueville and get
# its page-level metadata

tmp <- tempdir()

get_hathi_page_meta("mdp.39015001796443", dir = tmp)
#> # A tibble: 1,023 × 17
#>    htid       page seq   version token…¹ lineC…² empty…³ sente…⁴ calcu…⁵ secti…⁶
#>    <chr>     <int> <chr> <chr>     <int>   <int>   <int>   <int> <chr>     <int>
#>  1 mdp.3901…     4 0000… 6d5f66…       1       2       1      NA NA            1
#>  2 mdp.3901…     7 0000… 29c67b…      74      15       0      NA NA           74
#>  3 mdp.3901…     8 0000… 4e3280…      49       8       0       2 en           49
#>  4 mdp.3901…     9 0000… 0d4614…     189      24       0      10 en          189
#>  5 mdp.3901…    10 0000… 6ad989…     274      32       0       8 en            7
#>  6 mdp.3901…    10 0000… 6ad989…     274      32       0       8 en          267
#>  7 mdp.3901…    11 0000… 67d1c6…     287      33       0      10 en            6
#>  8 mdp.3901…    11 0000… 67d1c6…     287      33       0      10 en          281
#>  9 mdp.3901…    12 0000… 8863d4…      80       9       0       4 en            7
#> 10 mdp.3901…    12 0000… 8863d4…      80       9       0       4 en           73
#> # … with 1,013 more rows, 7 more variables: sectionLineCount <int>,
#> #   sectionEmptyLineCount <int>, sectionSentenceCount <int>,
#> #   sectionCapAlphaSeq <int>, sectionBeginCharCount <chr>,
#> #   sectionEndCharCount <chr>, section <chr>, and abbreviated variable names
#> #   ¹tokenCount, ²lineCount, ³emptyLineCount, ⁴sentenceCount,
#> #   ⁵calculatedLanguage, ⁶sectionTokenCount
