Given a single Hathi Trust ID, this function returns a tibble with its page-level metadata. If the HT EF (extracted features) file corresponding to this ID has not already been downloaded, the function first attempts to download it directly from the Hathi Trust server. This function uses code authored by Ben Schmidt, from his Hathidy package (https://github.com/HumanitiesDataAnalysis/hathidy).

get_hathi_page_meta(
  htid,
  dir = getOption("hathiTools.ef.dir"),
  cache_format = getOption("hathiTools.cacheformat")
)

Arguments

htid

The Hathi Trust id of the item whose metadata is to be read.

dir

The directory where the JSON file for the extracted features is saved. Defaults to getOption("hathiTools.ef.dir"), which is set to "./hathi-ef/" on load. If the file does not exist in this directory, the function will first attempt to download it.

cache_format

Format in which the metadata is cached. The default is "rds"; "csv.gz", "csv", and "feather" (requires the arrow package) are also accepted.
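
Because both defaults are read with getOption(), they can be changed for a whole session with base R's options(). The following is a minimal sketch, using the option names shown in the usage above. The directory and the choice of "csv.gz" are only illustrative, and the sketch assumes that later calls to this function pick up the new values.

```r
# Set both options, remembering the previous values so they can be restored.
old <- options(
  hathiTools.ef.dir = file.path(tempdir(), "hathi-ef"),  # illustrative location
  hathiTools.cacheformat = "csv.gz"                      # one of the formats listed above
)

# These are the values the default arguments would now resolve to.
ef_dir <- getOption("hathiTools.ef.dir")
cache_format <- getOption("hathiTools.cacheformat")

# Restore the previous settings when done.
options(old)
```

Passing dir and cache_format explicitly in a call, as in the example at the end of this page, overrides the options for that call only.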

Value

A tibble with the page-level metadata for the corresponding Hathi Trust ID. The page-level metadata contains the following fields (taken from https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329):

seq

The sequence number of the page in the volume. Corresponds to the digital object, so that the first scan in the volume is "00000001", which may be the cover, a title page, or something else.

version

A hash of the page content used to compute the features for the page. Volumes in HathiTrust may be updated to improve scan or OCR quality or correct an issue, which would cause the text data to change, and, if features are reprocessed, a new hash would result.

calculatedLanguage

The most probable language of the text on the page. Determined algorithmically, and specified by language codes. Will be NA if no language was detected, or if the language was not recognized by the algorithm.

tokenCount

The total number of tokens detected on the page.

lineCount

The total number of lines of text detected on the page.

emptyLineCount

The total number of empty lines on the page.

sentenceCount

The total number of sentences detected on the page.

section

The section of the page that the row describes ("header", "body", or "footer").

sectionTokenCount

The total number of tokens detected in the section of the page.

sectionLineCount

The total number of lines detected in the section of the page.

sectionEmptyLineCount

The total number of empty lines detected in the section of the page.

sectionSentenceCount

The total number of sentences detected in the section of the page.

sectionCapAlphaSeq

The length of the longest alphabetical sequence of capital characters starting a line. Only available for the "body" section.

sectionBeginCharCount

A JSON-formatted character column with the first non-whitespace characters detected on lines in the section.

sectionEndCharCount

A JSON-formatted character column with the last non-whitespace characters detected on lines in the section.
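
The example output at the end of this page suggests that a page with more than one section appears once per section, with the page-level columns repeated on each row. The sketch below is an illustration rather than part of the package. It copies three columns for pages 9 to 12 from that output into a plain data frame, reduces them to one row per page, and checks that the section token counts add up to the page total.

```r
# Rows copied from the example output on this page (pages 9 to 12),
# keeping only three of the seventeen columns.
page_meta <- data.frame(
  page = c(9L, 10L, 10L, 11L, 11L, 12L, 12L),
  tokenCount = c(189L, 274L, 274L, 287L, 287L, 80L, 80L),
  sectionTokenCount = c(189L, 7L, 267L, 6L, 281L, 7L, 73L)
)

# Page-level columns repeat across section rows, so keep the first row per page.
pages <- page_meta[!duplicated(page_meta$page), c("page", "tokenCount")]

# Sum the section-level counts within each page.
section_totals <- tapply(page_meta$sectionTokenCount, page_meta$page, sum)

# For these rows, the section counts add up to the page-level count.
all(section_totals == pages$tokenCount)
```

Dropping duplicates before summing page-level columns such as tokenCount avoids counting a page once per section.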

Details

If you want to extract the page-level metadata of more than one Hathi Trust ID at a time, it may be best to first download the JSON files for these HTIDs using rsync_from_hathi and then run this function on each ID.
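
One way to apply this advice is to loop over the IDs and stack the results. The sketch below is only an illustration: read_page_meta_many and fake_reader are hypothetical names that do not exist in the package, the second ID is a placeholder, and the commented-out call to rsync_from_hathi assumes that it accepts a vector of IDs and a dir argument (check its help page before relying on this).

```r
# Hypothetical helper: apply a reader function to each ID and stack the results.
read_page_meta_many <- function(htids, reader, ...) {
  results <- lapply(htids, reader, ...)
  do.call(rbind, results)
}

# Stand-in reader so the sketch runs offline; it returns one dummy row per ID.
fake_reader <- function(htid, ...) {
  data.frame(htid = htid, page = 1L, stringsAsFactors = FALSE)
}

# The first ID comes from the example on this page; the second is a placeholder.
ids <- c("mdp.39015001796443", "placeholder.0001")

combined <- read_page_meta_many(ids, fake_reader)

# With hathiTools loaded, the same pattern might look like this
# (argument names assumed, see ?rsync_from_hathi):
# rsync_from_hathi(ids, dir = tmp)
# combined <- read_page_meta_many(ids, get_hathi_page_meta, dir = tmp)
```

Because each result carries an htid column, the stacked table can still be split or grouped by volume afterwards.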

Author

Ben Schmidt

Xavier Marquez

Examples

# \donttest{
# Download the 1862 version of "Democracy in America" by Tocqueville and get
# its page-level metadata

tmp <- tempdir()

get_hathi_page_meta("mdp.39015001796443", dir = tmp)
#> # A tibble: 1,023 × 17
#>    htid       page seq   version token…¹ lineC…² empty…³ sente…⁴ calcu…⁵ secti…⁶
#>    <chr>     <int> <chr> <chr>     <int>   <int>   <int>   <int> <chr>     <int>
#>  1 mdp.3901…     4 0000… 6d5f66…       1       2       1      NA NA            1
#>  2 mdp.3901…     7 0000… 29c67b…      74      15       0      NA NA           74
#>  3 mdp.3901…     8 0000… 4e3280…      49       8       0       2 en           49
#>  4 mdp.3901…     9 0000… 0d4614…     189      24       0      10 en          189
#>  5 mdp.3901…    10 0000… 6ad989…     274      32       0       8 en            7
#>  6 mdp.3901…    10 0000… 6ad989…     274      32       0       8 en          267
#>  7 mdp.3901…    11 0000… 67d1c6…     287      33       0      10 en            6
#>  8 mdp.3901…    11 0000… 67d1c6…     287      33       0      10 en          281
#>  9 mdp.3901…    12 0000… 8863d4…      80       9       0       4 en            7
#> 10 mdp.3901…    12 0000… 8863d4…      80       9       0       4 en           73
#> # … with 1,013 more rows, 7 more variables: sectionLineCount <int>,
#> #   sectionEmptyLineCount <int>, sectionSentenceCount <int>,
#> #   sectionCapAlphaSeq <int>, sectionBeginCharCount <chr>,
#> #   sectionEndCharCount <chr>, section <chr>, and abbreviated variable names
#> #   ¹​tokenCount, ²​lineCount, ³​emptyLineCount, ⁴​sentenceCount,
#> #   ⁵​calculatedLanguage, ⁶​sectionTokenCount

# }