Given a single Hathi Trust ID, this function returns a tibble with its volume-level metadata. If the HT EF (Extracted Features) file corresponding to this ID has not already been downloaded, the function first attempts to download it directly from the Hathi Trust server. This function uses code authored by Ben Schmidt, from his Hathidy package (https://github.com/HumanitiesDataAnalysis/hathidy).

get_hathi_meta(
  htid,
  dir = getOption("hathiTools.ef.dir"),
  cache_format = getOption("hathiTools.cacheformat")
)

Arguments

htid

The Hathi Trust id of the item whose metadata is to be read.

dir

The directory where the JSON file for the extracted features is saved. Defaults to getOption("hathiTools.ef.dir"), which is just "./hathi-ef/" on load. If the file does not exist, this function will first attempt to download it.

cache_format

Format in which the metadata is cached. The default is "rds"; other options are "csv.gz", "csv", and "feather" (which requires the arrow package).
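Because both dir and cache_format default to package options, they can be set once per session instead of being passed on every call. The following is a minimal sketch using base R's options(); the directory path is an arbitrary illustration, and it assumes the options are set after the package has been loaded so that they are not overwritten on load.

```r
# Set session-wide defaults that get_hathi_meta() reads via getOption().
# The path below is an arbitrary example, not a package requirement.
ef_dir <- file.path(tempdir(), "hathi-ef")
old_opts <- options(
  hathiTools.ef.dir = ef_dir,
  hathiTools.cacheformat = "csv.gz"
)

# Later calls that omit `dir` and `cache_format` would pick these values up.
getOption("hathiTools.ef.dir")
getOption("hathiTools.cacheformat")

# Restore the previous values when done:
# options(old_opts)
```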

Value

A tibble with the volume-level metadata for the corresponding Hathi Trust ID. This tibble can contain the following fields (taken from https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329; if the field is NULL, it is not returned, so the metadata can contain fewer fields):

schemaVersion

The schema version for the metadata block of the Extracted Features file, given as a URL linking to the schema. E.g. "https://schemas.hathitrust.org/EF_Schema_MetadataSubSchema_v_3.0"

id

The Handle URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://hdl.handle.net/2027/mdp.39015062779023"

type

Either "Book", "PublicationVolume", or "CreativeWork" depending on the value found in the Issuance field of the corresponding Bibframe record. When the Bibframe Issuance value is 'mono', the Book type is assigned. When the value is 'serl', the PublicationVolume type is assigned. In all other cases the CreativeWork type is assigned. The Bibframe Issuance is derived from a variety of fields in the MARC record, including Control Fields and Physical Description Fields.

dateCreated

The date on which the metadata portion of the Extracted Features file was generated, in YYYYMMDD format.

title

The title of the volume when the type (above) is "Book" or "CreativeWork".

alternateTitle

An alternate title for a bibliographic entity described by the Extracted Features file.

enumerationChronology

Information about the volume, issue, and/or year in which the HathiTrust volume was published.

publisher

Information about the publisher of the volume described by the Extracted Features file. Includes type, name, and id. type is either "Organization" or "Person". id is a URL identifying the publisher, such as a Handle URL, ORCID, etc. name is derived from the Imprint field in the MARC record.

pubPlace

Information about where the volume was first published. Includes id, type, and name. type is taken from the Bibframe Instance's provisionActivity's place rdf:about node, which is derived from the country codes in the MARC 008 field.

pubDate

The year in which that edition of the volume was first published.

genre

Information about the volume's genre, as determined by the cataloger of the work. Values are derived from the Genre/Form Field in the MARC record.

category

The volume's topic or topics. Derived from the Bibframe Work's ClassificationLcc node. Represents the natural language label for the Library of Congress Classification (LCC) value based upon the Library of Congress's LCC standard documentation.

language

The cataloger-determined language or languages of the volume. Taken from the Bibframe Work's language's identifiedBy's value node, which is derived from the Language Code field in the MARC record. This may differ from the automatically detected language of any given page in the page-level metadata returned by get_hathi_page_meta.

accessRights

The copyright status of the volume. Corresponds to attributes in HathiTrust's accessRights database. Derived from a HathiTrust-local MARC field (974r) that is added to bibliographic records as they are processed at HathiTrust.

lastRightsUpdateDate

The most recent date the volume's copyright status was updated.

contributor

Contains information regarding the author(s), editor(s), or other agents involved in creating the volume. Consists of id, type, and name. id is a URL taken from the Bibframe agent that links to an authorities database (e.g., VIAF). type is either "Person" or "Organization", taken from the Bibframe agent type. name is the name of the person or organization who created the volume, taken from the Bibframe agent's label. Derived from a variety of fields in the MARC record, including Main Entry Fields, Title and Title-Related Fields, and Added Entry Fields.

typeOfResource

The cataloger-determined resource type of the volume (e.g., text, image, etc.).

sourceInstitution

An array containing information about the institution that contributed the volume to HathiTrust. Always has a type node and a name node. id is a URL identifying the source institution. name is the name of the source institution.

mainEntityOfPage

An array of URLs linking to various metadata records describing the volume represented by the Extracted Features file. The array typically contains 3 URLs that point to the HathiTrust Bibliographic API: HathiTrust brief bibliographic record, HathiTrust full bibliographic record, and the HathiTrust catalog record.

oclc

The OCLC number for the volume. An OCLC number is an identifier assigned to items as they are cataloged in a library.

lcc

The Library of Congress Classification number for the volume. An LCC number is a type of call number that would be used to locate an item on a library shelf.

lccn

The Library of Congress Control Number for the volume. An LCCN is a unique number that is assigned during cataloging.

issn

The ISSN of the volume (when a journal).

isbn

The ISBN of the volume (when a book).
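Because NULL fields are dropped, code that reads a specific column should not assume it exists. The sketch below shows one way to guard against missing fields and to convert the YYYYMMDD dateCreated value into a Date. The data frame is a hand-made stand-in with illustrative values, and get_field is a hypothetical helper, not part of the package.

```r
# Hand-made stand-in for a returned tibble; values are illustrative only.
meta <- data.frame(
  htid = "mdp.39015001796443",
  dateCreated = 20200101L,
  pubDate = 1862L,
  stringsAsFactors = FALSE
)

# Hypothetical helper: return a column if present, otherwise NA,
# so that absent fields do not raise an error.
get_field <- function(df, field) {
  if (field %in% names(df)) df[[field]] else rep(NA, nrow(df))
}

isbn <- get_field(meta, "isbn")         # field absent in this stand-in
pub_year <- get_field(meta, "pubDate")  # field present

# Convert the YYYYMMDD integer into a Date.
created <- as.Date(as.character(meta$dateCreated), format = "%Y%m%d")
```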

Details

Note that if you want to extract the metadata of more than one Hathi Trust ID at a time, it may be best to simply query the Workset Builder database using get_workset_meta, or to download the JSON files for these HTIDs first using rsync_from_hathi and then run cache_htids and read_cached_htids with the option cache_type = "meta". It is also possible to get simple metadata for large numbers of HTIDs by downloading the big hathifile using download_hathifile and then filtering it.
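The download-then-cache route described above might be sketched as follows. fetch_meta_for_many is a hypothetical wrapper, and the argument names other than cache_type are assumptions modeled on the signature of get_hathi_meta; check each function's help page before relying on them. Defining the wrapper does not contact the server; only calling it would.

```r
# Hypothetical helper wrapping the multi-ID workflow.
# Argument names other than cache_type are assumptions.
fetch_meta_for_many <- function(htids, dir = getOption("hathiTools.ef.dir")) {
  # 1. Download the EF JSON files for all IDs in one rsync call.
  hathiTools::rsync_from_hathi(htids, dir = dir)
  # 2. Cache only the volume-level metadata from each file.
  hathiTools::cache_htids(htids, dir = dir, cache_type = "meta")
  # 3. Read the cached metadata back.
  hathiTools::read_cached_htids(htids, dir = dir, cache_type = "meta")
}
```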

Author

Ben Schmidt

Xavier Marquez

Examples

# \donttest{
# Download the 1862 version of "Democracy in America" by Tocqueville and get
# its metadata

tmp <- tempdir()

get_hathi_meta("mdp.39015001796443", dir = tmp)
#> Now caching EF file for mdp.39015001796443
#> # A tibble: 1 × 23
#>   htid         schem…¹ id    type  dateC…² title contr…³ pubDate publi…⁴ pubPl…⁵
#>   <chr>        <chr>   <chr> <chr>   <int> <chr> <chr>     <int> <chr>   <chr>  
#> 1 mdp.3901500… https:… http… "[[\…  2.02e7 Demo… "[{\"i…    1862 "{\"id… "{\"id…
#> # … with 13 more variables: language <chr>, accessRights <chr>,
#> #   accessProfile <chr>, sourceInstitution <chr>, mainEntityOfPage <chr>,
#> #   lcc <chr>, lccn <chr>, oclc <chr>, category <chr>, genre <chr>,
#> #   enumerationChronology <chr>, typeOfResource <chr>,
#> #   lastRightsUpdateDate <int>, and abbreviated variable names ¹​schemaVersion,
#> #   ²​dateCreated, ³​contributor, ⁴​publisher, ⁵​pubPlace
# }
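In the printed output, columns such as contributor and publisher appear to hold JSON-encoded strings. Assuming that is the case, one way to unpack such a cell is jsonlite::fromJSON. The string below is a made-up example shaped like the truncated contributor cell; its id and name values are not taken from the actual record.

```r
library(jsonlite)

# Illustrative string shaped like the truncated `contributor` cell;
# the id and name values are made up for this sketch.
contributor_json <- paste0(
  '[{"id":"http://example.org/agent/1",',
  '"type":"Person","name":"Example, Author"}]'
)

# fromJSON() turns a JSON array of objects into a data frame
# with one row per object and one column per key.
contributor <- fromJSON(contributor_json)
contributor$name
```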