R/hathi-ef-tools.R
get_hathi_meta.Rd
Given a single Hathi Trust ID, this function returns a tibble with its volume-level metadata. If the HT Extracted Features (EF) file corresponding to this ID has not already been downloaded, the function first attempts to download it directly from the Hathi Trust server. This function uses code authored by Ben Schmidt, from his Hathidy package (https://github.com/HumanitiesDataAnalysis/hathidy).
The Hathi Trust id of the item whose metadata is to be read.
The directory where the JSON file for the extracted features is
saved. Defaults to getOption("hathiTools.ef.dir"), which is just
"./hathi-ef/" on load. If the file does not exist, this function will first
attempt to download it.
Format of metadata. The default is "rds"; can also be "csv.gz", "csv", and "feather" (requires the arrow package).
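As a minimal sketch of how the default directory could be changed, the snippet below sets the hathiTools.ef.dir option for the current session using only base R. The "hathi-ef" subfolder of tempdir() is an arbitrary choice made for illustration, not a package default.

```r
# Point the option at a throwaway folder and keep the previous value,
# so the original setting can be restored afterwards.
old <- options(hathiTools.ef.dir = file.path(tempdir(), "hathi-ef"))

# This is the value that a `dir` argument defaulting to
# getOption("hathiTools.ef.dir") would now pick up.
ef_dir <- getOption("hathiTools.ef.dir")
dir.create(ef_dir, showWarnings = FALSE, recursive = TRUE)

# Restore whatever was set before.
options(old)
```

Setting the option once at the top of a script avoids repeating the dir argument in every call.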
A tibble with the volume-level metadata for the
corresponding Hathi Trust ID. This tibble can contain the
following fields (taken from
https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329;
if the field is NULL, it is not returned, so the metadata can contain
fewer fields):
The schema version for the metadata block of the Extracted Features file. A URL linking to the schema. "https://schemas.hathitrust.org/Extracted Features_Schema_MetadataSubSchema_v_3.0"
The Handle URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://hdl.handle.net/2027/mdp.39015062779023"
Either "Book", "PublicationVolume", or "CreativeWork" depending on the value found in the Issuance field of the corresponding Bibframe record. When the Bibframe Issuance value is 'mono', the Book type is assigned. When the value is 'serl', the PublicationVolume type is assigned. In all other cases the CreativeWork type is assigned. The Bibframe Issuance is derived from a variety of fields in the MARC record, including Control Fields and Physical Description Fields.
The date on which the metadata portion of the Extracted Features file was generated, in YYYYMMDD format.
The title of the volume when the type (above) is "Book" or "CreativeWork".
An alternate title for a bibliographic entity described by the Extracted Features file.
Information regarding which volume, issue, and/or year the HathiTrust volume was published.
Information about the publisher of the volume described by
the Extracted Features file. Includes type, name, and id. type is either
"Organization" or "Person". id is a URL identifying the publisher, such as
a Handle URL, ORCID, etc. name is derived from the Imprint field in the
MARC record.
Information about where the volume was first published.
Includes id, type, and name. type is taken from the Bibframe Instance's
provisionActivity's place rdf:about node, which is derived from the country
codes in the MARC 008 field.
The year in which that edition of the volume was first published.
Information about the volume's genre, as determined by the cataloger of the work. Values are derived from the Genre/Form Field in the MARC record.
The volume's topic or topics. Derived from the Bibframe Work's ClassificationLcc node. Represents the natural language label for the Library of Congress Classification (LCC) value based upon the Library of Congress's LCC standard documentation.
The cataloger-determined language or languages of the volume. Taken from the Bibframe Work's language's identifiedBy's value node, which is derived from the Language Code field in the MARC record. This may differ from the automatically detected language of any given page in the page-level metadata returned by get_hathi_page_meta.
The copyright status of the volume. Corresponds to attributes in HathiTrust's accessRights database. Derived from a HathiTrust-local MARC field (974r) that is added to bibliographic records as they are processed at HathiTrust.
The most recent date the volume's copyright status was updated.
Contains information regarding the author(s), editor(s), or other agents involved in creating the volume. Consists of id, type, and name. id is a URL taken from the Bibframe agent that links to an authorities database (e.g., VIAF). type is either Person or Organization, taken from the Bibframe agent type. name is the name of the person or organization who created the volume, taken from the Bibframe agent's label. Derived from a variety of fields in the MARC record, including Main Entry Fields, Title and Title-Related Fields, and Added Entry Fields.
The cataloger-determined resource type of the volume (e.g., text, image, etc.).
An array containing information about the
institution that contributed the volume to HathiTrust. Always has a type
node and a name node. id is a URL identifying the source institution.
name is the name of the source institution.
An array of URLs linking to various metadata records describing the volume represented by the Extracted Features file. The array typically contains 3 URLs that point to the HathiTrust Bibliographic API: HathiTrust brief bibliographic record, HathiTrust full bibliographic record, and the HathiTrust catalog record.
The OCLC number for the volume. An OCLC number is an identifier assigned to items as they are cataloged in a library.
The Library of Congress Classification number for the volume. An LCC number is a type of call number that would be used to locate an item on a library shelf.
The Library of Congress Control Number for the volume. An LCCN is a unique number that is assigned during cataloging.
The ISSN of the volume (when a journal).
The ISBN of the volume (when a book).
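Two of the field descriptions above state rules that can be written out in code: the type assignment (Bibframe Issuance 'mono' gives Book, 'serl' gives PublicationVolume, anything else gives CreativeWork) and the YYYYMMDD date format. The following base R sketch illustrates both. The helper issuance_to_type is hypothetical and not part of the package, and the date value is made up for illustration.

```r
# Hypothetical helper (not a package function): apply the type rule
# described above to a vector of Bibframe Issuance values.
issuance_to_type <- function(issuance) {
  ifelse(issuance %in% "mono", "Book",
         ifelse(issuance %in% "serl", "PublicationVolume", "CreativeWork"))
}

# "intg" stands in here for any value other than 'mono' or 'serl'.
types <- issuance_to_type(c("mono", "serl", "intg"))

# A made-up YYYYMMDD value stored as an integer, converted to a Date.
date_created <- 20200209L
created <- as.Date(as.character(date_created), format = "%Y%m%d")
```

Converting the date to a Date object makes it possible to sort or filter volumes by when their metadata was generated.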
Note that if you want to extract the metadata of more than one Hathi Trust ID
at a time, it may be best to simply query the Workset Builder database using
get_workset_meta, or to download the JSON files for these HTIDs first using
rsync_from_hathi and then running cache_htids and read_cached_htids with
the option cache_type = "meta". It is also possible to get simple metadata
for large numbers of htids by downloading the big hathifile using
download_hathifile and then filtering it.
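The rsync-based workflow mentioned in this note might look roughly like the sketch below. The wrapper read_many_meta is hypothetical. The function names and cache_type = "meta" come from the note itself, but passing the HTIDs as the first argument and the folder as dir mirrors get_hathi_meta and is an assumption about the other functions' signatures, so check their help pages before relying on it. The workflow also needs network access, and presumably a working rsync installation.

```r
# Hypothetical wrapper around the three steps named in the note above.
# Assumes each function takes the HTIDs first and accepts a `dir` argument.
read_many_meta <- function(htids, dir = tempdir()) {
  if (!requireNamespace("hathiTools", quietly = TRUE)) {
    stop("The hathiTools package is required for this sketch.")
  }
  # 1. Download the EF JSON files for all HTIDs in one go.
  hathiTools::rsync_from_hathi(htids, dir = dir)
  # 2. Cache the volume-level metadata extracted from those files.
  hathiTools::cache_htids(htids, dir = dir, cache_type = "meta")
  # 3. Read the cached metadata back in.
  hathiTools::read_cached_htids(htids, dir = dir, cache_type = "meta")
}
```

Calling the wrapper with a character vector of HTIDs would then replace a loop over get_hathi_meta.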
# \donttest{
# Download the 1862 version of "Democracy in America" by Tocqueville and get
# its metadata
tmp <- tempdir()
get_hathi_meta("mdp.39015001796443", dir = tmp)
#> Now caching EF file for mdp.39015001796443
#> # A tibble: 1 × 23
#> htid schem…¹ id type dateC…² title contr…³ pubDate publi…⁴ pubPl…⁵
#> <chr> <chr> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr>
#> 1 mdp.3901500… https:… http… "[[\… 2.02e7 Demo… "[{\"i… 1862 "{\"id… "{\"id…
#> # … with 13 more variables: language <chr>, accessRights <chr>,
#> # accessProfile <chr>, sourceInstitution <chr>, mainEntityOfPage <chr>,
#> # lcc <chr>, lccn <chr>, oclc <chr>, category <chr>, genre <chr>,
#> # enumerationChronology <chr>, typeOfResource <chr>,
#> # lastRightsUpdateDate <int>, and abbreviated variable names ¹schemaVersion,
#> # ²dateCreated, ³contributor, ⁴publisher, ⁵pubPlace
# }
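In the printed output above, nested fields such as contributor and publisher appear to be stored as JSON-encoded character strings. A minimal sketch of unpacking such a value with jsonlite::fromJSON follows. The JSON string is made up for illustration and only mimics the id, type, and name structure described earlier. It is not real output from the package.

```r
library(jsonlite)

# A made-up value shaped like the `contributor` column: a JSON array of
# agents, each with an id, a type, and a name.
contributor_json <- paste0(
  '[{"id":"http://example.org/agent/1",',
  '"type":"Person","name":"Example, Author"}]'
)

# fromJSON() simplifies an array of objects into a data frame,
# with one row per agent.
contributor <- fromJSON(contributor_json)
contributor$name
```

A field holding a single JSON object rather than an array, which is how publisher looks in the output above, would come back from fromJSON as a named list instead of a data frame.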