Get metadata for a set of Hathi Trust IDs - get_workset_meta

Queries the SOLR endpoint of the Workset Builder 2.0 (beta) at https://solr2.htrc.illinois.edu/solr-ef/ to download volume metadata. This API is experimental, and so this function can fail at any time if the API changes.

get_workset_meta(
  workset,
  metadata_dir = getOption("hathiTools.metadata.dir"),
  cache = TRUE
)

Arguments

workset: A workset of htids, generated by workset_builder from Hathi Trust's SOLR endpoint. One can also pass a data frame with a column labeled "htid" and containing valid Hathi Trust htids, or a character vector of htids (though the function will complain with a warning).
metadata_dir: The directory used to cache the metadata file. Defaults to getOption("hathiTools.metadata.dir"), which is just "./metadata" on loading the package.
cache: Whether to cache the resulting metadata as a CSV. Default is TRUE. The name of the resulting metadata file is generated by appending an MD5 hash (via digest::digest) to the string "metadata-", so each distinct metadata request gets its own file, and a repeated request for the same htids returns the cached copy instead of downloading again.
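
The cache naming described for the cache argument can be sketched as follows. This is only an illustration, not the package's actual code: the helper name cache_file_name, the choice to hash the sorted unique htids, and the ".csv" extension are all assumptions. Only the "metadata-" prefix and the use of digest::digest come from the description above.

```r
# Hypothetical helper (not part of hathiTools): builds a cache file path
# by appending an MD5 hash to "metadata-".
# Assumption: the hash is computed over the sorted, de-duplicated htids.
cache_file_name <- function(htids, metadata_dir = "./metadata") {
  hash <- digest::digest(sort(unique(htids)), algo = "md5")
  file.path(metadata_dir, paste0("metadata-", hash, ".csv"))
}

# The same htids in a different order map to the same file name
# under the sorting assumption above.
a <- cache_file_name(c("mdp.39015062779023", "aeu.ark:/13960/t00z8277t"))
b <- cache_file_name(c("aeu.ark:/13960/t00z8277t", "mdp.39015062779023"))
identical(a, b)
```

Under this scheme a different set of htids yields a different hash, and therefore a separate cache file.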

Value

A tibble::tibble with the Hathi Trust metadata for all the volumes in the workset or the vector of htids. This tibble can contain the following fields (taken from https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329):

htBibUrl: The Hathi Trust Bibliographic URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://catalog.hathitrust.org/api/volumes/full/htid/aeu.ark:/13960/t00z8277t.json"
schemaVersion: The schema version for the metadata block of the Extracted Features file, given as a URL linking to the schema. E.g. "https://schemas.hathitrust.org/Extracted Features_Schema_MetadataSubSchema_v_3.0"
volumeIdentifier: The Hathi Trust ID
rightsAttributes: The copyright status of the volume.
title: The title of the volume.
genre: Information about the volume's genre, as determined by the cataloger of the work. Values are derived from the Genre/Form Field in the MARC record.
pubDate: The year in which that edition of the volume was first published.
pubPlace: Named list column. Information about where the volume was first published. Includes id, type, and name. type is taken from the Bibframe Instance's provisionActivity's place rdf:about node, which is derived from the country codes in the MARC 008 field.
typeOfResource: Type of resource, e.g., "text".
bibliographicFormat: Bibliographic format (e.g. "BK").
language: The cataloger-determined language or languages of the volume. Taken from the Bibframe Work's language's identifiedBy's value node, which is derived from the Language Code field in the MARC record. This may differ from the automatically detected language of any given page in the page-level metadata returned by get_hathi_page_meta.
dateCreated: The date on which the metadata portion of the Extracted Features file was generated, in YYYYMMDD format.
lastUpdateDate: The most recent date the volume's copyright status was updated.
imprint: Information about the publisher of the volume.
isbn: The ISBN of the volume (when a book).
issn: The ISSN of the volume (when a journal).
oclc: The OCLC number for the volume. An OCLC number is an identifier assigned to items as they are cataloged in a library.
lccn: The Library of Congress Control Number for the volume. An LCCN is a unique number that is assigned during cataloging.
classification: Library classification.
handleUrl: The Handle URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://hdl.handle.net/2027/mdp.39015062779023"
hathiTrustRecordNumber: The Hathi Trust Bibliographic record ID number.
sourceInstitutionRecordNumber: The source institution record ID number.
sourceInstitution: The source institution.
accessProfile: Type of access rights.
enumerationChronology: Information about which volume, issue, and/or year of a multi-part publication the HathiTrust volume represents.
governmentDocument: Whether the item is a government document.
names: Contains information regarding the author(s), editor(s), or other agents involved in creating the volume.
issuance: The cataloger-determined resource type of the volume (e.g., monographic, etc.).
subjectGenre, subjectName, subjectTitleInfo, subjectTemporal, subjectGeographic, subjectOccupation, subjectCartographics: Columns containing subject info, if present.
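
Because the tibble can contain these fields but is not guaranteed to include all of them, it may help to check which columns are present before selecting them. The sketch below uses a small mock data frame in place of a real get_workset_meta result. The titles are placeholders, and the column names follow the example output further down, where the ID column is called htid.

```r
# Mock stand-in for a get_workset_meta() result; values are illustrative only.
meta <- data.frame(
  htid = c("mdp.39015062779023", "aeu.ark:/13960/t00z8277t"),
  title = c("Volume A", "Volume B"),
  pubDate = c(1836, 1849),
  stringsAsFactors = FALSE
)

# Fields we would like to keep; some may be absent from a given download.
wanted <- c("htid", "title", "pubDate", "issn", "lccn")

present <- intersect(wanted, names(meta))        # columns that exist
missing_fields <- setdiff(wanted, names(meta))   # columns that do not

# Select only the available columns, keeping a data frame even for one column.
subset_meta <- meta[, present, drop = FALSE]
```

With a real result, dplyr::any_of(wanted) inside dplyr::select would do the same job in a tidyverse pipeline.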

Details

Be mindful that downloading a large number of metadata records can take quite some time. In practice I have found that downloading full metadata from more than about 1000 records is a dicey proposition; if you need metadata for many thousands of records, you are probably better off using the big hathifile (see download_hathifile and load_raw_hathifile).
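
For worksets somewhat above that size, one option is to split the htids into smaller batches and request each batch separately. The sketch below shows only the splitting step in base R. The helper name batch_htids and the batch size of 1000 are assumptions for illustration, and the IDs are placeholders rather than real htids. Batching does not make the API any faster or more reliable, so for many thousands of records the hathifile route is still likely the better choice.

```r
# Hypothetical helper (not part of hathiTools): split a character vector
# of htids into consecutive batches of at most `size` elements.
batch_htids <- function(htids, size = 1000) {
  split(htids, ceiling(seq_along(htids) / size))
}

# Placeholder IDs for illustration; these are not real htids.
htids <- sprintf("placeholder.%05d", 1:2500)
batches <- batch_htids(htids)
lengths(batches)

# Each batch could then be fetched and the results combined, e.g.
# (requires hathiTools, dplyr, and network access; not run here):
# meta <- dplyr::bind_rows(lapply(batches, get_workset_meta))
```

Note that passing character vectors triggers the warning mentioned under the workset argument, and each batch is cached under its own file name.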

Examples

# \donttest{
dir <- tempdir()
workset <- workset_builder(name = "Tocqueville")
get_workset_meta(workset[1:5, ], metadata_dir = dir)
#> Getting download key...
#> Downloading metadata for 5 volumes. This might take some time.
#> # A tibble: 5 × 16
#>   htid       acces…¹ acces…² url   title dateC…³ lastR…⁴ pubDate schem…⁵ typeO…⁶
#>   <chr>      <chr>   <chr>   <chr> <chr>   <dbl>   <dbl>   <dbl> <chr>   <chr>  
#> 1 mdp.39015… google  ic      http… Oeuv…  2.02e7  2.02e7    1991 https:… http:/…
#> 2 mdp.39015… google  pd      http… De l…  2.02e7  2.02e7    1836 https:… http:/…
#> 3 mdp.39015… google  ic      http… Demo…  2.02e7  2.02e7    2004 https:… http:/…
#> 4 nyp.33433… google  pd      http… Demo…  2.02e7  2.02e7    1847 https:… http:/…
#> 5 uva.x0004… google  pd      http… The …  2.02e7  2.02e7    1849 https:… http:/…
#> # … with 6 more variables: language <chr>, oclc <dbl>, genre <chr>,
#> #   contributor <chr>, publisher <chr>, pubPlace <chr>, and abbreviated
#> #   variable names ¹accessProfile, ²accessRights, ³dateCreated,
#> #   ⁴lastRightsUpdateDate, ⁵schemaVersion, ⁶typeOfResource

## We can also pass a vector of htids:
get_workset_meta(workset$htid[1:5], metadata_dir = dir)
#> Metadata has already been downloaded. Returning cached metadata.
#> # A tibble: 5 × 16
#>   htid       acces…¹ acces…² url   title dateC…³ lastR…⁴ pubDate schem…⁵ typeO…⁶
#>   <chr>      <chr>   <chr>   <chr> <chr>   <dbl>   <dbl>   <dbl> <chr>   <chr>  
#> 1 mdp.39015… google  ic      http… Oeuv…  2.02e7  2.02e7    1991 https:… http:/…
#> 2 mdp.39015… google  pd      http… De l…  2.02e7  2.02e7    1836 https:… http:/…
#> 3 mdp.39015… google  ic      http… Demo…  2.02e7  2.02e7    2004 https:… http:/…
#> 4 nyp.33433… google  pd      http… Demo…  2.02e7  2.02e7    1847 https:… http:/…
#> 5 uva.x0004… google  pd      http… The …  2.02e7  2.02e7    1849 https:… http:/…
#> # … with 6 more variables: language <chr>, oclc <dbl>, genre <chr>,
#> #   contributor <chr>, publisher <chr>, pubPlace <chr>, and abbreviated
#> #   variable names ¹accessProfile, ²accessRights, ³dateCreated,
#> #   ⁴lastRightsUpdateDate, ⁵schemaVersion, ⁶typeOfResource
# }