Queries the SOLR endpoint of the Workset Builder 2.0 (beta) at https://solr2.htrc.illinois.edu/solr-ef/ to download volume metadata. This API is experimental, so this function may stop working at any time if the API changes.

get_workset_meta(
  workset,
  metadata_dir = getOption("hathiTools.metadata.dir"),
  cache = TRUE
)

Arguments

workset

A workset of htids, generated by workset_builder from Hathi Trust's SOLR endpoint. One can also pass a data frame with a column labeled "htid" and containing valid Hathi Trust htids, or a character vector of htids (though the function will complain with a warning).

metadata_dir

The directory used to cache the metadata file. Defaults to getOption("hathiTools.metadata.dir"), which is just "./metadata" on loading the package.

cache

Whether to cache the resulting metadata as a CSV. Default is TRUE. The name of the resulting metadata file is generated by appending an MD5 hash (via digest::digest) to the string "metadata-", so each metadata download will have a different name.
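
The naming scheme can be illustrated with a minimal sketch. What exactly gets hashed is not stated here, so hashing the sorted htids is an assumption, as are the helper name cache_file_name and the ".csv" extension. The sketch also swaps in base R's tools::md5sum (applied to a temporary file) for digest::digest so that it runs without extra packages.

```r
# Hypothetical helper, not part of the package: derive a cache file name
# from a set of htids. Hashing the htids themselves is an assumption.
cache_file_name <- function(htids, metadata_dir = "./metadata") {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # Sort and de-duplicate so the same set of htids always hashes identically.
  writeLines(sort(unique(htids)), tmp)
  # tools::md5sum() returns a named vector; drop the file-name label.
  hash <- unname(tools::md5sum(tmp))
  file.path(metadata_dir, paste0("metadata-", hash, ".csv"))
}

ids <- c("mdp.39015062779023", "aeu.ark:/13960/t00z8277t")
cache_file_name(ids)

# The same htids in a different order map to the same file name.
identical(cache_file_name(ids), cache_file_name(rev(ids)))
```

Under this scheme a repeated request for the same volumes resolves to the same file, which is consistent with the "Returning cached metadata" message in the examples.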

Value

A tibble::tibble with the Hathi Trust metadata for all the volumes in the workset or the vector of htids. This tibble can contain the following fields (taken from https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329):

htBibUrl

The Hathi Trust Bibliographic URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://catalog.hathitrust.org/api/volumes/full/htid/aeu.ark:/13960/t00z8277t.json"

schemaVersion

The schema version for the metadata block of the Extracted Features file, given as a URL linking to the schema. E.g. "https://schemas.hathitrust.org/Extracted Features_Schema_MetadataSubSchema_v_3.0"

volumeIdentifier

The Hathi Trust ID (htid) of the volume.

rightsAttributes

The copyright status of the volume.

title

The title of the volume.

genre

Information about the volume's genre, as determined by the cataloger of the work. Values are derived from the Genre/Form Field in the MARC record.

pubDate

The year in which that edition of the volume was first published.

pubPlace

Named list column. Information about where the volume was first published. Includes id, type, and name. type is taken from the Bibframe Instance's provisionActivity's place rdf:about node, which is derived from the country codes in the MARC 008 field.

typeOfResource

Type of resource, e.g., "text".

bibliographicFormat

Bibliographic format (e.g. "BK").

language

The cataloger-determined language or languages of the volume. Taken from the Bibframe Work's language's identifiedBy's value node, which is derived from the Language Code field in the MARC record. This may differ from the automatically detected language of any given page in the page-level metadata returned by get_hathi_page_meta.

dateCreated

The date on which the metadata portion of the Extracted Features file was generated, in YYYYMMDD format.

lastUpdateDate

The most recent date the volume's copyright status was updated.

imprint

Information about the publisher of the volume.

isbn

The ISBN of the volume (when a book).

issn

The ISSN of the volume (when a journal).

oclc

The OCLC number for the volume. An OCLC number is an identifier assigned to items as they are cataloged in a library.

lccn

The Library of Congress Control Number for the volume. An LCCN is a unique number that is assigned during cataloging.

classification

Library classification.

handleUrl

The Handle URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://hdl.handle.net/2027/mdp.39015062779023"

hathiTrustRecordNumber

The Hathi Trust Bibliographic record ID number.

sourceInstitutionRecordNumber

The source institution record ID number.

sourceInstitution

The source institution.

accessProfile

Type of access rights.

enumerationChronology

Information about which volume, issue, and/or year of a publication the HathiTrust volume corresponds to.

governmentDocument

Whether the item is a government document.

names

Contains information regarding the author(s), editor(s), or other agents involved in creating the volume.

issuance

The cataloger-determined resource type of the volume (e.g., monographic).

subjectGenre, subjectName, subjectTitleInfo, subjectTemporal, subjectGeographic, subjectOccupation, subjectCartographics

Columns containing subject info, if present.
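
In the example output further down, dateCreated is read as a number (<dbl>) rather than as a date. A minimal sketch of converting such YYYYMMDD values with base R follows; the data frame holds made-up illustration values, not real metadata.

```r
# Illustrative values only; real values come from get_workset_meta().
meta <- data.frame(
  htid = c("mdp.39015062779023", "aeu.ark:/13960/t00z8277t"),
  dateCreated = c(20200209, 20200210)
)

# Format the number without scientific notation, then parse it as
# year-month-day.
meta$dateCreated <- as.Date(sprintf("%.0f", meta$dateCreated), format = "%Y%m%d")
meta
```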

Details

Be mindful that downloading a large number of metadata records can take quite some time. In practice I have found that downloading full metadata for more than about 1000 records is a dicey proposition; if you need metadata for many thousands of records, you are probably better off using the big hathifile (see download_hathifile and load_raw_hathifile).
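
If a larger set of htids still has to go through this function, one option is to split it into smaller batches first. The sketch below shows only the batching step in base R. The helper name split_into_batches and the batch size of 500 are arbitrary choices for illustration, not recommendations from the package, and the ids are placeholders rather than real htids.

```r
# Hypothetical helper, not part of the package: split a character vector
# of htids into consecutive batches of at most `batch_size` elements.
split_into_batches <- function(htids, batch_size = 500) {
  stopifnot(batch_size >= 1)
  unname(split(htids, ceiling(seq_along(htids) / batch_size)))
}

# Placeholder ids purely to show the batch sizes.
ids <- sprintf("example.%04d", 1:1200)
batches <- split_into_batches(ids)
lengths(batches)
#> [1] 500 500 200

# Each batch could then be passed on one at a time, for example
# (requires hathiTools and network access, so it is not run here):
# meta <- lapply(batches, get_workset_meta, metadata_dir = tempdir())
```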

Examples

# \donttest{
dir <- tempdir()
workset <- workset_builder(name = "Tocqueville")
get_workset_meta(workset[1:5, ], metadata_dir = dir)
#> Getting download key...
#> Downloading metadata for 5 volumes. This might take some time.
#> # A tibble: 5 × 16
#>   htid       acces…¹ acces…² url   title dateC…³ lastR…⁴ pubDate schem…⁵ typeO…⁶
#>   <chr>      <chr>   <chr>   <chr> <chr>   <dbl>   <dbl>   <dbl> <chr>   <chr>  
#> 1 mdp.39015… google  ic      http… Oeuv…  2.02e7  2.02e7    1991 https:… http:/…
#> 2 mdp.39015… google  pd      http… De l…  2.02e7  2.02e7    1836 https:… http:/…
#> 3 mdp.39015… google  ic      http… Demo…  2.02e7  2.02e7    2004 https:… http:/…
#> 4 nyp.33433… google  pd      http… Demo…  2.02e7  2.02e7    1847 https:… http:/…
#> 5 uva.x0004… google  pd      http… The …  2.02e7  2.02e7    1849 https:… http:/…
#> # … with 6 more variables: language <chr>, oclc <dbl>, genre <chr>,
#> #   contributor <chr>, publisher <chr>, pubPlace <chr>, and abbreviated
#> #   variable names ¹​accessProfile, ²​accessRights, ³​dateCreated,
#> #   ⁴​lastRightsUpdateDate, ⁵​schemaVersion, ⁶​typeOfResource

## We can also pass a vector of htids:
get_workset_meta(workset$htid[1:5], metadata_dir = dir)
#> Metadata has already been downloaded. Returning cached metadata.
#> # A tibble: 5 × 16
#>   htid       acces…¹ acces…² url   title dateC…³ lastR…⁴ pubDate schem…⁵ typeO…⁶
#>   <chr>      <chr>   <chr>   <chr> <chr>   <dbl>   <dbl>   <dbl> <chr>   <chr>  
#> 1 mdp.39015… google  ic      http… Oeuv…  2.02e7  2.02e7    1991 https:… http:/…
#> 2 mdp.39015… google  pd      http… De l…  2.02e7  2.02e7    1836 https:… http:/…
#> 3 mdp.39015… google  ic      http… Demo…  2.02e7  2.02e7    2004 https:… http:/…
#> 4 nyp.33433… google  pd      http… Demo…  2.02e7  2.02e7    1847 https:… http:/…
#> 5 uva.x0004… google  pd      http… The …  2.02e7  2.02e7    1849 https:… http:/…
#> # … with 6 more variables: language <chr>, oclc <dbl>, genre <chr>,
#> #   contributor <chr>, publisher <chr>, pubPlace <chr>, and abbreviated
#> #   variable names ¹​accessProfile, ²​accessRights, ³​dateCreated,
#> #   ⁴​lastRightsUpdateDate, ⁵​schemaVersion, ⁶​typeOfResource
# }
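
As noted under Arguments, a data frame with an "htid" column is also accepted. A minimal sketch of building and checking such an input follows. The two htids are the ones quoted in the field descriptions above, and the call itself is left commented out because it needs hathiTools and network access.

```r
htid_df <- data.frame(
  htid = c("aeu.ark:/13960/t00z8277t", "mdp.39015062779023"),
  stringsAsFactors = FALSE
)

# Basic sanity checks before any download is attempted.
stopifnot(
  "htid" %in% names(htid_df),
  is.character(htid_df$htid),
  !anyNA(htid_df$htid)
)

# get_workset_meta(htid_df, metadata_dir = tempdir())
```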