Queries the SOLR endpoint of the Workset Builder 2.0 (beta) at https://solr2.htrc.illinois.edu/solr-ef/ to download volume metadata. This API is experimental, so this function may stop working at any time if the API changes.
get_workset_meta(
workset,
metadata_dir = getOption("hathiTools.metadata.dir"),
cache = TRUE
)
A workset of htids, generated by workset_builder from Hathi Trust's SOLR endpoint. One can also pass a data frame with a column labeled "htid" containing valid Hathi Trust htids, or a character vector of htids (though in that case the function will emit a warning).
The directory used to cache the metadata file. Defaults to getOption("hathiTools.metadata.dir"), which is "./metadata" when the package is loaded.
Whether to cache the resulting metadata as a CSV. Default is TRUE. The name of the cached file is generated by appending an MD5 hash of the workset (via digest::digest) to the string "metadata-", so each distinct workset produces a differently named file.
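A minimal sketch of how such a cache filename can be derived, assuming the digest package is installed; the exact inputs that get_workset_meta hashes are an internal detail, so the htids below are only for illustration:

```r
library(digest)

## Example htids (taken from the field descriptions below)
htids <- c("mdp.39015062779023", "aeu.ark:/13960/t00z8277t")

## digest() uses the MD5 algorithm by default
hash <- digest(htids)

## A cache file name of the form "metadata-<md5>.csv"
cache_file <- file.path("./metadata", paste0("metadata-", hash, ".csv"))
```

Because the hash depends only on the input, repeated calls with the same workset map to the same file, which is what allows the cached metadata to be reused.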
A tibble::tibble with the Hathi Trust metadata for all the volumes in the workset or the vector of htids. This tibble can contain the following fields (taken from https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329):
The Hathi Trust Bibliographic URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://catalog.hathitrust.org/api/volumes/full/htid/aeu.ark:/13960/t00z8277t.json"
The schema version for the metadata block of the Extracted Features file, given as a URL linking to the schema. E.g. "https://schemas.hathitrust.org/Extracted Features_Schema_MetadataSubSchema_v_3.0"
The Hathi Trust ID.
The copyright status of the volume.
The title of the volume.
Information about the volume's genre, as determined by the cataloger of the work. Values are derived from the Genre/Form Field in the MARC record.
The year in which that edition of the volume was first published.
Named list column with information about where the volume was first published. Includes id, type, and name. type is taken from the Bibframe Instance's provisionActivity's place rdf:about node, which is derived from the country codes in the MARC 008 field.
Type of resource, e.g., "text".
Bibliographic format (e.g. "BK").
The cataloger-determined language or languages of the volume. Taken from the Bibframe Work's language's identifiedBy's value node, which is derived from the Language Code field in the MARC record. This may differ from the automatically detected language of any given page in the page-level metadata returned by get_hathi_page_meta.
The date on which the metadata portion of the Extracted Features file was generated, in YYYYMMDD format.
The most recent date the volume's copyright status was updated.
Information about the publisher of the volume.
The ISBN of the volume (when a book).
The ISSN of the volume (when a journal).
The OCLC number for the volume. An OCLC number is an identifier assigned to items as they are cataloged in a library.
The Library of Congress Control Number for the volume. An LCCN is a unique number that is assigned during cataloging.
Library classification.
The Handle URL for the volume, which will point to the volume in the HathiTrust Digital Library. E.g. "http://hdl.handle.net/2027/mdp.39015062779023"
The Hathi Trust Bibliographic record ID number.
The source institution record ID number.
The source institution.
Type of access rights.
Information about the volume, issue, and/or year in which the HathiTrust volume was published.
Whether the item is a government document.
Information about the author(s), editor(s), or other agents involved in creating the volume.
The cataloger-determined resource type of the volume (e.g., monographic).
Columns containing subject info, if present.
Be mindful that downloading a large number of metadata records can take quite some time. In practice I have found that downloading full metadata for more than about 1,000 records is a dicey proposition; if you need metadata for many thousands of records, you are probably better off using the big hathifile (see download_hathifile and load_raw_hathifile).
# \donttest{
dir <- tempdir()
workset <- workset_builder(name = "Tocqueville")
get_workset_meta(workset[1:5, ], metadata_dir = dir)
#> Getting download key...
#> Downloading metadata for 5 volumes. This might take some time.
#> # A tibble: 5 × 16
#> htid acces…¹ acces…² url title dateC…³ lastR…⁴ pubDate schem…⁵ typeO…⁶
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 mdp.39015… google ic http… Oeuv… 2.02e7 2.02e7 1991 https:… http:/…
#> 2 mdp.39015… google pd http… De l… 2.02e7 2.02e7 1836 https:… http:/…
#> 3 mdp.39015… google ic http… Demo… 2.02e7 2.02e7 2004 https:… http:/…
#> 4 nyp.33433… google pd http… Demo… 2.02e7 2.02e7 1847 https:… http:/…
#> 5 uva.x0004… google pd http… The … 2.02e7 2.02e7 1849 https:… http:/…
#> # … with 6 more variables: language <chr>, oclc <dbl>, genre <chr>,
#> # contributor <chr>, publisher <chr>, pubPlace <chr>, and abbreviated
#> # variable names ¹accessProfile, ²accessRights, ³dateCreated,
#> # ⁴lastRightsUpdateDate, ⁵schemaVersion, ⁶typeOfResource
## We can also pass a vector of htids:
get_workset_meta(workset$htid[1:5], metadata_dir = dir)
#> Metadata has already been downloaded. Returning cached metadata.
#> # A tibble: 5 × 16
#> htid acces…¹ acces…² url title dateC…³ lastR…⁴ pubDate schem…⁵ typeO…⁶
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 mdp.39015… google ic http… Oeuv… 2.02e7 2.02e7 1991 https:… http:/…
#> 2 mdp.39015… google pd http… De l… 2.02e7 2.02e7 1836 https:… http:/…
#> 3 mdp.39015… google ic http… Demo… 2.02e7 2.02e7 2004 https:… http:/…
#> 4 nyp.33433… google pd http… Demo… 2.02e7 2.02e7 1847 https:… http:/…
#> 5 uva.x0004… google pd http… The … 2.02e7 2.02e7 1849 https:… http:/…
#> # … with 6 more variables: language <chr>, oclc <dbl>, genre <chr>,
#> # contributor <chr>, publisher <chr>, pubPlace <chr>, and abbreviated
#> # variable names ¹accessProfile, ²accessRights, ³dateCreated,
#> # ⁴lastRightsUpdateDate, ⁵schemaVersion, ⁶typeOfResource
# }