Builds a workset of Hathi Trust volume IDs by querying the Workset Builder 2.0

Queries the SOLR endpoint of the Workset Builder 2.0 (beta) at https://solr2.htrc.illinois.edu/solr-ef20/. This API is experimental, so this function may stop working without notice if the API changes.

workset_builder(
  token,
  title,
  name,
  imprint,
  pub_date,
  lang = "eng",
  volumes_only = TRUE,
  token_join = c("AND", "OR"),
  max_vols = Inf,
  query_string,
  verbose = FALSE
)

Arguments

token: The tokens to search for in the Hathi Trust Extracted Features files. Can be a character vector, e.g., c("liberal", "democracy"). If it has more than one element, the elements are combined using the value of token_join. The default is "AND", so the query finds all volumes in which every token appears, though not necessarily on the same page (in the example, all volumes containing both "liberal" and "democracy"). If token_join is "OR", the query finds all volumes in which any of the tokens appears. Search is case-insensitive. Phrases can be included (e.g., "liberal democracy"); the database then returns matches where both terms appear on the same page (though not necessarily next to each other).
title: Title field. Multiple words will be joined with "AND"; can be a phrase (e.g., "Democracy in America").
name: Names associated with the book (e.g., author). Multiple terms will be joined with "AND"; can be a phrase (e.g., "Alexis de Tocqueville").
imprint: Imprint information (e.g., publisher). Multiple terms will be joined with "AND"; can be a phrase (e.g., "University of Chicago Press").
pub_date: Publication date in Hathi Trust metadata. Can be a range, e.g., 1800:1900, or a set of years, e.g., c(1800, 1805).
lang: Language. Default is "eng" (English); a string like "English" or any 2- or 3-letter ISO 639 code (available in the dataset iso639 included with this package) is allowed. If no matching language code is found, all languages are searched; set to NULL to search all languages explicitly. This function can currently search only one language at a time; to search for terms in more than one language, create multiple worksets and bind them together.
volumes_only: If TRUE (the default), returns only volume IDs plus a count of the number of times the tokens appear in the volume; FALSE returns both volume and page IDs where the tokens are found. Note that page IDs are 0-based; to find the page on the Hathi Digital Library site, add 1. browse_htids does this automatically.
token_join: The logical connector for the tokens in token, if more than one. Default is "AND"; the query will ask for all volumes where all tokens occur. "OR" means the query will ask for all volumes where any of the tokens occur.
max_vols: Maximum number of volumes to return. Default is Inf (all volumes). The limit is applied locally rather than by the server, so even with a small value the function generally still has to download all matching htids.
query_string: A SOLR query string to send directly; this is useful for complex queries. For a guide to SOLR query syntax, see https://solr.apache.org/guide/6_6/the-standard-query-parser.html#the-standard-query-parser; for information about which fields are available, see the Workset Builder page https://solr2.htrc.illinois.edu/solr-ef20/.
verbose: Whether to display the query string used. Default is FALSE. Setting it to TRUE is a useful way to learn the more complex SOLR query syntax.
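
The usage lists token_join = c("AND", "OR") while the description gives "AND" as the default. A common R idiom that produces this behavior is match.arg(), which picks the first choice when the argument is not supplied. The sketch below only illustrates that idiom; join_tokens is a hypothetical helper, and nothing here is confirmed to match how workset_builder builds its query internally.

```r
# Hypothetical helper, for illustration only: it is NOT part of the package.
# match.arg() resolves the default c("AND", "OR") to its first element, "AND".
join_tokens <- function(token, token_join = c("AND", "OR")) {
  token_join <- match.arg(token_join)
  # Put the connector between the tokens, e.g. "a AND b"
  paste(token, collapse = paste0(" ", token_join, " "))
}

default_join <- join_tokens(c("liberal", "democracy"))
or_join <- join_tokens(c("liberal", "democracy"), token_join = "OR")

print(default_join)
print(or_join)
```

With this idiom, any value other than "AND" or "OR" raises an error instead of being passed on silently.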

Value

A tibble with volume IDs, the number of occurrences of the terms in each volume, and, if volumes_only is FALSE, a column of page IDs.
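
Because page IDs are 0-based, converting them to the page number used on the Hathi Digital Library site means adding 1. The sketch below does this by hand in base R, assuming page IDs always have the form "<htid>.page-<digits>" seen in the example output; browse_htids is described as doing the adjustment automatically, so this is only needed when working with the IDs directly.

```r
# Two page IDs copied from the example output in this help page
page_ids <- c("chi.11764084.page-000986",
              "coo.31924003340688.page-000223")

# Volume ID: everything before the ".page-NNNNNN" suffix
htid <- sub("\\.page-[0-9]+$", "", page_ids)

# 0-based page index: the digits after ".page-"
page_index <- as.integer(sub("^.*\\.page-", "", page_ids))

# 1-based page number, as described for the Hathi Digital Library site
site_page <- page_index + 1L

print(data.frame(htid = htid, page_index = page_index, site_page = site_page))
```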

Examples

# \donttest{
# All volumes that mention "tylenol" and "paracetamol", not necessarily on the same page
workset_builder(c("tylenol", "paracetamol"), volumes_only = FALSE)
#> # A tibble: 1,952 × 2
#>    htid               id                            
#>    <chr>              <chr>                         
#>  1 chi.11764084       chi.11764084.page-000986      
#>  2 coo.31924003340688 coo.31924003340688.page-000223
#>  3 coo.31924019099591 coo.31924019099591.page-001063
#>  4 coo.31924052053505 coo.31924052053505.page-000886
#>  5 coo.31924052053505 coo.31924052053505.page-000403
#>  6 coo.31924052073271 coo.31924052073271.page-000563
#>  7 coo.31924052541061 coo.31924052541061.page-000193
#>  8 coo.31924053019927 coo.31924053019927.page-000060
#>  9 coo.31924053019950 coo.31924053019950.page-000143
#> 10 coo.31924053885509 coo.31924053885509.page-000492
#> # … with 1,942 more rows

# All volumes mentioning "demagogue" published between 1800 and 1900
workset_builder("demagogue", pub_date = 1800:1900)
#> # A tibble: 101,354 × 2
#>    htid                          n
#>    <chr>                     <int>
#>  1 nyp.33433070238617           94
#>  2 njp.32101068970605           43
#>  3 aeu.ark:/13960/t75t42186     42
#>  4 wu.89100065895               40
#>  5 uc2.ark:/13960/t7kp81q4p     39
#>  6 uiug.30112114022434          38
#>  7 iau.31858039473388           37
#>  8 uc2.ark:/13960/t7tm76n4z     37
#>  9 uiuo.ark:/13960/t23b9fk80    36
#> 10 uiug.30112114022442          35
#> # … with 101,344 more rows

# All volumes mentioning "demagogue" with "Tocqueville" and "Reeve"
# in the author field
workset_builder("demagogue", name = c("Tocqueville", "Reeve"))
#> # A tibble: 58 × 2
#>    htid                         n
#>    <chr>                    <int>
#>  1 nyp.33433081795365           2
#>  2 aeu.ark:/13960/t23b7448p     1
#>  3 aeu.ark:/13960/t6349g80j     1
#>  4 aeu.ark:/13960/t9p27mv8b     1
#>  5 coo.31924030454809           1
#>  6 coo.31924030454817           1
#>  7 hvd.32044004561239           1
#>  8 hvd.32044010093979           1
#>  9 hvd.32044010393551           1
#> 10 hvd.32044011894870           1
#> # … with 48 more rows

# All volumes with "Tocqueville" in the author field
workset_builder(name = "Tocqueville")
#> # A tibble: 529 × 2
#>    htid                   n
#>    <chr>              <int>
#>  1 mdp.39015079304757  1358
#>  2 mdp.39015008706338  1213
#>  3 mdp.39015058109706   945
#>  4 nyp.33433081795357   910
#>  5 uva.x000469924       909
#>  6 hvd.32044051720316   906
#>  7 coo.31924030454809   904
#>  8 nyp.33433081795266   903
#>  9 ien.35556041207515   901
#> 10 nyp.33433081795381   901
#> # … with 519 more rows
# }
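
Since only one language can be searched per call, a multi-language workset has to be assembled from several results. The sketch below shows one way to bind them together in base R. To run without network access it uses small stand-in data frames with made-up htids and counts in place of real calls; the search terms, the added lang column and the variable names are illustrative assumptions.

```r
# Stand-ins for the results of two single-language calls such as
#   workset_builder("democracy", lang = "en")
#   workset_builder("démocratie", lang = "fr")
# The htids and counts below are invented for illustration.
ws_en <- data.frame(htid = c("aaa.001", "aaa.002"), n = c(12L, 3L))
ws_fr <- data.frame(htid = c("bbb.001", "aaa.002"), n = c(7L, 1L))

# Record which query each row came from, then stack the two results
ws_en$lang <- "en"
ws_fr$lang <- "fr"
combined <- rbind(ws_en, ws_fr)

# A volume can match in more than one query, so deduplicate the IDs
unique_htids <- unique(combined$htid)

print(combined)
print(unique_htids)
```

dplyr::bind_rows() would work the same way on the tibbles that workset_builder returns.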