R/workset_builder.R
workset_builder.Rd
Queries the SOLR endpoint of the Workset Builder 2.0 (beta) at https://solr2.htrc.illinois.edu/solr-ef20/. The API is experimental, so this function may stop working without notice if the API changes.
workset_builder(
token,
title,
name,
imprint,
pub_date,
lang = "eng",
volumes_only = TRUE,
token_join = c("AND", "OR"),
max_vols = Inf,
query_string,
verbose = FALSE
)
token
The tokens to search for in the Hathi Trust Extracted Features files. Can be a character vector, e.g., c("liberal", "democracy"). If the vector has more than one element, the tokens are combined using the value of token_join. The default is "AND", so the query finds all volumes where all the tokens appear, though not necessarily on the same page (in the example, all volumes containing both "liberal" and "democracy"). If token_join is "OR", the query finds all volumes where any of the tokens appears. Search is case-insensitive. Phrases can be included (e.g., "liberal democracy"); the database then returns matches where both terms appear on the same page (though not necessarily next to each other).
title
Title field. Multiple words will be joined with "AND"; can be a phrase (e.g., "Democracy in America").
name
Names associated with the book (e.g., author). Multiple terms will be joined with "AND"; can be a phrase (e.g., "Alexis de Tocqueville").
imprint
Imprint information (e.g., publisher). Multiple terms will be joined with "AND"; can be a phrase (e.g., "University of Chicago Press").
pub_date
Publication date in Hathi Trust metadata. Can be a range, e.g., 1800:1900, or a set of years, e.g., c(1800, 1805).
lang
Language. Default is "eng" (English). A string like "English" or any 2- or 3-letter ISO 639 code (available in the dataset iso639 included with this package) is allowed. If no matching language code is found, the function searches all languages; set to NULL to search all languages explicitly. Currently this function can search only one language at a time; to search for terms in more than one language, create multiple worksets and bind them together.
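Binding one workset per language can be sketched as follows. This is a minimal illustration, not part of the package: search_languages and its fetch argument are hypothetical names, and a stub with placeholder rows stands in for workset_builder so the code runs without network access.

```r
# Hypothetical helper: run one search per language and stack the results.
# `fetch` is any function taking a token and a `lang` argument and returning
# a data frame; in real use it would be workset_builder.
search_languages <- function(token, langs, fetch) {
  results <- lapply(langs, function(l) {
    out <- fetch(token, lang = l)
    out$lang <- rep(l, nrow(out))  # record which language each row came from
    out
  })
  do.call(rbind, results)
}

# Stub standing in for workset_builder(); the IDs and counts are placeholders,
# not real Hathi Trust data.
stub_fetch <- function(token, lang) {
  data.frame(
    htid = paste0("example.", lang, 1:2),
    n = c(2L, 1L),
    stringsAsFactors = FALSE
  )
}

worksets <- search_languages("democracy", c("en", "fr"), fetch = stub_fetch)
nrow(worksets)
```

With network access and the package attached, passing fetch = workset_builder in place of the stub should issue one query per language; dplyr::bind_rows is a reasonable alternative to rbind for tibbles.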
volumes_only
If TRUE (the default), returns only volume IDs plus a count of the number of times the tokens appear in the volume; FALSE returns both volume and page IDs where the tokens are found. Note that page IDs are 0-based; to find the page on the Hathi Digital Library site, add 1. browse_htids does this automatically.
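The 0-based offset can be handled by hand when needed. The sketch below assumes page IDs have the form shown in the examples on this page (a volume ID followed by ".page-" and a zero-padded number); page_number is a hypothetical helper, not a function of this package.

```r
# Hypothetical helper: convert a 0-based page ID such as
# "coo.31924003340688.page-000223" into a 1-based page number.
page_number <- function(id) {
  # Keep only the digits after the final ".page-", then add 1.
  as.integer(sub("^.*\\.page-([0-9]+)$", "\\1", id)) + 1L
}

ids <- c("chi.11764084.page-000986", "coo.31924003340688.page-000223")
page_number(ids)
#> [1] 987 224
```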
token_join
The logical connector for the tokens in token, if more than one. Default is "AND": the query asks for all volumes where all tokens occur. "OR" means the query asks for all volumes where any of the tokens occurs.
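The effect of the connector can be illustrated with a small sketch. It shows only the Boolean semantics and how a default of c("AND", "OR") resolves to "AND" via match.arg; join_tokens is a hypothetical name, and the way the package actually builds its SOLR query may differ.

```r
# Hypothetical illustration: combine tokens with a logical connector.
join_tokens <- function(token, token_join = c("AND", "OR")) {
  # match.arg() picks the first choice ("AND") when no value is supplied
  # and rejects anything other than "AND" or "OR".
  token_join <- match.arg(token_join)
  paste(token, collapse = paste0(" ", token_join, " "))
}

join_tokens(c("liberal", "democracy"))
#> [1] "liberal AND democracy"
join_tokens(c("liberal", "democracy"), token_join = "OR")
#> [1] "liberal OR democracy"
```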
max_vols
Maximum number of volumes to return. Default is Inf (all volumes). The limit is applied locally rather than on the server, so even with a small value you will generally still download all matching htids.
query_string
A SOLR query string to pass directly; this is useful for complex queries. For a guide to SOLR query syntax, see https://solr.apache.org/guide/6_6/the-standard-query-parser.html#the-standard-query-parser; for the available fields, see the Workset Builder page at https://solr2.htrc.illinois.edu/solr-ef20/.
verbose
Whether to display the query string used. Default is FALSE. This is useful for learning the more complex SOLR query syntax.
A tibble with volume IDs, the number of occurrences of the terms in the volume, and, if volumes_only is FALSE, a column for page IDs.
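A page-level result can be summarised per volume with base R. The sketch below is an illustration only: it uses a plain data frame holding the first five rows of the page-level example output shown on this page, in place of a live query, and pages_per_volume is a hypothetical name.

```r
# First five rows of the page-level example output (volumes_only = FALSE),
# copied into a data frame so the code runs offline.
pages <- data.frame(
  htid = c("chi.11764084", "coo.31924003340688", "coo.31924019099591",
           "coo.31924052053505", "coo.31924052053505"),
  id = c("chi.11764084.page-000986", "coo.31924003340688.page-000223",
         "coo.31924019099591.page-001063", "coo.31924052053505.page-000886",
         "coo.31924052053505.page-000403"),
  stringsAsFactors = FALSE
)

# Count how many matching pages each volume has.
pages_per_volume <- as.data.frame(
  table(htid = pages$htid),
  responseName = "n_pages",
  stringsAsFactors = FALSE
)
pages_per_volume
```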
# \donttest{
# All volumes that mention both "tylenol" and "paracetamol", not necessarily on the same page
workset_builder(c("tylenol", "paracetamol"), volumes_only = FALSE)
#> # A tibble: 1,952 × 2
#> htid id
#> <chr> <chr>
#> 1 chi.11764084 chi.11764084.page-000986
#> 2 coo.31924003340688 coo.31924003340688.page-000223
#> 3 coo.31924019099591 coo.31924019099591.page-001063
#> 4 coo.31924052053505 coo.31924052053505.page-000886
#> 5 coo.31924052053505 coo.31924052053505.page-000403
#> 6 coo.31924052073271 coo.31924052073271.page-000563
#> 7 coo.31924052541061 coo.31924052541061.page-000193
#> 8 coo.31924053019927 coo.31924053019927.page-000060
#> 9 coo.31924053019950 coo.31924053019950.page-000143
#> 10 coo.31924053885509 coo.31924053885509.page-000492
#> # … with 1,942 more rows
# All volumes mentioning "demagogue" published between 1800 and 1900
workset_builder("demagogue", pub_date = 1800:1900)
#> # A tibble: 101,354 × 2
#> htid n
#> <chr> <int>
#> 1 nyp.33433070238617 94
#> 2 njp.32101068970605 43
#> 3 aeu.ark:/13960/t75t42186 42
#> 4 wu.89100065895 40
#> 5 uc2.ark:/13960/t7kp81q4p 39
#> 6 uiug.30112114022434 38
#> 7 iau.31858039473388 37
#> 8 uc2.ark:/13960/t7tm76n4z 37
#> 9 uiuo.ark:/13960/t23b9fk80 36
#> 10 uiug.30112114022442 35
#> # … with 101,344 more rows
# All volumes mentioning "demagogue" with "Tocqueville" and "Reeve"
# in the author field
workset_builder("demagogue", name = c("Tocqueville", "Reeve"))
#> # A tibble: 58 × 2
#> htid n
#> <chr> <int>
#> 1 nyp.33433081795365 2
#> 2 aeu.ark:/13960/t23b7448p 1
#> 3 aeu.ark:/13960/t6349g80j 1
#> 4 aeu.ark:/13960/t9p27mv8b 1
#> 5 coo.31924030454809 1
#> 6 coo.31924030454817 1
#> 7 hvd.32044004561239 1
#> 8 hvd.32044010093979 1
#> 9 hvd.32044010393551 1
#> 10 hvd.32044011894870 1
#> # … with 48 more rows
# All volumes with "Tocqueville" in the author field
workset_builder(name = "Tocqueville")
#> # A tibble: 529 × 2
#> htid n
#> <chr> <int>
#> 1 mdp.39015079304757 1358
#> 2 mdp.39015008706338 1213
#> 3 mdp.39015058109706 945
#> 4 nyp.33433081795357 910
#> 5 uva.x000469924 909
#> 6 hvd.32044051720316 906
#> 7 coo.31924030454809 904
#> 8 nyp.33433081795266 903
#> 9 ien.35556041207515 901
#> 10 nyp.33433081795381 901
#> # … with 519 more rows
# }