Queries the Hathi Trust Bookworm Server at https://bookworm.htrc.illinois.edu/develop/

This function retrieves word frequency data from the Hathi Trust Bookworm Server at https://bookworm.htrc.illinois.edu/develop/, with options to group the results by various metadata fields and to restrict the query using those same fields. It uses code authored by Ben Schmidt (from https://github.com/bmschmidt/edinburgh/).

query_bookworm(
  word,
  groups = "date_year",
  ignore_case = TRUE,
  counttype = "WordsPerMillion",
  method = c("data", "returnPossibleFields", "search_results"),
  format = c("json", "csv", "tsv", "feather"),
  lims = c(1920, 2000),
  compare_to,
  as_json = FALSE,
  verbose = FALSE,
  query,
  ...
)

Arguments

word

Term to get frequencies for. Can be a vector of strings. It can be left empty if one is interested primarily in statistics about the corpus as a whole.

groups

Category to group results by. The default is date_year, which groups results by year.

ignore_case

Whether to ignore case in the search. Default is TRUE.

counttype

Type of count to return. The default is "WordsPerMillion" (words per million). According to the API documentation, the following options are available:

WordCount: The number of words matching the terms in search_limits for each group. (If no words key is specified, the sum of all the words in the book).

TextCount: The number of texts matching the constraints on search_limits for each group.

WordsPerMillion: The number of words in the search_limits per million words in the broader set. (Words per million, rather than percent, gives a more legible number).

TextPercent: The percentage of texts in the broader group matching the search terms.

TotalTexts: The number of texts matching the constraints on compare_limits. (By selecting TextCount and TotalTexts, you can derive TextPercent locally, if you prefer).

TotalWords: The number of words in the larger set.

WordsRatio: equal to WordCount/TotalWords. Useful when method = "search_results".

SumWords: equal to TotalWords + WordCount

TextRatio: equal to TextCount/TotalTexts.

SumTexts: equal to TextCount + TotalTexts

It is possible to combine some of these - e.g., counttype = c("TextCount", "TextPercent"). But it is not possible to combine Text- counts with Word- counts in this version of the API.
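The relationships among these count types can be checked locally. The following is a minimal sketch in base R that uses made-up counts (not real Bookworm output) and assumes the definitions given above; because Text- and Word- counts cannot be combined in one call, the columns would in practice come from separate queries.

```r
# Hypothetical counts for two groups (years); these are NOT real Bookworm results.
counts <- data.frame(
  date_year  = c(1900, 1901),
  WordCount  = c(250, 400),     # occurrences of the search term
  TotalWords = c(50e6, 80e6),   # words in the broader set
  TextCount  = c(30, 45),       # texts matching the search
  TotalTexts = c(1200, 1500)    # texts in the broader set
)

# Derive the ratio-style count types from the raw counts,
# following the definitions listed in this section.
derived <- transform(
  counts,
  WordsRatio      = WordCount / TotalWords,
  WordsPerMillion = WordCount / TotalWords * 1e6,
  TextRatio       = TextCount / TotalTexts,
  TextPercent     = TextCount / TotalTexts * 100
)

derived
```

Requesting the raw counts and deriving the ratios locally keeps the denominators available for later aggregation (for example, pooling several years before dividing).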

method

Type of results to return. Can be data (the default), returnPossibleFields (the metadata fields available to use in groups), or search_results (a list of books and HathiTrust URLs matching a query). With data, the result is automatically converted to a proper tibble when possible; according to the API documentation, the underlying JSON is structured as "nested dicts for each grouping in groups pointing to an array consisting of the results for each count in counttype". Note that search_results currently returns at most 100 books; how they are ordered depends on counttype, as described in the notes. Notes:

When using returnPossibleFields all other fields are ignored.
When using search_results only the first 100 results are returned, sorted by the percentage of hits in the text. That biases the results towards either texts that use the word a lot or texts that use it rarely. It is possible to use counttype = "WordsRatio" to return a list sorted randomly, weighted by the number of times the word appears in each text. The API documentation notes that "this means that a random word from the first text should represent a random usage from the overall sample. The current MySQL-python implementation uses an approximation for this: LOG(1-RAND())/sum(main.count) that should mimic a weighted random ordering for most distributions, but in some cases it may not behave as intended."
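To give a feel for what a weighted random ordering of this kind does, here is a small base R sketch of the quoted expression. It is only an illustration: the hit counts are invented, the helper name weighted_order is hypothetical, and the descending sort direction is an assumption made here (the server-side implementation may differ).

```r
set.seed(42)  # only so that the illustration is reproducible

# Hypothetical per-text hit counts (the weights); not real Bookworm data.
hits <- c(text_a = 1000, text_b = 10, text_c = 1)

# One weighted random ordering. log(1 - u) is negative, and dividing by a
# large count pulls the key toward zero, so sorting in decreasing order
# tends to place texts with many hits first.
weighted_order <- function(w) {
  key <- log(1 - runif(length(w))) / w
  names(w)[order(key, decreasing = TRUE)]
}

# Over many draws, each text should come first with probability
# proportional to its count (here roughly 1000/1011 for text_a).
first <- replicate(5000, weighted_order(hits)[1])
prop_first <- prop.table(table(first))
prop_first
```

Under this scheme a text's chance of heading the list is proportional to its number of hits, which is why a random word from the first text behaves like a random usage from the whole sample.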

format

Format of returned results. In theory the Bookworm DB should be able to return results as "json", "tsv", "csv", or even "feather"; currently only "json" works (and it's the only supported format here).

lims

Min and max year as a two-element numeric vector. Default is c(1920, 2000).

compare_to

A word to compare relative frequencies to. Currently this is most useful with counttype = "WordsRatio"; this compares the relative frequency of two words.

as_json

Whether to return the raw JSON. Useful for complex queries where the function does not know how to return a tibble, or when you want to use the raw JSON to produce a different data structure.

verbose

If TRUE, shows the JSON query once built.

query

You can directly pass on a query string (in JSON). This is useful for very complex queries, but there's no checking that the parameters are correct so you may encounter unexpected errors. See https://bookworm-project.github.io/Docs/query_structure.html for more on the query structure. If you use query, all other parameters are silently ignored. Use with care!

...

Additional parameters passed to the query builder. These are the fields that method = "returnPossibleFields" returns, and they can be used both to limit the query and to group it (e.g., groups = "lc_classes"). At the time of writing, these fields were: lc_classes, lc_subclass, fiction_nonfiction, genres, languages, htsource, digitization_agent_code, mainauthor, publisher, format, is_gov_doc, page_count_bin, word_count_bin, publication_country, publication_state, publication_place. These fields are not documented, and in some cases one must know the exact string to search for; for example, a search with mainauthor = "Tocqueville" won't find anything, but a search with mainauthor = "Tocqueville, Alexis de 1805-1859." may. The fields should be accessible via options("hathiTools.bookworm.fields").

Value

A tidy tibble whenever possible, with columns for each grouping parameter, the word (if any), and the counts and counttypes. For method = "search_results", a workset that can be used in browse_htids and get_workset_meta.
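Because the result is in long format (one row per group and counttype), it is sometimes convenient to spread the count types into columns. The sketch below uses base R's reshape() on a plain data frame that copies the first six rows of the first output in the Examples section; the derived column ImpliedTotalWords is an illustrative name, and its values are only approximate because the printed figures are rounded.

```r
# First six rows of the first example's output, typed in by hand
# as a plain data frame (a tibble would behave similarly).
res <- data.frame(
  word      = rep("democracy", 6),
  date_year = rep(1760:1762, each = 2),
  value     = c(0.382, 108, 0.253, 60, 0.332, 92),
  counttype = rep(c("WordsPerMillion", "WordCount"), times = 3)
)

# Long -> wide: one row per word and year, one column per count type.
wide <- reshape(
  res,
  idvar     = c("word", "date_year"),
  timevar   = "counttype",
  direction = "wide"
)

# With both count types side by side, the size of the broader set can be
# approximated from the definition of WordsPerMillion.
wide$ImpliedTotalWords <- wide$value.WordCount / wide$value.WordsPerMillion * 1e6

wide
```

The same reshaping could be done with tidyr::pivot_wider() if the tidyverse is already loaded; base R is used here only to keep the sketch free of dependencies.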

Author

Ben Schmidt

Examples

# \donttest{
query_bookworm(word = c("democracy", "monarchy"), lims = c(1760, 2000),
  counttype = c("WordsPerMillion", "WordCount"))
#> # A tibble: 964 × 4
#>    word      date_year   value counttype      
#>    <chr>         <int>   <dbl> <chr>          
#>  1 democracy      1760   0.382 WordsPerMillion
#>  2 democracy      1760 108     WordCount      
#>  3 democracy      1761   0.253 WordsPerMillion
#>  4 democracy      1761  60     WordCount      
#>  5 democracy      1762   0.332 WordsPerMillion
#>  6 democracy      1762  92     WordCount      
#>  7 democracy      1763   0.455 WordsPerMillion
#>  8 democracy      1763 113     WordCount      
#>  9 democracy      1764   0.593 WordsPerMillion
#> 10 democracy      1764 148     WordCount      
#> # … with 954 more rows

query_bookworm(word = "democracy", groups = c("date_year", "lc_classes"),
  lims = c(1900,2000))
#> # A tibble: 2,222 × 5
#>    word      date_year lc_classes                            value counttype    
#>    <chr>         <int> <chr>                                 <dbl> <chr>        
#>  1 democracy      1900 unknown                               4.44  WordsPerMill…
#>  2 democracy      1900 Language and Literature               4.87  WordsPerMill…
#>  3 democracy      1900 Social Sciences                       7.27  WordsPerMill…
#>  4 democracy      1900 General and Old World History         8.99  WordsPerMill…
#>  5 democracy      1900 Science                               0.166 WordsPerMill…
#>  6 democracy      1900 Philosophy, Psychology, and Religion  4.28  WordsPerMill…
#>  7 democracy      1900 Technology                            0.372 WordsPerMill…
#>  8 democracy      1900 Law                                   0.488 WordsPerMill…
#>  9 democracy      1900 Political Science                    13.1   WordsPerMill…
#> 10 democracy      1900 General Works                        14.6   WordsPerMill…
#> # … with 2,212 more rows

query_bookworm(word = "democracy", groups = "date_year", date_year = "1941",
  lc_classes = "Education", method = "search_results")
#> # A tibble: 100 × 3
#>    htid                      title                                         url  
#>    <chr>                     <chr>                                         <chr>
#>  1 nc01.ark:/13960/t2v41mn4r Teaching democracy in the North Carolina pub… http…
#>  2 mdp.39015062763720        The education of free men in American democr… http…
#>  3 uc1.$b67929               The education of free men in American democr… http…
#>  4 mdp.39015068297905        The education of free men in American democr… http…
#>  5 uc1.$b67873               Pennsylvania bill of rights week. Recommenda… http…
#>  6 mdp.39015035886111        Education in a world of fear,                 http…
#>  7 coo.31924013433044        Education in a world of fear,                 http…
#>  8 mdp.39015031665543        Education and the morale of a free people.    http…
#>  9 uiug.30112108068831       Proceedings of the convention.                http…
#> 10 uc1.$b67928               Education and the morale of a free people.    http…
#> # … with 90 more rows
# }