R/bookworm.R
query_bookworm.Rd
This function retrieves word frequency data from the Hathi Trust Bookworm Server at https://bookworm.htrc.illinois.edu/develop/, with options to group the results according to various forms of metadata and to limit according to that same metadata. It uses code authored by Ben Schmidt (from https://github.com/bmschmidt/edinburgh/).
Term to get frequencies for. Can be a vector of strings. It can be left empty if one is interested primarily in statistics about the corpus as a whole.
Category to group results by. The default is date_year, which groups results by year.
If TRUE (the default), the search ignores case.
The default is words per million, counttype = "WordsPerMillion". According to the API documentation, the following options are available:
WordCount: The number of words matching the terms in search_limits for each group. (If no words key is specified, the sum of all the words in the book).
TextCount: The number of texts matching the constraints on search_limits for each group.
WordsPerMillion: The number of words in the search_limits per million words in the broader set. (Words per million, rather than percent, gives a more legible number).
TextPercent: The percentage of texts in the broader group matching the search terms.
TotalTexts: The number of texts matching the constraints on compare_limits. (By selecting TextCount and TotalTexts, you can derive TextPercent locally, if you prefer).
TotalWords: The number of words in the larger set.
WordsRatio: equal to WordCount/TotalWords. Useful when method = "search_results".
SumWords: equal to TotalWords + WordCount.
TextRatio: equal to TextCount/TotalTexts.
SumTexts: equal to TextCount + TotalTexts.
It is possible to combine some of these, e.g., counttype = c("TextCount", "TextPercent"). But it is not possible to combine Text-counts with Word-counts in this version of the API.
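The local derivation of TextPercent from TextCount and TotalTexts mentioned above can be sketched in base R. This is only an illustration: the data frame is made up, and it assumes the long layout (one row per group and counttype) shown in the examples on this page.

```r
# Illustrative data only: shaped like the long tibble returned for
# counttype = c("TextCount", "TotalTexts"); the numbers are made up.
counts <- data.frame(
  date_year = c(1900L, 1900L, 1901L, 1901L),
  counttype = c("TextCount", "TotalTexts", "TextCount", "TotalTexts"),
  value     = c(25, 1000, 40, 800)
)

# Split by counttype, then join the two pieces on the grouping column.
text_count  <- counts[counts$counttype == "TextCount",  c("date_year", "value")]
total_texts <- counts[counts$counttype == "TotalTexts", c("date_year", "value")]
wide <- merge(text_count, total_texts, by = "date_year",
              suffixes = c("_count", "_total"))

# Local equivalents of TextRatio (a fraction) and TextPercent.
wide$text_ratio   <- wide$value_count / wide$value_total
wide$text_percent <- 100 * wide$text_ratio
wide
```

With more than one grouping variable, the same join would need every grouping column in `by`.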
Type of results to return. Can be data (the default; automatically converted to a proper tibble when possible. According to the API documentation, the JSON is structured as "nested dicts for each grouping in groups pointing to an array consisting of the results for each count in counttype"), returnPossibleFields (metadata fields available to use in groups), or search_results (a list of books and HathiTrust URLs matching a query). Note that search_results is limited to 100 books at the moment. Notes:
When using returnPossibleFields, all other fields are ignored.
When using search_results, only the first 100 results are returned, sorted by the percentage of hits in the text. That biases towards either texts that use the words a lot, or texts that use them rarely. It is possible to use counttype = "WordsRatio" to return a list sorted randomly, weighted by the number of times the word appears in it. The API documentation notes that "this means that a random word from the first text should represent a random usage from the overall sample. The current MySQL-python implementation uses an approximation for this: LOG(1-RAND())/sum(main.count) that should mimic a weighted random ordering for most distributions, but in some cases it may not behave as intended."
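To give a feel for the weighted random ordering quoted above, here is a minimal base R sketch of the same kind of key, log(1 - U) / count. It is not the server's code: the hit counts are made up, the helper `weighted_order()` is hypothetical, and it assumes that keys closest to zero are ranked first.

```r
set.seed(42)  # only so the sketch is reproducible

# Hypothetical per-text hit counts (made up for illustration).
hits <- c(text_a = 1, text_b = 10, text_c = 100)

# One random key per text: log(1 - U) / count, with U ~ Uniform(0, 1).
# Keys are negative; a larger count pulls the key towards zero.
weighted_order <- function(counts) {
  keys <- log(1 - runif(length(counts))) / counts
  names(counts)[order(keys, decreasing = TRUE)]
}

# Over many draws, how often each text comes first should roughly track
# its share of all hits (about 1/111, 10/111 and 100/111 here).
first <- replicate(5000, weighted_order(hits)[1])
round(prop.table(table(first)), 2)
```

Texts with many hits usually come first, but texts with few hits still appear occasionally, which is what distinguishes this from a plain sort by count.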
Format of returned results. In theory the Bookworm DB should be able to return results as "json", "tsv", "csv", or even "feather"; currently only "json" works (and it's the only supported format here).
Min and max year as a two-element numeric vector. Default is c(1920, 2000).
A word to compare relative frequencies to. Currently this is most useful with counttype = "WordsRatio"; this compares the relative frequency of two words.
Whether to return the raw json. Useful for complex queries where the function does not know how to return a tibble, or when you want to use the raw json to produce a different data structure.
If TRUE, shows the JSON query once built.
You can pass a query string (in JSON) directly. This is useful for very complex queries, but the parameters are not checked for correctness, so you may encounter unexpected errors. See https://bookworm-project.github.io/Docs/query_structure.html for more on the query structure. If you use query, all other parameters are silently ignored. Use with care!
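As a rough sketch of what such a hand-built query might look like, the list below follows the key names described in the query structure documentation linked above (database, method, format, search_limits, counttype, groups). The database name "my_bookworm_db" is a hypothetical placeholder, and the `$gte`/`$lte` form of the year limits is an assumption about how a range would be written.

```r
# Key names follow the Bookworm query-structure docs; the database name is a
# hypothetical placeholder, not the real server's database.
query <- list(
  database      = "my_bookworm_db",
  method        = "data",
  format        = "json",
  search_limits = list(
    word      = list("democracy"),
    date_year = list(`$gte` = 1920, `$lte` = 2000)  # assumed range syntax
  ),
  counttype     = list("WordsPerMillion"),
  groups        = list("date_year")
)

# Serialise only if jsonlite is installed; auto_unbox keeps scalars as scalars
# while the unnamed lists above still become JSON arrays.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  query_json <- as.character(jsonlite::toJSON(query, auto_unbox = TRUE))
  cat(query_json, "\n")
  # query_bookworm(query = query_json)  # not run: needs hathiTools and network access
}
```

Building the query as an R list first makes it easier to inspect before serialising, which helps given that nothing is validated once a raw query is passed in.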
Additional parameters passed to the query builder; these would be the fields that method = "returnPossibleFields" returns, including fields to group the query by (e.g., groups = "class"). At the date of this writing, these fields were: lc_classes, lc_subclass, fiction_nonfiction, genres, languages, htsource, digitization_agent_code, mainauthor, publisher, format, is_gov_doc, page_count_bin, word_count_bin, publication_country, publication_state, publication_place. These are not documented, and in some cases one must know the exact string to search for; for example, a search with mainauthor = "Tocqueville" won't find anything, but a search with mainauthor = "Tocqueville, Alexis de 1805-1859." may. These fields should be accessible via options("hathiTools.bookworm.fields").
A tidy tibble whenever possible, with columns for each grouping parameter, the word (if any), and the counts and counttypes. For method = "search_results", a workset that can be used in browse_htids and get_workset_meta.
# \donttest{
query_bookworm(word = c("democracy", "monarchy"), lims = c(1760, 2000),
               counttype = c("WordsPerMillion", "WordCount"))
#> # A tibble: 964 × 4
#>    word      date_year   value counttype
#>    <chr>         <int>   <dbl> <chr>
#>  1 democracy      1760   0.382 WordsPerMillion
#>  2 democracy      1760 108     WordCount
#>  3 democracy      1761   0.253 WordsPerMillion
#>  4 democracy      1761  60     WordCount
#>  5 democracy      1762   0.332 WordsPerMillion
#>  6 democracy      1762  92     WordCount
#>  7 democracy      1763   0.455 WordsPerMillion
#>  8 democracy      1763 113     WordCount
#>  9 democracy      1764   0.593 WordsPerMillion
#> 10 democracy      1764 148     WordCount
#> # … with 954 more rows
query_bookworm(word = "democracy", groups = c("date_year", "lc_classes"),
               lims = c(1900, 2000))
#> # A tibble: 2,222 × 5
#>    word      date_year lc_classes                            value counttype
#>    <chr>         <int> <chr>                                 <dbl> <chr>
#>  1 democracy      1900 unknown                               4.44  WordsPerMill…
#>  2 democracy      1900 Language and Literature               4.87  WordsPerMill…
#>  3 democracy      1900 Social Sciences                       7.27  WordsPerMill…
#>  4 democracy      1900 General and Old World History         8.99  WordsPerMill…
#>  5 democracy      1900 Science                               0.166 WordsPerMill…
#>  6 democracy      1900 Philosophy, Psychology, and Religion  4.28  WordsPerMill…
#>  7 democracy      1900 Technology                            0.372 WordsPerMill…
#>  8 democracy      1900 Law                                   0.488 WordsPerMill…
#>  9 democracy      1900 Political Science                    13.1   WordsPerMill…
#> 10 democracy      1900 General Works                        14.6   WordsPerMill…
#> # … with 2,212 more rows
query_bookworm(word = "democracy", groups = "date_year", date_year = "1941",
               lc_classes = "Education", method = "search_results")
#> # A tibble: 100 × 3
#>    htid                      title                                         url
#>    <chr>                     <chr>                                         <chr>
#>  1 nc01.ark:/13960/t2v41mn4r Teaching democracy in the North Carolina pub… http…
#>  2 mdp.39015062763720        The education of free men in American democr… http…
#>  3 uc1.$b67929               The education of free men in American democr… http…
#>  4 mdp.39015068297905        The education of free men in American democr… http…
#>  5 uc1.$b67873               Pennsylvania bill of rights week. Recommenda… http…
#>  6 mdp.39015035886111        Education in a world of fear,                 http…
#>  7 coo.31924013433044        Education in a world of fear,                 http…
#>  8 mdp.39015031665543        Education and the morale of a free people.    http…
#>  9 uiug.30112108068831       Proceedings of the convention.                http…
#> 10 uc1.$b67928               Education and the morale of a free people.    http…
#> # … with 90 more rows
# }