vignettes/articles/using_the_hathi_bookworm.Rmd
The Hathi Trust Bookworm (https://bookworm.htrc.illinois.edu/develop/) is a tool, similar to the Google Books Ngram Viewer, for retrieving word frequency data from the texts in the Hathi Trust Digital Library. With about 17 million digitised volumes in its database (many of them originally digitised for the Google Books project), it is a powerful tool for exploring trends in word frequencies over time. Moreover, in contrast to the Google Ngram Viewer, the Bookworm can search over the metadata of the collection, which makes possible more informative queries about the sources of particular word frequency trends.[1]
This package offers one function, query_bookworm(), that makes it relatively easy to retrieve word frequency and other data from the Hathi Trust Bookworm into R, and to use it for exploratory analyses of word frequency trends.
For example, suppose we are interested in the changing frequencies of terms like “democracy”, “dictatorship”, “monarchy”, and so on. We can download the frequency of these terms (according to various metrics) with a single call:
library(hathiTools)
library(tidyverse)
library(slider) # For moving averages

res <- query_bookworm(c("democracy", "dictatorship", "monarchy",
                        "aristocracy", "oligarchy", "tyranny",
                        "autocracy"),
                      counttype = c("WordsPerMillion"),
                      lims = c(1700, 2020))
This results in a nice, tidy tibble:
res
#> # A tibble: 2,247 × 4
#>    word      date_year  value counttype
#>    <chr>         <int>  <dbl> <chr>
#>  1 democracy      1700 2.70   WordsPerMillion
#>  2 democracy      1701 0.123  WordsPerMillion
#>  3 democracy      1702 1.34   WordsPerMillion
#>  4 democracy      1703 0.765  WordsPerMillion
#>  5 democracy      1704 0.428  WordsPerMillion
#>  6 democracy      1705 0.0989 WordsPerMillion
#>  7 democracy      1706 0.163  WordsPerMillion
#>  8 democracy      1707 0.446  WordsPerMillion
#>  9 democracy      1708 0.0652 WordsPerMillion
#> 10 democracy      1709 0.0681 WordsPerMillion
#> # … with 2,237 more rows
#> # ℹ Use `print(n = ...)` to see more rows
This tibble can be used directly for plotting:
res %>%
  mutate(counttype = "Words per million") %>%
  group_by(word) %>%
  # .before = 10 and .after = 10 give a centered window of up to 21 rows (here, years)
  mutate(rolling_avg = slide_dbl(value, mean, .before = 10, .after = 10)) %>%
  ggplot(aes(x = date_year, color = word)) +
  geom_line(aes(y = value), alpha = 0.3) +
  geom_line(aes(x = date_year, y = rolling_avg)) +
  facet_wrap(~counttype) +
  labs(x = "Approx. year of publication", y = "",
       subtitle = "10 year rolling average, books published between 1700-2020",
       title = "Frequency of 'democracy' and other political terms in the HathiTrust corpus") +
  theme_bw() +
  scale_color_viridis_d()
Figure 1
The trends are clear: “democracy” becomes a much more salient term during the 19th and 20th centuries, with big peaks around the World Wars and the end of the Cold War, and an increasing level of usage in this corpus since the 1970s.
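The smoothed line in the plot comes from a centered moving average. The following is a minimal base R sketch of that idea, not slider's implementation: centered_mean() is a hypothetical helper and the input values are made up. It shows how a window of `before` and `after` neighbours is shortened at the two ends of the series, which is also what slide_dbl() does by default.

```r
# Centered moving average in base R; mirrors the idea behind
# slide_dbl(value, mean, .before = 10, .after = 10), not its implementation.
centered_mean <- function(x, before, after) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    lo <- max(1, i - before)   # window is cut short at the start of the series
    hi <- min(n, i + after)    # ... and at the end
    mean(x[lo:hi])
  }, numeric(1))
}

toy <- c(2, 4, 6, 8, 10)       # made-up values, one per year
smoothed <- centered_mean(toy, before = 1, after = 1)
smoothed
```

Note that a window defined this way counts rows, not calendar years, so it only spans a fixed number of years when every year is present in the data.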
We can also look at the frequency of democracy relative to another word (e.g., dictatorship) across time by using counttype = "WordsRatio":
res2 <- query_bookworm(word = "democracy", compare_to = "dictatorship",
                       lims = c(1900, 2000), counttype = "WordsRatio")

res2 %>%
  ggplot(aes(x = date_year, y = value)) +
  geom_line() +
  theme_bw() +
  labs(title = "Frequency of 'democracy' relative to 'dictatorship' in the Bookworm corpus",
       x = "Approx. date of publication",
       y = "Ratio")
Figure 2
‘Democracy’ is always used more frequently than ‘dictatorship’ in this corpus during the 20th century, but especially right around the First World War.
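The article does not spell out how the ratio is computed. Assuming it is simply the yearly frequency of the first word divided by that of the comparison word, the sketch below shows why it does not matter whether one starts from raw counts or from words-per-million values: the yearly total cancels. All numbers and the per_million() helper are hypothetical.

```r
# Hypothetical yearly counts; none of these numbers come from the Bookworm.
toy <- data.frame(
  date_year    = c(1900, 1901, 1902),
  democracy    = c(120, 150, 90),      # occurrences of the first word
  dictatorship = c(10, 30, 45),        # occurrences of the comparison word
  total_words  = c(1e6, 2e6, 1.5e6)    # all words published that year
)

per_million <- function(count, total) 1e6 * count / total

ratio_from_counts <- toy$democracy / toy$dictatorship
ratio_from_wpm <- per_million(toy$democracy, toy$total_words) /
  per_million(toy$dictatorship, toy$total_words)

# The yearly total cancels, so both routes give the same ratio
all.equal(ratio_from_counts, ratio_from_wpm)
```

A value above 1 in a given year then means the first word was the more frequent of the two in that year.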
We can also explore the kinds of books where ‘democracy’ is mentioned in the 20th century. This query groups the volumes that mention ‘democracy’ by both the year of publication and the volume classification in the Hathi Trust metadata:
res2 <- query_bookworm(word = "democracy", groups = c("date_year", "lc_classes"),
                       lims = c(1900, 2000))

res2 %>%
  ggplot(aes(x = date_year, y = fct_reorder(str_trunc(lc_classes, 40), value))) +
  geom_tile(aes(fill = value)) +
  facet_wrap(~word, scales = "free_y") +
  scale_fill_gradient2() +
  theme_bw() +
  labs(y = "", x = "Year", title = "Frequency of 'democracy' \nacross library classifications",
       fill = "Words per million") +
  theme(legend.position = "bottom")
Figure 3
As we might expect, volumes classified as “Political Science”, “History”, and “Social Sciences” mention democracy more often than those classified under medicine or agriculture, especially in the second half of the 20th century. More surprisingly, books classified as “Education” also mention democracy quite often, especially right around the Second World War.
It is also possible to further limit the query to, e.g., books published in a particular language or written by a particular author. For example, this gives the number of texts that use the word “democracy” in 1900-2000, grouped by library classification and language.
res3 <- query_bookworm(word = "democracy",
                       lims = c(1900, 2000),
                       groups = c("lc_classes", "languages"),
                       counttype = c("TextCount"))

res3 %>%
  filter(languages == "eng") %>%
  arrange(desc(value))
#> # A tibble: 22 × 7
#>    word      lc_classes                   langu…¹  value count…² min_y…³ max_y…⁴
#>    <chr>     <chr>                        <chr>    <int> <chr>     <dbl>   <dbl>
#>  1 democracy unknown                      eng     697511 TextCo…    1900    2000
#>  2 democracy Social Sciences              eng     186912 TextCo…    1900    2000
#>  3 democracy General and Old World Histo… eng     133276 TextCo…    1900    2000
#>  4 democracy Language and Literature      eng     117172 TextCo…    1900    2000
#>  5 democracy Political Science            eng      91759 TextCo…    1900    2000
#>  6 democracy Education                    eng      65825 TextCo…    1900    2000
#>  7 democracy Philosophy, Psychology, and… eng      61692 TextCo…    1900    2000
#>  8 democracy Law                          eng      56733 TextCo…    1900    2000
#>  9 democracy Bibliography, Library Scien… eng      55101 TextCo…    1900    2000
#> 10 democracy History of America           eng      48800 TextCo…    1900    2000
#> # … with 12 more rows, and abbreviated variable names ¹languages, ²counttype,
#> #   ³min_year, ⁴max_year
#> # ℹ Use `print(n = ...)` to see more rows
Among texts that have some classification (the “unknown” group is by far the largest single row), about 92,000 political science texts mention the term.
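The counts printed above allow a rough check of how large the unclassified group is. This sketch uses only the ten rows shown (the remaining 12 rows are not included), and it assumes each text is counted under a single classification, which the article does not confirm.

```r
# TextCount values copied from the first ten rows of the output above
unknown_texts <- 697511
classified_shown <- c(186912, 133276, 117172, 91759, 65825,
                      61692, 56733, 55101, 48800)   # rows 2 to 10

# Total over the nine classified rows that are shown
classified_total <- sum(classified_shown)
classified_total

# Is "unknown" larger than any single classified row?
unknown_texts > max(classified_shown)

# Share of "unknown" among the ten rows shown
unknown_texts / (unknown_texts + classified_total)
```

Because the hidden rows would add to the classified total, the share computed here should be read as a rough figure only.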
And this query finds how many English-language volumes published between 1900 and 2000 list Alexis de Tocqueville as first author:
res3 <- query_bookworm(lims = c(1900, 2000),
                       groups = c("date_year"),
                       counttype = "TotalTexts",
                       mainauthor = c("Tocqueville, Alexis de, 1805-1859.",
                                      "Tocqueville, Alexis de, 1805-1859"),
                       languages = "eng")

res3
#> # A tibble: 23 × 3
#>    date_year value counttype
#>        <int> <int> <chr>
#>  1      1900     9 TotalTexts
#>  2      1904     2 TotalTexts
#>  3      1909     1 TotalTexts
#>  4      1948     4 TotalTexts
#>  5      1949     1 TotalTexts
#>  6      1952     2 TotalTexts
#>  7      1954     7 TotalTexts
#>  8      1958     1 TotalTexts
#>  9      1959     4 TotalTexts
#> 10      1961     2 TotalTexts
#> # … with 13 more rows
#> # ℹ Use `print(n = ...)` to see more rows
One can use method = "returnPossibleFields" to return the fields available for limiting a query or grouping the results:
query_bookworm(word = "", method = "returnPossibleFields")
#> # A tibble: 17 × 6
#>    name                    type      description tablename         dbname anchor
#>    <chr>                   <chr>     <chr>       <chr>             <chr>  <chr>
#>  1 lc_classes              character ""          lc_classesLookup  lc_cl… bookid
#>  2 lc_subclass             character ""          lc_subclassLookup lc_su… bookid
#>  3 fiction_nonfiction      character ""          fiction_nonficti… ficti… bookid
#>  4 genres                  character ""          genresLookup      genres bookid
#>  5 languages               character ""          languagesLookup   langu… bookid
#>  6 htsource                character ""          htsourceLookup    htsou… bookid
#>  7 digitization_agent_code character ""          digitization_age… digit… bookid
#>  8 mainauthor              character ""          mainauthorLookup  maina… bookid
#>  9 publisher               character ""          publisherLookup   publi… bookid
#> 10 format                  character ""          formatLookup      format bookid
#> 11 is_gov_doc              character ""          is_gov_docLookup  is_go… bookid
#> 12 page_count_bin          character ""          page_count_binLo… page_… bookid
#> 13 word_count_bin          character ""          word_count_binLo… word_… bookid
#> 14 publication_country     character ""          publication_coun… publi… bookid
#> 15 publication_state       character ""          publication_stat… publi… bookid
#> 16 publication_place       character ""          publication_plac… publi… bookid
#> 17 date_year               integer   ""          fastcat           date_… bookid
We can also get a sample of the book titles and links for a particular year. (It’s a limited sample: the database will only return the top 100 books mentioning the term, weighted by the frequency of the term in the volume.) For example, we can pull out the top 100 books in the category “Education” that mention the word “democracy” in 1941, the year in which education books seem most likely to mention democracy:
res4 <- query_bookworm(word = "democracy",
                       date_year = "1941",
                       lc_classes = "Education",
                       method = "search_results")

res4
#> # A tibble: 100 × 3
#>    htid                      title                                         url
#>    <chr>                     <chr>                                         <chr>
#>  1 nc01.ark:/13960/t2v41mn4r Teaching democracy in the North Carolina pub… http…
#>  2 mdp.39015062763720        The education of free men in American democr… http…
#>  3 uc1.$b67929               The education of free men in American democr… http…
#>  4 mdp.39015068297905        The education of free men in American democr… http…
#>  5 uc1.$b67873               Pennsylvania bill of rights week. Recommenda… http…
#>  6 mdp.39015035886111        Education in a world of fear,                 http…
#>  7 coo.31924013433044        Education in a world of fear,                 http…
#>  8 mdp.39015031665543        Education and the morale of a free people.    http…
#>  9 uiug.30112108068831       Proceedings of the convention.                http…
#> 10 uc1.$b67928               Education and the morale of a free people.    http…
#> # … with 90 more rows
#> # ℹ Use `print(n = ...)` to see more rows
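The sample above contains several scanned copies of the same title under different volume IDs, so the 100 rows cover fewer than 100 distinct works. A minimal sketch of how one might count copies and keep one row per title, using rows 6 to 10 of the printed output (the only rows whose titles are not truncated); matching on the exact title string is a simplifying assumption and would miss variant spellings.

```r
# Rows 6 to 10 of the search results printed above
hits <- data.frame(
  htid = c("mdp.39015035886111", "coo.31924013433044", "mdp.39015031665543",
           "uiug.30112108068831", "uc1.$b67928"),
  title = c("Education in a world of fear,",
            "Education in a world of fear,",
            "Education and the morale of a free people.",
            "Proceedings of the convention.",
            "Education and the morale of a free people."),
  stringsAsFactors = FALSE
)

# Number of scanned copies per title, most frequent first
copies <- sort(table(hits$title), decreasing = TRUE)
copies

# Keep one row per title (the first copy encountered)
unique_titles <- hits[!duplicated(hits$title), ]
nrow(unique_titles)
```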
If you need a bigger sample, use the function workset_builder() to query the Hathi Trust’s Workset Builder 2.0; this can help you download even hundreds of thousands of volume IDs that meet specified criteria. (See also the article on example workflows for more.)
We can investigate any of these volumes further by downloading its associated “Extracted Features” file (that is, a file with token counts and part-of-speech information that the Hathi Trust makes available). Here we download the word frequencies for the second Hathi Trust ID, “The education of free men in American democracy” (available at http://hdl.handle.net/2027/mdp.39015062763720), as a nice tidy tibble suitable for analysis with a package like tidytext.
tmp <- tempdir()
extracted_features <- get_hathi_counts(res4$htid[2], dir = tmp)
extracted_features
#> # A tibble: 20,600 × 6
#>    htid               token       POS   count section  page
#>    <chr>              <chr>       <chr> <int> <chr>   <int>
#>  1 mdp.39015062763720 COMMISSION  NNP       1 body        1
#>  2 mdp.39015062763720 in          IN        1 body        1
#>  3 mdp.39015062763720 Free        NNP       1 body        1
#>  4 mdp.39015062763720 Men         NNP       1 body        1
#>  5 mdp.39015062763720 Democracy   NNP       1 body        1
#>  6 mdp.39015062763720 School      NNP       1 body        1
#>  7 mdp.39015062763720 POLICIES    NNPS      1 body        1
#>  8 mdp.39015062763720 National    NNP       1 body        1
#>  9 mdp.39015062763720 The         DT        1 body        1
#> 10 mdp.39015062763720 Association NNP       2 body        1
#> # … with 20,590 more rows
#> # ℹ Use `print(n = ...)` to see more rows
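Because the counts arrive one row per token, tag, section, and page, most analyses start by summing them over some grouping. The sketch below does this in base R on the ten rows printed above, reduced to the columns it needs; the same aggregation on the full tibble would more naturally use dplyr or tidytext.

```r
# The ten rows printed above, reduced to the columns used here
ef <- data.frame(
  token = c("COMMISSION", "in", "Free", "Men", "Democracy", "School",
            "POLICIES", "National", "The", "Association"),
  POS   = c("NNP", "IN", "NNP", "NNP", "NNP", "NNP", "NNPS", "NNP", "DT", "NNP"),
  count = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Total count per part-of-speech tag
pos_totals <- tapply(ef$count, ef$POS, sum)
pos_totals

# Case-insensitive totals per token, most frequent first
ef$token_lower <- tolower(ef$token)
token_totals <- sort(tapply(ef$count, ef$token_lower, sum), decreasing = TRUE)
head(token_totals, 3)
```

Lowercasing before summing merges forms such as “Democracy” and “democracy”, which is usually what a frequency analysis wants but does discard the capitalisation the tagger saw.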
And we can extract the full metadata for that particular volume, which tells us this title was created by the Educational Policies Commission, National Education Association of the United States and the American Association of School Administrators:
meta <- get_hathi_meta(res4$htid[2], dir = tmp)
meta
#> # A tibble: 1 × 22
#>   htid         schem…¹ id    type  dateC…² title contr…³ pubDate publi…⁴ pubPl…⁵
#>   <chr>        <chr>   <chr> <chr>   <int> <chr> <chr>     <int> <chr>   <chr>
#> 1 mdp.3901506… https:… http… "[[\…  2.02e7 The … "[{\"i…    1941 "{\"id… "{\"id…
#> # … with 12 more variables: language <chr>, accessRights <chr>,
#> #   accessProfile <chr>, sourceInstitution <chr>, mainEntityOfPage <chr>,
#> #   lcc <chr>, lccn <chr>, oclc <chr>, category <chr>, genre <chr>,
#> #   typeOfResource <chr>, lastRightsUpdateDate <int>, and abbreviated variable
#> #   names ¹schemaVersion, ²dateCreated, ³contributor, ⁴publisher, ⁵pubPlace
#> # ℹ Use `colnames()` to see all variable names
You can also browse these titles interactively on the Hathi Trust website:
browse_htids(res4)
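How browse_htids() builds its links is not shown in this article. As a minimal sketch, assuming the handle pattern visible in the URL quoted earlier for mdp.39015062763720 also holds for other volume IDs, one could build links by hand. handle_url() and htid_namespace() are hypothetical helpers, not part of hathiTools.

```r
# Three volume IDs taken from the search results above
htids <- c("mdp.39015062763720", "uc1.$b67929", "coo.31924013433044")

# Assumed pattern: handle prefix followed by the volume ID, as in the URL quoted earlier
handle_url <- function(htid) paste0("http://hdl.handle.net/2027/", htid)

# The part of the ID before the first dot is the namespace of the contributing library
htid_namespace <- function(htid) sub("\\..*$", "", htid)

handle_url(htids[1])
htid_namespace(htids)
```

Grouping search results by namespace is one quick way to see which libraries supplied the scanned copies.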
If you want to download lots of volumes and have rsync installed on your system, the functions rsync_from_hathi() and htid_to_rsync() can facilitate the process; see the article on “Creating and Using Hathi Trust Worksets” for more.
One can get information about the Bookworm corpus itself by using counttype = "TotalWords" or counttype = "TotalTexts" and omitting the ‘word’ key. This query, for example, gives the total number of texts per year and language in the corpus used to build the Bookworm.
res5 <- query_bookworm(counttype = "TotalTexts",
                       groups = c("date_year", "languages"),
                       lims = c(1500, 2022))

library(ggrepel)

res5 %>%
  mutate(language = fct_lump_n(languages, 10, w = value)) %>%
  group_by(date_year, language) %>%
  summarise(value = sum(value)) %>%
  group_by(language) %>%
  mutate(label = ifelse(date_year == max(date_year), as.character(language), NA_character_)) %>%
  group_by(language) %>%
  mutate(rolling_avg = slider::slide_dbl(value, mean, .before = 10, .after = 10)) %>%
  ggplot() +
  geom_line(aes(x = date_year, y = rolling_avg, color = language), show.legend = FALSE) +
  geom_line(aes(x = date_year, y = value, color = language), show.legend = FALSE, alpha = 0.3) +
  geom_text_repel(aes(x = date_year, y = value, label = label, color = language), show.legend = FALSE) +
  scale_y_log10() +
  theme_bw() +
  labs(title = "Total texts per language in the HathiTrust bookworm",
       subtitle = "Log scale. Less common languages grouped as 'other'. 10 year rolling average.",
       x = "Year", y = "")
Figure 4
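The plot keeps the ten languages with the most texts and folds the rest into a single group. The following base R sketch illustrates that weighted lumping on made-up data; lump_top_n() is a hypothetical helper that mirrors the idea behind fct_lump_n(languages, 10, w = value), not its exact behaviour (for instance, it makes no special provision for ties).

```r
# Hypothetical totals; language codes and numbers are made up for illustration
toy <- data.frame(
  languages = c("eng", "ger", "fre", "spa", "lat", "ita"),
  value     = c(500, 120, 100, 40, 10, 5),
  stringsAsFactors = FALSE
)

# Keep the n labels with the largest summed weight, relabel everything else
lump_top_n <- function(labels, weights, n, other = "Other") {
  totals <- sort(tapply(weights, labels, sum), decreasing = TRUE)
  keep <- names(totals)[seq_len(min(n, length(totals)))]
  ifelse(labels %in% keep, labels, other)
}

toy$language <- lump_top_n(toy$languages, toy$value, n = 3)

# Texts per lumped language
lumped_totals <- tapply(toy$value, toy$language, sum)
lumped_totals
```

Weighting by the number of texts rather than by the number of rows matters here, because each language contributes one row per year regardless of how many texts it has.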
For a full list (with some metadata) of all volumes in the Hathi Trust collection, use the functions download_hathifile() and load_raw_hathifile().
[1] The Google Ngram Viewer does have some advantages over the Bookworm, primarily the ability to retrieve data about bigram, trigram, 4-gram, and 5-gram frequencies over time, and to conduct wildcard and part-of-speech searches. This is not currently possible with the Bookworm.