The Hidden Dictionary: Text Mining Eighteenth-Century Knowledge Networks
While written works are often encountered by readers as linear phenomena, one of the most important conceptual advances offered by the Digital Humanities is the way that computational text analysis has permitted researchers to find non-linear patterns that speak to organizational principles embedded in even a single text. The methods developed to parse thousands, or millions, of texts can, in the context of a single work, reveal connections and patterns that are unavailable to a human reader.
Even in reference books, whose alphabetic order discourages the same kind of linearity found in novels, digital methods have proven effective at revealing alternative ordering principles. This has been particular important in eighteenth-century studies. For example, recent digital work on the French
Encyclopédie has sought to assess the compatibility of the multiple ways in which the text was organized by its authors. In their 2002 article, Gilles Blanchard and Mark Olsen measure the knowledge domains described by Diderot in his introduction by counting the number of renvois, or “see alsos” between articles in each domain. Similarly, Heuser, Algee-Hewitt and Bender have also reconstructed the French
Encyclopédie based on which articles are connected by renvois. In both cases, an alternative structure emerges: one that speaks to connections between domains of knowledge that are more meaningful than the alphabetic layout would suggest.
In this project, I employ a similar set of methodologies to explore the other foundational linguistic reference book of the eighteenth century, Samuel Johnson’s 1755
Dictionary of the English Language. While it lacks a system of renvois to counter-balance the prevailing alphabetic order it shares with Diderot’s work, it nevertheless contains a hidden system of connections between seemingly disparate articles, whose organization can only be revealed through quantitative analysis: the quotations used in the definitions of each word. These quotations are what separate Johnson’s dictionary from other, earlier dictionaries. In providing a contextual basis for assessing meaning, Johnson grounds definitions in historical usage and contingent situations. Yet, by Johnson’s own definition, the quotations have an
referential purpose that remains implicit within their use. And, by sheer volume, their presence is the most notable aspect of the dictionary. A given page of the 1775, second edition of the text, defines 17 words using 52 quotations. The typographical imbalance between the definitions and the quotations, which overwhelm the page, is striking, even while this is a fact that should come as no surprise to users of the OED, the spiritual successor to Johnson’s Dictionary.
This project, therefore, seeks to answer three questions. First, who is cited in what contexts in the
Dictionary? Here, a quantitative methodology should allow for unprecedented access to the fine-grained details of the text. Second, if Johnson’s
Dictionary were rearranged to group articles connected by shared quotations together, what patterns of relationship emerge? And finally, how does Johnson use his quotations to reflect back on the works that he cites?
In order to answer these questions, this project begins with an html marked-up copy of Johnson’s 1755
Dictionary. Parsing this text using regular expressions, I was able to compile a table of entries for the 42,400 words identified in the second edition of the Dictionary and the accompanying 115,354 unique quotations. Following Blanchard and Olsen, the most straightforward visualization of the
Dictionary reordered by citations would be a network, where each word is a node, and each edge is a shared citation. In the
Dictionary, however, the sheer number of terms and quotations renders any naïve network graph unusably complex. Johnson’s own text, however, can be used to simplify the connections. For each word that he defines, Johnson also tags it as a part of speech (noun, verb, adjective and adverb). By reducing the articles to these parts of speech, and by consolidating the quotations within these groups, I am able to not only identify large-scale organizational patterns in the
Dictionary but also uncover the different patterns of citation that lie behind these structures.
For example, the works that Johnson cites most distinctively in his definitions of nouns are drawn from a variety of contemporaneous sources: from Dryden and Sidney, who are literary authors, to Boyle (a chemist) and Daniel (a historian) (Table 1).
|Number of Quotations
|Dramatick Poesy; Virgil’s Georgics; Annus Mirabilis
|Colours; Chymical Principles
|Rules for Holy Living; Guide to Devotions
|Fundamentals; Practical Catechism
|Notes on Pope’s Odyssey
|Government of the Tongue
Table 1: Most distinctive authors in the definitions of nouns
By contrast, in his definitions of verbs, Johnson employs predominately Shakespearean tragedy (Macbeth, King Lear, Hamlet, etc.) as well as biblical sources (Genesis, Ecclesiasticus and Job) (Table 2). These patterns indicate that Johnson’s use of citation is not random: there is a clear logic to his choices of quotations for different parts of speech that reflects an implicit theory of the relationship between word meaning and knowledge creation. Here, science is the locus of objects, while tragedy is the locus of action. This not only reveals the ways in which the dictionary organizes language according to an implicit theory of textual meaning, but also how textual meaning is reflexively assigned by the
|Number of Quotations
|Macbeth; King Lear; Coriolanus; Othello; Hamlet; Henry IV
|History of the Turks
|Jane Shore; Royal Convert; Ambitious Stepmother
|Cato; Ovid; Spectator
Table 2: Most distinctive authors in the definition of verbs
The parts of speech reveal an implicit structure to Johnson’s patterns of quotation. Their simplification of the text also allows me to visualize this structure as a meaningful network (Figure 1). In this graph, words are connected by shared authors; however the authors are limited to the group of most distinctive authors for each part of speech group (nouns, adjectives, verbs, adverbs). Similarly, the colors correspond to these groups as well: blue points are adverbs, orange points are adjectives; purple, nouns; and green, verbs. In addition to the four-quadrant structure that corresponds to the four parts of speech, there are other macro-scale phenomena in these graphs as well. Note how nouns are connected to adjectives (as are some verbs), but verbs and adverbs are the least like each other in shared quotations. More importantly, a comparison between this network, and a word embedding analysis of a contemporaneous corpus (the Eighteenth Century Collections Online corpus) using the GloVe algorithm reveals that the association Johnson draws between various terms using quotations are partially reflected in word associations through usage across the century, suggesting that Johnson both drew upon, and influenced, the language use of his time.
Figure 1: Network of Johnson’s
Dictionary limited to authors distinctive of parts of speech
At a much smaller scale, the conceptual patterns that bind together definitions become more apparent. A group of verbs near the edge of the graph captures an overarching thematic set of concerns (Figure 2). “Despise,” “estrange,” “oppress” and “devour” suggest both an affective, as well as thematic unity. There is no reason for these groups to cluster together, unless Johnson sought specifically to create these conceptual groupings. In such cases, the association of words and texts is mutual: just as Shakespeare’s Macbeth provides an illustration for the act of despising, so too does the association of despise (or oppress, estrangement) with Macbeth color our reading of the play. As became apparent in the distinctive words that he uses for quotations, Johnson encodes interpretive theories into the apparently benign act of furnishing illustrative quotations.
Figure 2: Detail of «negative affect» cluster of verbs in part of speech network
Through a computational analysis, the underlying structure of the dictionary reveals the doubled work of lexicography. Quotations are both empirical proofs of contingent meaning, and, in turn, receive meaning themselves through accretion: slow layering of the conceptual unities between the words that they are used to define. The complexities of the dictionary reveal a set of organizing principles embedded in the quotations: this project allows us to extract these patterns and reveal their fundamental influence on the development of, not just lexicography, but language itself across that last three hundred years.