Searching for Concepts in Large Text Corpora: The Case of Principles in the Enlightenment

Stephen Osadetz (osadetz@fas.harvard.edu), Harvard University, United States of America
Kyle Courtney (kyle_courtney@harvard.edu), Harvard University, United States of America
Claire DeMarco (claire_demarco@harvard.edu), Harvard University, United States of America
Cole Crawford (cole_crawford@fas.harvard.edu), Harvard University, United States of America
Christine Fernsebner Eslao (eslao@fas.harvard.edu), Harvard University, United States of America

At the beginning of a research project, every scholar in the humanities asks a question: what should I read? This paper presents a new search engine that sifts through large corpora of unstructured text to find particular passages that deal with a concept of interest. Its underlying algorithm is based on concept search, a technique originally developed in legal practice to automate document review (Blair, 1985; Bai, 2005; Algee-Hewitt, 2008; Zhu, 2009; King, 2017). Our search engine builds upon that technique by applying it to large corpora relevant to humanistic scholarship and, crucially, by dividing each text into passages of a standard size, which improves the specificity of results. To demonstrate the potential of this technique, our paper focuses on its application to a specific problem in eighteenth-century intellectual history, while also discussing its most significant theoretical implications, including a reevaluation of the tight connection that has developed between the great unread and its computational evaluation through distant reading (Cohen, 1999; Moretti, 2013).

Concept search for large text corpora

Our team has expanded the paradigm of text discovery by developing a search engine that sifts through large digitized corpora to identify passages that deal with particular concepts. The algorithm itself is simple. After reading deeply in a subject, a researcher gathers a number of passages from various sources that focus on a particular concept of interest. The researcher then uses a word-frequency analyzer to judge which of the terms in those passages are most important to describing the concept. That keyword set is used to perform a search. The search uses a number of statistical measures to determine which items are likely to be most relevant, and it delivers results either as a downloadable CSV spreadsheet or through a graphical interface. The researcher can then read through the results in a preliminary fashion to judge whether they are satisfactory. If they are not, the process can be iterated.
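
The keyword-selection step described above can be sketched roughly as follows. This is a minimal illustration, not the word-frequency analyzer our team uses: the stopword list, the function names, and the ranking rule (number of loading passages containing a term, then total frequency) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it", "as",
             "by", "for", "on", "with", "or", "from", "all", "be", "are", "which"}

def tokenize(text):
    """Lowercase a passage and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def candidate_keywords(loading_set, top_n=10):
    """Rank non-stopword terms by how many loading passages contain them,
    breaking ties by total frequency and then alphabetically."""
    passage_counts = Counter()  # number of passages in which a term occurs
    total_counts = Counter()    # total occurrences across the loading set
    for passage in loading_set:
        tokens = [t for t in tokenize(passage) if t not in STOPWORDS]
        total_counts.update(tokens)
        passage_counts.update(set(tokens))
    ranked = sorted(total_counts,
                    key=lambda t: (-passage_counts[t], -total_counts[t], t))
    return ranked[:top_n]

# Invented loading set, for illustration only.
loading_set = [
    "A principle is a general truth from which other truths follow.",
    "From this single principle the whole science may be deduced.",
    "Every science rests on some first principle.",
]
print(candidate_keywords(loading_set, top_n=2))  # ['principle', 'science']
```

The ranked list is only a set of candidates; the researcher still judges which terms belong in the final keyword set.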

Concept search is a statistically based method of information retrieval that has been adopted widely in legal practice and the business world but is only rarely used in the humanities. It differs from keyword search. Classical Boolean keyword queries are predicated on single terms and phrases, and they simply look for matches between query terms and the documents in a corpus. Concept search instead looks for a cluster of terms drawn from a loading set of passages identified by a researcher, a cluster that potentially encodes the underlying semantic quality of the passages in that set. (This sense of “concept” is statistical and highly artefactual, quite different from the sense of the term in psychology, linguistics, or philosophy.) Concept search does not require the occurrence of any one of its search terms; it relies on a variety of statistical measures to judge which passages in the search corpus are most likely to be useful. Keyword search often misses relevant items because of synonymy (different words for the same idea) and returns non-relevant ones because of polysemy (one word with several senses). Concept search attempts to overcome these problems, as well as that of OCR error, by representing a concept as the statistical likelihood of the occurrence of the cluster of terms in its query set.
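
The contrast between the two kinds of search can be illustrated with a small sketch. The coverage score below (the share of the keyword cluster found in a passage) is a simplified stand-in chosen for the example, not the statistical model our search engine uses; the function names and sample passages are likewise hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a passage and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def boolean_and_match(passage, terms):
    """Classical Boolean AND: every query term must occur in the passage."""
    tokens = set(tokenize(passage))
    return all(t in tokens for t in terms)

def cluster_coverage(passage, keywords):
    """Fraction of the keyword cluster occurring at least once in the passage.
    No single keyword is required for a passage to score above zero."""
    if not keywords:
        return 0.0
    tokens = set(tokenize(passage))
    return sum(1 for k in keywords if k in tokens) / len(keywords)

def rank_by_coverage(passages, keywords):
    """Order passages from most to least covered; ties keep input order."""
    return sorted(passages, key=lambda p: -cluster_coverage(p, keywords))

# Invented cluster and passages, for illustration only.
keywords = ["principle", "maxim", "axiom", "theorem", "science"]
p1 = "This maxim contains a whole science in a single theorem."
p2 = "The principle of the thing was never discussed at dinner."

print(boolean_and_match(p1, ["principle"]))       # False: the exact term is absent
print(rank_by_coverage([p2, p1], keywords)[0] == p1)  # True: p1 covers more of the cluster
```

Here the passage that never uses the word “principle” still ranks first, because it shares more of the cluster, which is the behaviour that helps with synonymy.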

What makes this search especially useful is that results are ordered, not by volume, but by which passage is most likely to be relevant to a particular concept. We originally developed our search engine to search Cengage-Gale’s Eighteenth Century Collections Online database, dividing its volumes into 16 million passages of 1,000 words each. Compare this technique to the traditional method of identifying which volumes to consult. It is akin to asking a librarian for material relevant to a research topic, and having that librarian not only identify which books are likely to be of interest, but also open each volume to the particular page that most clearly deals with that topic.
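
Dividing a volume into passages of a standard size can be sketched in a few lines. The function name is hypothetical, whitespace splitting is an assumption about what counts as a word, and the handling of a short final passage is a choice made for this example.

```python
def split_into_passages(text, size=1000):
    """Split a text into consecutive passages of `size` whitespace-delimited
    words. The last passage holds whatever words remain and may be shorter."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# A synthetic 2,500-word "volume" stands in for a digitized text.
volume = " ".join(f"w{i}" for i in range(2500))
passages = split_into_passages(volume)
print([len(p.split()) for p in passages])  # [1000, 1000, 500]
```

Keeping each passage's position in its volume alongside the text is what allows a result to point to a particular place in a book rather than to the book as a whole.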

Each set of results is essentially an index of hundreds of thousands of passages, sequentially ordered by which are most likely to be relevant. Instead of only displaying page after page of search results that must be consulted one by one, the search engine offers researchers the option of downloading a single spreadsheet that is easy to filter by author, work, and date, and that allows for a quick, global view of which texts are relevant. To make sense of this data, one has to sort and filter judiciously. We have developed a number of standard search filters that can be used repeatedly to select for literary works, canonical works across disciplines, and an author’s gender. Or, should a researcher prefer, she can quickly select for one, two, or fifty authors that she would like to examine. Sorting is equally important. The most basic statistical measure is term frequency, which counts each time a keyword appears in a particular passage. The search engine also allows for sorting by other statistical measures, including the proportion of keywords (useful if there is a great deal of error in a passage due to the process of digitization, or if there is a relatively high number of stopwords) and tf–idf.
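
The three measures named above can be sketched as follows. The definitions are assumptions made for illustration: keyword proportion is taken to be keyword occurrences divided by all tokens in the passage, and tf–idf uses one common variant (raw term frequency times the natural log of the number of passages over document frequency). The formulas our search engine actually uses may differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a passage and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def score_passages(passages, keywords):
    """Return one dict of scores per passage, in input order."""
    keywords = set(keywords)
    counts = [Counter(tokenize(p)) for p in passages]
    n = len(passages)
    # Document frequency: number of passages containing each keyword.
    df = {k: sum(1 for c in counts if c[k] > 0) for k in keywords}
    rows = []
    for i, c in enumerate(counts):
        total = sum(c.values())           # all tokens in the passage
        tf = sum(c[k] for k in keywords)  # keyword occurrences
        tfidf = sum(c[k] * math.log(n / df[k]) for k in keywords if df[k] > 0)
        rows.append({
            "passage": i,
            "term_frequency": tf,
            "keyword_proportion": tf / total if total else 0.0,
            "tfidf": tfidf,
        })
    return rows

# Invented passages, for illustration only.
passages = [
    "principle principle science of things",
    "science of dinner",
    "nothing here",
]
rows = score_passages(passages, ["principle", "science"])
ranked = sorted(rows, key=lambda r: r["tfidf"], reverse=True)
print([r["passage"] for r in ranked])  # [0, 1, 2]
```

Because each row is a plain dictionary, the same rows could be written out with Python's `csv.DictWriter` and then filtered or re-sorted by any column in a spreadsheet.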

Searching for eighteenth-century principles and theoretical implications

To demonstrate the potential of this technique, we present a use case that concerns a specific problem in intellectual history. In the eighteenth century, many of the most famous authors obsessed over the possibility of encapsulating a whole book, or even an entire discipline, in a single proposition called a principle. Among these are the best-known statements of the Enlightenment, including Newton’s inverse-square law of gravity and Kant’s categorical imperative. David Hume put it succinctly. A principle, he said, offers “a whole science in a single theorem” (Hume, 1987). By promising to encapsulate and disseminate an author’s most fundamental ideas, the principle became the preeminent intellectual device of the Enlightenment. Less famous principles, however, might lie buried in books, some of them obscure. The most common research tools and methods (long reading, critical bibliographies, and Library of Congress subject headings) would be of only limited help in discovering these. Our search engine provides an efficient means of discovering passages in which authors frame principles and reflect upon the consequences of this rhetorical obsession.

In accounting for the eighteenth-century enthusiasm for principles, one important question that has been difficult to answer using traditional research methods is what women thought of this rhetorical habit. The principle is undeniably masculinist (etymologically, the word itself is tied to seeds and semen), and the prevailing assumption in the period was that men framed principles, so much so that Rousseau claimed in his pedagogical treatise Émile that women lacked the ability to generalize their ideas. By applying standard filters to our search results, we are able to efficiently identify important passages in works by female authors, in which they chafed against the pervasive insistence that women were shut out from the culture of the principle.

The final section of our paper concerns the theoretical implications of the tool we have developed. We interrogate the common denigration of search that many digital humanists have voiced (Moretti, 2007; Berry, 2012; Jockers, 2013), while also questioning the tight connection that has developed between the critical concepts of “the great unread” and “distant reading” (Moretti, 2000; Sculley and Pasanek, 2008; Trumpener, 2009). Our position is that many of the strongest examples of digital scholarship treat the great unread as a textual noumenon that can only be approached obliquely and in its totality, through the analysis of minimal textual features. The search engine we have developed allows computational methods to invigorate more traditional modes of reading by helping researchers to quickly draw together a project-specific list of works that comprise canonical and non-canonical material. As such, it promises to open new avenues for research for scholars committed to historicist methods, for those exploring alternatives to critical modes of reading (Best and Marcus, 2009; Shore, 2010; Pasanek, 2015), and for those who wish to rethink scholars’ pervasive recourse to context, in an effort to trace how ideas move across history (Felski, 2015).


Bibliography
  1. Algee-Hewitt, M. (2008). The Afterlife of the Sublime: Toward a New History of Aesthetics in the Long Eighteenth Century. ProQuest Dissertations and Theses.
  2. Bai, J., Song, D., Bruza, P., Nie, J-Y., and Cao, G. (2005). “Query Expansion Using Term Relationships in Language Models for Information Retrieval.” Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 688-95.
  3. Berry, D. M. (2012). Understanding Digital Humanities. Palgrave Macmillan.
  4. Best, S. and Marcus, S. (2009). “Surface Reading: An Introduction.” Representations, 1-21.
  5. Blair, D., and Maron, M. (1985). “An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System.” Communications of the ACM: 289-99.
  6. Cohen, M. (1999). The Sentimental Education of the Novel. Princeton University Press.
  7. De Bolla, P. (2013). The Architecture of Concepts: The Historical Formation of Human Rights. Fordham University Press.
  8. Felski, R. (2015). The Limits of Critique. University of Chicago Press.
  9. Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
  10. Hume, D. (1987). Essays, Moral, Political, and Literary. Ed. Eugene F. Miller. Rev. ed. Liberty Fund.
  11. Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. Topics in the Digital Humanities. University of Illinois Press.
  12. King, G., Lam, P., and Roberts, M. E. (2017). “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science. 61.4: 971-88.
  13. King, G., Pan, J., and Roberts, M. E. (2014). “Reverse-engineering Censorship in China: Randomized Experimentation and Participant Observation.” Science. 345.6199: 1251722.
  14. Moretti, F. (2000). “Conjectures on World Literature.” New Left Review 1.54.
  15. Moretti, F. (2013). Distant Reading. Verso.
  16. Moretti, F. (2009). “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).” Critical Inquiry. 36.1: 134-58.
  17. Pasanek, B. (2015). Metaphors of Mind: An Eighteenth-Century Dictionary. Johns Hopkins University Press.
  18. Rousseau, J.-J. (1979). Emile: Or, On Education. Ed. A. Bloom. Basic Books.
  19. Sculley, D., and Pasanek, B. M. (2008). “Meaning and Mining: The Impact of Implicit Assumptions in Data Mining for the Humanities.” Literary and Linguistic Computing. 23.4: 409-24.
  20. Shore, D. (2010). “WWJD? The Genealogy of a Syntactic Form.” Critical Inquiry. 37.1: 1-25.
  21. Trumpener, K. (2009). “Critical Response I. Paratext and Genre System: A Response to Franco Moretti.” Critical Inquiry. 36.1: 159-71.
  22. Zhu, X., and Goldberg, A. B. (2009). “Introduction to Semi-Supervised Learning.” Synthesis Lectures on Artificial Intelligence and Machine Learning. 3.1: 1-130.