Stranger Genres: Computationally Classifying Reprinted Nineteenth Century Newspaper Texts

Jonathan D. Fitzgerald, Northeastern University, United States of America, and Ryan Cordell, Northeastern University, United States of America

Since its inception in 2012, the Viral Texts Project has identified several million reprinted texts in corpora of nineteenth-century newspapers. The project began with the aim of isolating texts worthy of closer academic scrutiny from the “big data” of scanned newspapers, but the project’s derived data is itself now so large that it cannot be studied effectively through browsing and reading alone. This poster describes our efforts to theorize and implement one solution to this challenge: computational classification that identifies reprinted texts by genre. The poster will also share a prototype crowd-sourcing experiment that creates a bridge between computational research and various publics by encouraging scholars, students, journalists, and others to explore the strange genres of the nineteenth-century newspaper while enhancing our ground-truth data for training improved classifiers. Following other scholars who affirm the importance of human judgment in computational text analysis (Underwood, 2017; Klein, 2014; Long and So, 2015), our classification method combines unsupervised and supervised modeling: topic modeling and principal component analysis (PCA) to group similar texts within a training set, and generalized linear modeling (GLM) to sort additional texts from the larger corpus. When the PCA data are visualized in three-dimensional space, they cluster around four centers, which, upon closer inspection, can be described as four discrete but overlapping genres: news, advertisements, informational pieces, and literary pieces. Our GLM-based classifier, trained on data derived from PCA and confirmed by human readers, has been successful at finding and identifying thousands of previously unclassified texts in each of these genres.
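
As a rough illustration of the supervised step only, and not the project's actual code, the sketch below fits a multinomial logistic regression (one common kind of generalized linear model) to toy data. Several things here are assumptions: the abstract does not say which GLM family is used, the four-number feature vectors are hypothetical topic proportions, and the labels are generated synthetically rather than derived from PCA clusters and confirmed by readers. The names `toy_examples`, `train`, and `rank_genres` are invented for this example.

```python
import math
import random

# The four genres named in the abstract.
GENRES = ["news", "advertisement", "informational", "literary"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, features):
    """Genre probabilities for one text (a bias term is appended)."""
    x = features + [1.0]
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(scores)


def train(examples, n_features, epochs=300, rate=0.5):
    """Fit multinomial logistic regression by stochastic gradient descent."""
    weights = [[0.0] * (n_features + 1) for _ in GENRES]
    for _ in range(epochs):
        for features, label in examples:
            probs = predict_proba(weights, features)
            x = features + [1.0]
            for k, row in enumerate(weights):
                target = 1.0 if GENRES[k] == label else 0.0
                error = probs[k] - target
                for j, value in enumerate(x):
                    row[j] -= rate * error * value
    return weights


def toy_examples(seed=0, per_genre=20):
    """Synthetic data: each genre is dominated by one hypothetical topic."""
    rng = random.Random(seed)
    examples = []
    for k, genre in enumerate(GENRES):
        for _ in range(per_genre):
            raw = [rng.uniform(0.0, 0.2) for _ in GENRES]
            raw[k] += 1.0
            total = sum(raw)
            examples.append(([v / total for v in raw], genre))
    rng.shuffle(examples)
    return examples


def rank_genres(weights, features):
    """Genres paired with probabilities, most probable first."""
    probs = predict_proba(weights, features)
    return sorted(zip(GENRES, probs), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    model = train(toy_examples(), n_features=len(GENRES))
    for genre, prob in rank_genres(model, [0.1, 0.1, 0.1, 0.7]):
        print(f"{genre}: {prob:.3f}")
```

Returning a full ranking rather than a single label keeps the overlapping character of the genres visible: every text receives a probability for each genre, and the top label is only the most likely of several.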

These early experiments are helping our team more effectively isolate particular genres of texts for deeper literary-historical study, but they are perhaps more valuable for the ways they are helping us reconsider our notions of genre itself in nineteenth-century newspapers. Genres, as noted by other scholars who use computational methods to classify texts by genre (Schöch, 2017), are highly complex and fluid through time. In an effort to avoid presentist or anachronistic readings of genre, we dispense with conceptions of journalistic genres drawn from twentieth- and twenty-first-century newspapers, and attend instead to the much more complex reality of the nineteenth-century newspaper. For example, among the texts found in the “literary” category, we’ve identified many examples of what we call “vignettes”: short prose pieces that are a hybrid of fact and fiction, moral lesson and humorous anecdote. Vignettes of this kind are quite remote from contemporary journalistic genres, and yet we theorize that vignettes encapsulate the hybrid nineteenth-century periodical press.

To make our classification efforts accessible to wider publics, and following other scholars who have done likewise in recent years (Beals, 2017; Mullen, 2016), we have created a crowd-sourcing web application. This app, “The Amazing Generic Automaton,” creates accessible paths into our work by allowing users to read a text alongside its most probable genre according to our classifier and asking them to determine whether the classifier has correctly identified the genre. If a user agrees with the classification, she simply clicks “Yes” to confirm; if, however, the genre does not appear to describe the text, the user may select “No,” and a list of other genres appears, ordered by their probability as determined by the classifier. The user can then select another genre, or instead choose “other,” with a prompt to specify how she might classify the text. The results are saved as CSVs, which, when combined, constitute a new training set for Viral Texts. This app, in addition to confirming some of our classification efforts and providing a larger set of ground-truth data, fulfills a major goal of our work: it makes relatively complex computational work more accessible, thus adding a public face to our scholarship. For other humanities scholars less familiar with computational approaches, this app helps them see classification not as a “binary” decision, but instead as a constellation of overlapping generic probabilities.
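
The confirm-or-correct workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the app's implementation: the column names in `FIELDS`, the function names, and the choice to leave “other” responses out of the combined training set are all hypothetical, since the abstract does not specify the CSV layout or how free-text answers are handled.

```python
import csv
import io

# Hypothetical column layout for one saved session.
FIELDS = ["text_id", "predicted_genre", "confirmed", "final_genre", "note"]


def ranked_alternatives(probabilities):
    """Genres other than the top guess, most probable first."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[1:]


def record_response(text_id, probabilities, agrees, chosen=None, note=""):
    """Build one CSV row from a user's Yes/No answer."""
    predicted = max(probabilities, key=probabilities.get)
    if agrees:
        final = predicted
    else:
        # With no alternative selected, fall back to "other" plus a note.
        final = chosen if chosen is not None else "other"
    return {
        "text_id": text_id,
        "predicted_genre": predicted,
        "confirmed": "yes" if agrees else "no",
        "final_genre": final,
        "note": note,
    }


def write_session(rows, handle):
    """Save one session's responses as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def combine_sessions(handles):
    """Merge session CSVs into (text_id, genre) training pairs.

    Rows marked "other" are skipped here on the assumption that
    they need human review before they can serve as labels.
    """
    pairs = []
    for handle in handles:
        for row in csv.DictReader(handle):
            if row["final_genre"] != "other":
                pairs.append((row["text_id"], row["final_genre"]))
    return pairs


if __name__ == "__main__":
    probs = {"news": 0.6, "literary": 0.25,
             "informational": 0.1, "advertisement": 0.05}
    session = io.StringIO()
    write_session(
        [
            record_response("t1", probs, agrees=True),
            record_response("t2", probs, agrees=False, chosen="literary"),
            record_response("t3", probs, agrees=False, note="riddle"),
        ],
        session,
    )
    session.seek(0)
    print(combine_sessions([session]))
```

Keeping the classifier's original guess alongside the user's final choice in each row makes it possible to measure later how often readers agreed with the classifier, separately from building the new training set.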

The poster we propose will outline our process, describe what we’re learning about genre in the nineteenth-century periodical press, present early results and visualizations, and offer conference attendees the opportunity to try out “The Amazing Generic Automaton.” We expect our presentation will lead to meaningful conversations about innovative approaches to genre classification, the nature of literary genre situated in specific historical periods, and the benefits of creating bridges between complex computational text analysis work and the public.

Appendix A

  1. Beals, M. H. (2017). Scissors-and-Paste-O-Meter Officially Launched for 1800-1900 (accessed 28 November 2017).
  2. Klein, L. F. (2014). Talk at Digital Humanities 2014. Lauren F. Klein (accessed 28 November 2017).
  3. Long, H. and So, R. J. (2015). Literary Pattern Recognition: Modernism between Close Reading and Machine Learning. Critical Inquiry, 42(2): 235–67. doi:10.1086/684353.
  4. Mullen, L. (2016). America’s Public Bible: Biblical Quotations in U.S. Newspapers (accessed 17 April 2018).
  5. Schöch, C. (2017). Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama. Digital Humanities Quarterly, 11(2).
  6. Underwood, T. (2017). We’re probably due for another discussion of Stanley Fish. The Stone and the Shell (accessed 28 November 2017).