Finding Data in a Literary Corpus: A Curatorial Approach

Brad Rittenhouse (, Georgia Institute of Technology, United States of America and Sudeep Agarwal (, Georgia Institute of Technology, United States of America

PI and Presenter: Brad Rittenhouse

Others involved: Taha Merghani, Sudeep Agarwal, Vidya Iyer, Madison McRoy, Sidharth Potdar, Nate Knauf, and Kevin Kusuma

In this short paper, I will discuss an ongoing text analysis project, which applies NLP, topic modeling, mapping, and other methodologies to the Wright American Fiction corpus. From a theoretical standpoint, the project is an extension of my qualitative work, which tracks a notable historical shift in literary data management strategies through the works of two canonical American writers: Herman Melville and Walt Whitman. Both wrote in New York as it grew from a small market town of around 60,000 residents to a global metropolis of nearly 1,000,000 and had to imagine strategies of data management to integrate newly urban, consumerist surroundings into their writings in an effective, efficient manner. Translating increasingly crowded material realities—populated by people, products, and print—into literary data, these writers illustrate an important ontological shift from the positivist data strategies of the Enlightenment to digital logics of aggregation, organization, and metonymic indexing that increasingly address the impossible scale of modern infosopheres.

As relatively privileged subjects, however, these writers’ very ability to integrate and innovate with this information was largely based upon a free access to information (and indeed information overload) that many contemporaries did not enjoy. In short, critics have historically apportioned literary status upon hegemonic standards of information, with prestigious genres like “encyclopedic writing” preferring masculinist topics and knowledge bases such as ballistics (Pynchon), cetology (Melville), violence (Bolaño) over spheres of knowledge historically more accessible and immediate to women and people of color.

My quantitative work looks to sidestep these biases, using an assortment of natural language processing techniques to recover works from the archive that may be performing similarly impressive literary acts of aggregation, but which critics may have overlooked because the works exist in alternative thematic and affective registers. By measuring the accretion of material information across the corpus, and identifying areas of relative density, my process points to writing which humans readers have overlooked but which machines are able to see as substantially similar to canonical encyclopedic works.

We intentionally made a very broad measurement of the text to identify a broader range of artistic expression. The process itself involves chunking all the texts into 500-word segments, performing a parts-of-speech tag with OpenNLP, then rendering these tags in “baby binary”: a “0” for all non-nouns, a “1” for all nouns. We then summed the segments and divided by the total length of each to obtain a noun density measurement, which generally indicates an aggregation of material information. Though it is possible to use more specific grammatical measures (subjects, objects, etc.), we used nouns at-large so as to capture a fuller spectrum of thought, sentiment, and other immaterial objects that accompany the human masses of urbanization.

We also assembled a fair amount of demographic metadata for the corpus, which has allowed us retrieve relatively forgotten works from the archive. After identifying the densest chunks of text, we attempted to identify author gender with the use of the machine learning platform SexMachine. We cross-referenced these results with those derived from the noun-density analysis to pinpoint female authors of interest. To conduct this analysis, we first performed exploratory data analysis to understand the underlying distribution of noun ratios across the corpus, which appeared to be normally distributed, although with a slight right skew. Then we compared this distribution with that of the noun ratios identified for authors of each gender. The distributions seem to be largely similar. This naturally led to an outlier analysis within each gender, which identifies outliers as works with a noun ratio 1.5 interquartile ranges either above or below the median, yielding 71 outliers for male authors and 47 outliers for female authors (43 and 26 on the high-end, respectively). We then performed additional analyses on these outliers to get a better understanding of what differentiated them from the rest of the corpus.

One case study I will present from among these outliers is that of Emma Wellmont, a nineteenth-century temperance writer who the academy has largely ignored, I suspect because of the emotional, sensationalist overtones of her chosen genre. Nonetheless, her work is quantitatively similar to Walt Whitman’s, with many extracts in the highest quadrant of noun density across the corpus and packed with what the latter evocatively refers to as “stuff.” Unlike Whitman, however, her densest passages are often emotional, pathetic scenes of death and suffering. Critics, if they read Wellmont’s work (and most do not), would likely label it sensationalistic or melodramatic, and therefore, unserious, writing. My methodology, on the other hand, makes an argument for her as an important encyclopedist, albeit of canonically unlikely subject matter. I will present the case study through a prototype interactive visualization that allows users to explore the corpus at-large, all the way down to significant passages within individual works (Figure 1).


Figure 1. Figure

This curatorial process builds upon the methods described by Long and So in their recent article “Literary Pattern Recognition: Modernism between Close Reading and Machine Learning,” using high-powered computing and statistical analysis on a corpus scale to identify information-dense passages for later close reading and analysis. Reading literature as information, the methodology is flexible in not only illuminating macro-scale trends, but also identifying human-readable works and passages for literary critics who also value critical reading practices. The project also runs in parallel to Dennis Yi Tenen’s recent work in its “articulation of ‘effect spaces’ via material density,” though it pulls from a broader range of quantitative, grammatical measures in its attempt to broaden the generic construct of encyclopedic writing.

Appendix A

  1. Long, H. and So, R. (2016). Literary pattern recognition: modernism between close reading and machine learning.
    Critical Inquiry, 42(2): 235-267.
  2. Yi Tenen, D. (2018). Toward a computational archaeology of fictional space.
    New Literary History, 49(1): 119-147.

Deja tu comentario