Does "Late Style" Exist? New Stylometric Approaches to Variation in Single-Author Corpora

Jonathan Pearce Reeve (jon.reeve@gmail.com), Columbia University, United States of America

Does “late style” exist? That is, do novelists exhibit a well-defined and distinctive stylistic shift as they reach old age, artistic maturity, or both? Edward Said’s On Late Style: Music and Literature Against the Grain argues not only that such a style does exist, but that it has well-defined characteristics. Said describes late style as, somewhat paradoxically, involving “a nonharmonious, nonserene tension, and above all, a sort of deliberately unproductive productiveness going against” (Said 22). The term “late style,” derived from Thodor Adorno’s concept of Beethoven’s Spästil, is one which Adorno conceives of as “catastrophic” (Adorno 567). As Adorno puts it, “the maturity of the late works of significant artists does not resemble the kind one finds in fruit. They are, for the most part, not round, but furrowed, even ravaged. Devoid of sweetness, bitter and spiny, they do not surrender themselves to mere delectation” (564). To determine whether this claim is more than just anecdotally true, it deserves to be experimentally tested. Using new techniques of computational stylometric analysis, I test whether a writer’s late works are statistically dissimilar to the rest of their corpus. I find that late style is not a statistically quantifiable phenomenon. Instead, the opposite is true: the novelists tested exhibit very distinctive early styles.

Twelve single-author corpora were prepared for this study. These include three novelists Said cites at length: Marcel Proust, Thomas Mann, and Jean Genet, as well as nine novelists from the 19th and 20th centuries, chosen for their prolificacy and electronic availability: Charles Dickens, Joseph Conrad, Ernest Hemingway, Henry James, Walter Scott, George Meredith, Willa Cather, Arnold Bennett, and Mary Augusta Ward. Two samples were taken from each novel in these corpora, so that the internal stylistic similarity of the samples serve as a metric check for the validity of the method. These samples were randomly chosen, to ensure that no text is longer than the shortest text in each corpus, and that that the analysis will compare equal amounts of text.

Each of these samples was then vectorized to 500-dimensional vectors, according to their top 500 word frequencies. These samples were then reduced to five dimensions using principal component analysis (PCA). Five dimensions were used here, instead of the usual two, since a cross-validated grid search in a previous study determined this value to be the most effective at clustering documents according to voice and style. This study also introduces two new metrics for stylistic difference. First, the “distinctiveness score” of a novel sample is calculated by determining the distance of the vector from the mean in five-dimensional space, using the Pythogorean theorem. A late novel that shows a high distinctiveness score, therefore, could correctly be called an instance of “late style.”

Second, I introduce a metric representing the “periodicity” of the writer’s style. This is calculated by first inferring pri­or category labels of early, middle, and late using publication years. Then, the novel’s reduced vectors are clustered using a Baysian Gaussian mixture model, which probabilistically infers three or fewer clusters. These assignments are finally compared using a mutual information score, which calculates the similarity of these clusters with the prior inferences, regardless of label. A high periodicity score indicates that a novelist exhibits distinct stylistic periods, whereas a low score indicates that a novelist has a relatively unchanging or unpredictable stylistic progression.

Figure 1: Thomas Mann

Figure 1 shows a projection of the first two dimensions of the vectors generated from Thomas Mann novels. The sizes of the points represent their relative publication years: small circles are early works, and large circles are late works. The colors represent the clusters predicted from the Bayesian Gaussian mixture model. The samples with the highest distinctiveness scores are from his first work Der Kleine Herr Friedemann and his early work Tristan. The samples showing the least distinctiveness, are from Doktor Faustus, the very work Said cites as an example of a distinctive late style.

Figure 2: Marcel Proust

Figure 2 shows the same projection for samples from the works of Marcel Proust. Proust’s first work, Du côté du chez Swann, is the most distinctive. Proust’s last published work, Le temps retrouvé, which Said cites as an example of late style, is in fact very non-distinctive. Proust’s middle works, however, La prisonnière and Albertine disparue, are only intermediary with respect to publication dates, since they were the final novels he wrote. Here, Said is somewhat correct that Proust has a late style, but misidentifies the works that exemplify it. Again, however, Proust’s early style shows a stronger signal than his late.

Figure 3: Charles Dickens

Figure 3 shows vectors generated from Charles Dickens novels. Here again, the early work The Pickwick Papers has the highest distinctiveness score, followed by David Copperfield. Late works like Our Mutual Friend are among the least distinctive. As the alignment of the point colors and sizes here suggests, Dickens shows a strong periodicity. At 0.469, his is the second-highest periodicity score.

AuthorPeriodicity Score
Proust0.023
Meredith0.028
Ward0.166
Cather0.177
Conrad0.177
Bennett0.220
Hemingway0.326
Scott0.360
Mann0.367
Genet0.457
Dickens0.469
James0.472

Table 1

Table 1 shows the periodicity scores of all the novelists studied here. Those novelists with well-known early and late styles, such as James and Dickens, have high periodicity scores. Writers like Proust, on the other hand, whose novels all form part of the series À la recherche du temps perdu, and were all published within about a decade, show the lowest periodicity scores.

This study, beyond simply testing and ultimately disproving the claims of Said and Adorno, provides a framework for stylometric analysis of textual difference, one which could be used to enhance authorship detection techniques and the techniques of forensic text analysis more generally. More experiments are needed, of course, to test the validity of these techniques beyond the domain of literature.


Appendix A

Bibliography
  1. Adorno, T. (2002). Late Style in Beethoven. In: Essays on Music. Berkeley: University of California Press, 2002. pp. 564–568.
  2. Said, E. (2006). On Late Style: Music and Literature Against the Grain. New York: Pantheon Books.