Whose Signal Is It Anyway? A Case Study on Musil for Short Texts in Authorship Attribution

Simone Rebora (simone.rebora@univr.it), University of Verona, Italy and J. Berenike Herrmann (berenike.herrmann@unibas.ch), University of Basel, Switzerland and Gerhard Lauer (gerhard.lauer@unibas.ch), University of Basel, Switzerland and Massimo Salgaro (massimo.salgaro@univr.it), University of Verona, Italy

1. State of the art and experimental design

Robert Musil, one of the most important authors of twentieth-century German-language literature, fought in the Austrian army at the Italian front. During WWI, between 1916 and 1917, Musil was chief editor of the Tiroler Soldaten-Zeitung (TSZ) in Bozen. This activity has posed a philological problem for Musil scholars, who have been unable to attribute a range of TSZ texts to him with certainty. Solving the riddle of authorship for this particular set of texts promises a substantial advance in the study of Musil’s political thinking. In this paper, we present a new approach that combines historical and philological research with stylometric methods.

The determination of possible authorship starts with reviewing the literature for previous attempts. There are 38 articles in the TSZ for which Musil’s authorship has been proposed at least once (see Table 1).

Text # | Title | Date of publication | Attributed by
Excl_1 | Der Weg zu den Sternen | 08.07.1916 | C, FL
Excl_2 | Aus der Geschichte eines Regiments | 26.07.1916 | C, FL
1 | Kameraden arbeitet mit! | 06.08.1916 | A, FL
2 | Bin ich ein Österreicher? | 20.08.1916 | A, FL
3 | Herr Tüchtig und Herr Wichtig | 27.08.1916 | C, FL
4 | Das Schlagwort | 27.08.1916 | A, FL
5 | Die Erziehung zum Staat | 03.09.1916 | A, FL
6 | Bauernleben | 01.10.1916 | C
Excl_3 | Kunst hinter der Front | 08.10.1916 | C
7 | Sonderbare Patrioten | 15.10.1916 | A, FL
8 | Noch einmal Bauernleben | 29.10.1916 | C
9 | Opportunität | 12.11.1916 | FL
Excl_4 | Kannst Du deutsch [III] | 12.11.1916 | A, FL
10 | Eine gute persönliche Beziehung | 26.11.1916 | A, FL
11 | Eine österreichische Kultur | 10.12.1916 | R, A, FL
12 | Der Nörgler und der neue Österreicher | 17.12.1916 | A, FL
13 | Das Kompromiß | 24.12.1916 | A, FL
Excl_5 | Der Augenzeuge | 24.12.1916 | C
14 | Heilige Zeit | 31.12.1916 | A, FL
15 | Zentralismus und Föderalismus | 07.01.1917 | FL
16 | Föderalismus oder Zentralismus | 14.01.1917 | FL
Excl_6 | Kannst Du Deutsch [V] | 21.01.1917 | A, FL
Excl_7 | Vorpolitische Reinigung | 04.02.1917 | A, FL
Excl_8 | Kannst Du Deutsch [VI] | 04.02.1917 | A, FL
17 | Zu Milde und zu Wilde | 11.02.1917 | A, FL
Excl_9 | Aus einer öffentlichen Schwulstfabrik | 18.02.1917 | A, FL
Excl_10 | Schnucki in der „großen Zeit“ | 18.02.1917 | A, FL
18 | Neu-Altösterreichisches | 25.02.1917 | A, FL
19 | Ist die »österreichische Frage« schwierig? | 04.03.1917 | FL
20 | Seiner Hochwohlgeboren! | 04.03.1917 | D, A, FL
21 | Luxussteuern | 04.03.1917 | A, FL
22 | Positive Ziele | 11.03.1917 | FL
23 | Der Frieden versprochen! | 18.03.1917 | FL
24 | Das Staatsprogramm der Deutschen | 18.03.1917 | A, FL
25 | Wehe dem Staatsmann! | 25.03.1917 | FL
26 | Der Frieden und die Zukunft | 01.04.1917 | FL
27 | Presse und Krieg | 08.04.1917 | FL
28 | Vermächtnis | 15.04.1917 | D, R, C, A, FL

Table 1. TSZ articles attributed to Musil, derived from (Schaunig, 2014). D = (Dinklage, 1960); R = (Roth, 1972); C = (Corino, 1973, 2003, and 2010); A = (Arntzen, 1980); FL = (Fontanari and Libardi, 1987).

The main obstacle to a stylometric analysis of the texts published in the TSZ is their shortness. As demonstrated by (Eder, 2015), the minimum length for a reliable authorship attribution is around 5,000 words. However, the average length of the 38 disputed TSZ articles is slightly below 1,000 words (see Figure 1).

Figure 1. TSZ articles’ lengths

As a possible solution to this issue, we developed a combinatory design that analyzes longer chunks composed of several single texts placed side by side. To reduce the number of combinations, we excluded the 9 shortest texts (below 500 words), together with the only text (Excl_2 in Table 1) that has been solidly attributed to Musil on philological grounds (Corino, 1973). This leaves us with a corpus of 28 texts, already digitized by (Amann et al., 2009). The optimal configuration was obtained by combining groups of 6 texts. Taking every possible group of 6 out of the 28 texts generated 376,740 text chunks with an average length of 6,963 words and a standard deviation of 909 words.
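The chunk count follows directly from the number of unordered groups of 6 texts that can be drawn from 28. The following minimal Python sketch checks that figure and shows how such chunks could be enumerated; the function name `iter_chunks` and the placeholder texts are illustrative assumptions, not part of the original experiment, which was run in R.

```python
from itertools import combinations
from math import comb

N_TEXTS = 28     # texts kept in the test corpus
GROUP_SIZE = 6   # texts joined into one chunk

# Number of unordered groups of 6 texts drawn from 28.
n_chunks = comb(N_TEXTS, GROUP_SIZE)

# Number of chunks that contain any one given text.
chunks_per_text = comb(N_TEXTS - 1, GROUP_SIZE - 1)


def iter_chunks(texts, group_size):
    """Yield (ids, joined_text) for every unordered group of texts.

    `texts` maps a text id to its content (hypothetical structure).
    """
    for ids in combinations(sorted(texts), group_size):
        yield ids, " ".join(texts[i] for i in ids)


# Placeholder contents standing in for the real articles.
toy_texts = {i: f"text{i}" for i in range(1, N_TEXTS + 1)}
first_ids, first_chunk = next(iter_chunks(toy_texts, GROUP_SIZE))
```

Because the groups are unordered, every text appears in the same number of chunks (`chunks_per_text`), so no single article is over-represented in the test set.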

As for the composition of the training set, we combined the stylometric “impostors method” (Koppel and Winter, 2014) with historiographical research. Following (Juola, 2015), we set the number of “impostors” to a minimum of three, identifying as likely candidates Franz Blei, Franz Kafka, and Stefan Zweig. In addition, we selected three possible TSZ collaborators: Marie delle Grazie, Hugo Salus, and Albert Ritter (cf. Urbaner, 2001). While the texts of all other authors were digitally available, we manually retrodigitized Ritter’s texts. The training set was completed by a selection of articles authored by Musil in various journals between 1911 and 1919. For each of the seven authors, the retrieved material was subdivided into three text chunks (ranging in length from 6,000 to 8,000 words): the training set was thus composed of 21 text chunks.

2. The Experiment

Validation and experimentation were carried out using the R package Stylo (see Eder et al., 2016). A 20-fold stratified validation had the following results (see Table 2): (1) distance measures (with the exception of Cosine) work slightly better than machine learning algorithms; (2) word-based analysis generally outperforms 10-character n-grams (cf. Halvani et al., 2016: 39), the only exception in Table 2 being Cosine; (3) Fig. 2 shows that accuracy levels fluctuate substantially between 10 and 400 MFWs.

Method | Mean accuracy (%) | Mean accuracy with 10-char. n-grams (%)
Delta | 99.16 | 98.96
Eder | 98.58 | 98.57
Canberra | 99.37 | 99.24
Cosine | 95.03 | 95.40
Cosine Delta | 98.90 | 98.79
SVM | 98.56 | 98.46
k-NN | 95.28 | 94.95
NSC | 95.55 | 95.34

Table 2. Stratified validation results

Figure 2. Stratified validation results

For these reasons, we limited feature selection to a total of 16 combinations of: (1) the distance measures Burrows’s Delta, Eder’s Delta, Cosine Delta, and Canberra; (2) the frequency strata 10–100 MFWs, 20–200 MFWs, 50–500 MFWs, and 100–1,000 MFWs.
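To make the configuration grid concrete, the sketch below enumerates the 16 combinations and implements one of the four measures, Burrows’s Delta, in its usual textbook form: word frequencies are converted to z-scores across the corpus, and the distance between two texts is the mean absolute difference of their z-scores. This is a plain-Python illustration under stated assumptions, not the Stylo implementation; the function name, the data layout, the use of the sample standard deviation, and the toy frequencies are all hypothetical.

```python
from itertools import product
from statistics import mean, stdev

DISTANCES = ["Burrows's Delta", "Eder's Delta", "Cosine Delta", "Canberra"]
MFW_STRATA = [(10, 100), (20, 200), (50, 500), (100, 1000)]

# Every distance measure paired with every frequency stratum.
CONFIGS = list(product(DISTANCES, MFW_STRATA))


def burrows_delta(freqs, a, b):
    """Mean absolute difference of z-scored word frequencies of texts a and b.

    `freqs` maps a text label to a list of relative frequencies, one per
    most frequent word, in the same word order for every text.
    """
    labels = list(freqs)
    n_features = len(freqs[a])
    total = 0.0
    for j in range(n_features):
        column = [freqs[label][j] for label in labels]
        mu, sigma = mean(column), stdev(column)
        if sigma == 0:
            continue  # a word with identical frequency everywhere adds nothing
        z_a = (freqs[a][j] - mu) / sigma
        z_b = (freqs[b][j] - mu) / sigma
        total += abs(z_a - z_b)
    return total / n_features


# Invented relative frequencies for two words in three texts.
toy_freqs = {
    "musil": [0.05, 0.02],
    "ritter": [0.03, 0.04],
    "disputed": [0.05, 0.03],
}
d_musil = burrows_delta(toy_freqs, "disputed", "musil")
d_ritter = burrows_delta(toy_freqs, "disputed", "ritter")
```

In this toy example the disputed text lies closer to the "musil" sample than to the "ritter" sample; with real data the comparison would be repeated for each of the 16 configurations.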

For each iteration, the distances between test set and training set were saved into a matrix. At the end of the process, mean values were calculated. In all 16 configurations, Ritter and Musil are the only candidates for authorship of the TSZ articles. This evidence has been corroborated by the discovery of a document in the Kriegsarchiv in Wien, which confirms that Ritter was part of the TSZ editorial team (see Fig. 3).
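The averaging step can be sketched as follows. The paragraph above does not spell out how the means were grouped, so this example assumes that the distances of every chunk are credited to each article the chunk contains, and that the mean is then taken per article and candidate author. The function name, the data layout, and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_distance_per_text(chunk_distances):
    """Average chunk-level distances for every article and candidate.

    `chunk_distances` is an iterable of (text_ids, {candidate: distance})
    pairs, one pair per analyzed chunk.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for text_ids, distances in chunk_distances:
        for text_id in text_ids:
            for candidate, value in distances.items():
                collected[text_id][candidate].append(value)
    return {
        text_id: {cand: mean(values) for cand, values in by_cand.items()}
        for text_id, by_cand in collected.items()
    }


# Three invented chunks of two articles each.
toy_chunks = [
    ((1, 2), {"Musil": 0.8, "Ritter": 1.2}),
    ((1, 3), {"Musil": 1.0, "Ritter": 1.0}),
    ((2, 3), {"Musil": 1.4, "Ritter": 0.6}),
]
means = mean_distance_per_text(toy_chunks)
```

Averaging over all chunks that contain an article is what allows a per-article reading even though no article is ever analyzed on its own.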

Figure 3. Ritter in the TSZ team. Source: Kriegsarchiv, Wien

The stylometric results are summarized in Fig. 4, which represents only the distances between Musil’s and Ritter’s signals. To highlight the distinctions, measures were normalized to a range between –1 and +1. A general trend is evident: while, for the articles published in 1916 (articles 1–14), figures point quite clearly to Musil’s authorship, the picture is less clear for the articles published in 1917 (articles 15–28). In no case, however, is Ritter’s signal clearly dominant. Musil thus appears as the most likely author, with the following caveats. First, the combinatory design, while having shown the dominance of Musil’s signal, may have suppressed different, minor signals. Second, Musil, in the role of chief editor, may have altered many articles in the journal, thus intermixing his authorial signal with those of others. As a consequence, results that question Musil’s authorship tend to carry more weight than those that corroborate it.
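The exact normalization is not specified here. As one possible reading, the sketch below maps a pair of mean distances to a score between –1 and +1 by dividing their difference by their sum, with +1 standing for a text that is maximally close to Musil and –1 for one maximally close to Ritter. The formula and the name `signal_score` are assumptions made for illustration only.

```python
def signal_score(d_musil, d_ritter):
    """Map two non-negative distances to a score in [-1, +1].

    Positive values mean the text is closer to Musil, negative values
    mean it is closer to Ritter. Assumed formula, not the authors' own.
    """
    total = d_musil + d_ritter
    if total == 0:
        return 0.0  # both distances are zero: no preference
    return (d_ritter - d_musil) / total


# Invented mean distances (Musil, Ritter) for three texts.
toy_distances = {
    "closer to Musil": (0.5, 1.5),
    "undecided": (1.0, 1.0),
    "closer to Ritter": (1.5, 0.5),
}
toy_scores = {label: signal_score(m, r) for label, (m, r) in toy_distances.items()}
```

A score near zero corresponds to the less clear picture described for the 1917 articles, where neither signal stands out.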

Figure 4. Experiment results (full test set)

In a second experiment, we split the corpus in two, applying the same experimental set-up. The first sample contained only those texts that were not attributed to Musil by at least two distance measures (N=14). Here, Ritter appeared as dominant throughout the whole selection. The second sample contained the texts that had been relatively clearly attributed to Musil in the first round (N=14): here, Musil’s signal was even more dominant, with all values close to +1 (see Fig. 5).

Figure 5. Experiment results (split test set)

When the selection was further reduced to 9 texts (those for which all classifiers scored less than –0.5 in the previous round, see Fig. 5), all texts were attributed to Ritter with a higher probability, while the graph generated by the remaining 19 texts was still confined to Musil’s region (see Fig. 6). In answer to our research question, results suggest that Musil’s authorship may be disproved with a high level of confidence for texts No. 15, 16, 18, 19, 22, 23, 25, 26, and 27 (see Table 1 for details). At the same time, our analysis suggests that Ritter may be the author of these 9 articles.
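The selection rule used in this step (keep a text only if every classifier scores below –0.5) can be written as a short filter. The sketch below assumes one normalized score per distance measure and text; the text labels and score values are invented, and the function name is hypothetical.

```python
THRESHOLD = -0.5  # cut-off quoted in the text


def consistently_below(scores_by_text, threshold=THRESHOLD):
    """Return the texts whose scores are all strictly below the threshold."""
    return sorted(
        text
        for text, scores in scores_by_text.items()
        if all(value < threshold for value in scores.values())
    )


# Invented normalized scores for three placeholder texts.
toy_scores = {
    "text_a": {"Burrows": -0.8, "Eder": -0.7, "Cosine Delta": -0.9, "Canberra": -0.6},
    "text_b": {"Burrows": -0.8, "Eder": -0.4, "Cosine Delta": -0.9, "Canberra": -0.6},
    "text_c": {"Burrows": 0.6, "Eder": 0.7, "Cosine Delta": 0.5, "Canberra": 0.9},
}
selected = consistently_below(toy_scores)
```

Requiring agreement across all measures is deliberately strict: a single score above the threshold, as for `text_b`, is enough to keep a text out of the reduced sample.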

Figure 6. Experiment results (split test set)

3. Future research

Future expansion of our research should define new training sets to validate the results and increase the test set. Both, however, will require an extensive digitization effort: most of the useful texts (i.e., propagandistic WWI writings) are not available in a clean plain-text format. In addition, other software should be tested on the already defined corpus, e.g., JGAAP (Juola et al., 2008) and CLEF/PAN (Stamatatos et al., 2015), as these consider features and methodologies excluded from the present experiment, such as character n-grams and machine learning.

With our study, we hope to have laid the groundwork for research that can have long-lasting consequences for the historiography of German literature, while showing that quantitative methods are not opposed, but complementary, to the qualitative strands (Herrmann, 2017) of literary history.


Appendix A

Bibliography
  1. Amann, K., Corino, K. and Fanta, W. (2009). Robert Musil, Klagenfurter Ausgabe. Klagenfurt: Robert Musil-Institut der Universität Klagenfurt.
  2. Arntzen, H. (1980). Musil-Kommentar sämtlicher zu Lebzeiten erschienener Schriften außer dem Roman “Der Mann ohne Eigenschaften”. München: Winkler.
  3. Corino, K. (1973). Robert Musil, Aus der Geschichte eines Regiments. Studi Germanici, 11: 109–15.
  4. Corino, K. (2003). Robert Musil: eine Biographie. Reinbek bei Hamburg: Rowohlt.
  5. Corino, K. (2010). Klaviersonnen über Schluchten des Gemüts. Robert Musil und die Musik. Das Plateau, 120: 4–21.
  6. Dinklage, K. (1960). Robert Musil. Leben, Werk, Wirkung. Zürich: Amalthea Verlag.
  7. Eder, M., Kestemont, M. and Rybicki, J. (2016). Stylometry with R: a package for computational text analysis. R Journal, 8(1): 107–21.
  8. Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. Digital Scholarship in the Humanities, 30(2): 167–82.
  9. Fontanari, A. and Libardi, M. (1987). La guerra parallela. Trento: Reverdito.
  10. Halvani, O., Winter, C. and Pflug, A. (2016). Authorship verification for different languages, genres and topics. Digital Investigation, 16: 33–43.
  11. Herrmann, J. B. (2017). In test bed with Kafka. Introducing a mixed-method approach to digital stylistics. Digital Humanities Quarterly, [in press].
  12. Juola, P., Noecker, J., Ryan, M. and Zhao, M. (2008). JGAAP3.0 – authorship attribution for the rest of us. Digital Humanities 2008: Book of Abstracts. Oulu: University of Oulu, pp. 250–51.
  13. Juola, P. (2015). The Rowling case: a proposed standard analytic protocol for authorship questions. Digital Scholarship in the Humanities, 30: 100–13.
  14. Koppel, M. and Winter, Y. (2014). Determining if two documents are by the same author. JASIST, 65(1): 178–87.
  15. Roth, M.-L. (1972). Robert Musil. Ethik und Ästhetik. München: List.
  16. Schaunig, R. (2014). Der Dichter im Dienst des Generals. Robert Musils Propagandaschriften im Ersten Weltkrieg. Klagenfurt: Kitab.
  17. Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M. and Stein, B. (2015). Overview of the Author Identification Task at PAN 2015. CLEF 2015: Working Notes. http://ceur-ws.org/Vol-1391/inv-pap3-CR.pdf [accessed 26.11.2017].
  18. Urbaner, R. (2001). “... daran zugrunde gegangen, daß sie Tagespolitik treiben wollte”? Die “(Tiroler) Soldaten-Zeitung” 1915-1917. eForum zeitGeschichte, 3/4. www.eforum-zeitgeschichte.at [accessed 26.11.2017].