Whose Signal Is It Anyway? A Case Study on Musil for Short Texts in Authorship Attribution
1. State of the art and experimental design
Robert Musil, one of the most important authors of twentieth-century German-written literature, fought in the Austrian army at the Italian front. During WWI, between 1916 and 1917, Musil was chief editor of the Tiroler Soldaten-Zeitung (TSZ) in Bozen. This activity has posed a philological problem to Musil scholars, who have not been able to attribute with certainty a range of texts to the author. However, solving the riddle of authorship for this particular set of texts promises a great advancement in the study of Musil’s political thinking. With this paper, we present a new approach that combines historical and philological research with stylometric methods.
The determination of possible authorship starts with reviewing the literature for previous attempts. There are 38 articles in the TSZ for which Musil’s authorship has been proposed at least once (see Table 1).
|Text #||Title||Date of publication||Attributed by|
|Excl_1||Der Weg zu den Sternen||08.07.1916||C, FL|
|Excl_2||Aus der Geschichte eines Regiments||26.07.1916||C, FL|
|1||Kameraden arbeitet mit!||06.08.1916||A, FL|
|2||Bin ich ein Österreicher?||20.08.1916||A, FL|
|3||Herr Tüchtig und Herr Wichtig||27.08.1916||C, FL|
|4||Das Schlagwort||27.08.1916||A, FL|
|5||Die Erziehung zum Staat||03.09.1916||A, FL|
|Excl_3||Kunst hinter der Front||08.10.1916||C|
|7||Sonderbare Patrioten||15.10.1916||A, FL|
|8||Noch einmal Bauernleben||29.10.1916||C|
|Excl_4||Kannst Du deutsch [III]||12.11.1916||A, FL|
|10||Eine gute persönliche Beziehung||26.11.1916||A, FL|
|11||Eine österreichische Kultur||10.12.1916||R, A, FL|
|12||Der Nörgler und der neue Österreicher||17.12.1916||A, FL|
|13||Das Kompromiß||24.12.1916||A, FL|
|14||Heilige Zeit||31.12.1916||A, FL|
|15||Zentralismus und Föderalismus||07.01.1917||FL|
|16||Föderalismus oder Zentralismus||14.01.1917||FL|
|Excl_6||Kannst Du Deutsch [V]||21.01.1917||A, FL|
|Excl_7||Vorpolitische Reinigung||04.02.1917||A, FL|
|Excl_8||Kannst Du Deutsch [VI]||04.02.1917||A, FL|
|17||Zu Milde und zu Wilde||11.02.1917||A, FL|
|Excl_9||Aus einer öffentlichen Schwulstfabrik||18.02.1917||A, FL|
|Excl_10||Schnucki in der „großen Zeit“||18.02.1917||A, FL|
|19||Ist die »österreichische Frage« schwierig?||04.03.1917||FL|
|20||Seiner Hochwohlgeboren!||04.03.1917||D, A, FL|
|23||Der Frieden versprochen!||18.03.1917||FL|
|24||Das Staatsprogramm der Deutschen||18.03.1917||A, FL|
|25||Wehe dem Staatsmann!||25.03.1917||FL|
|26||Der Frieden und die Zukunft||01.04.1917||FL|
|27||Presse und Krieg||08.04.1917||FL|
|28||Vermächtnis||15.04.1917||D, R, C, A, FL|
Table 1. TSZ articles attributed to Musil, derived from (Schaunig, 2014). D = (Dinklage, 1960); R = (Roth, 1972); C = (Corino, 1973, 2003, and 2010); A = (Arntzen, 1980); FL = (Fontanari and Libardi, 1987).
The major problem for carrying out a stylometric analysis on the texts published in the TSZ is their shortness. As demonstrated by (Eder, 2015), the minimum length for a reliable authorship attribution is around 5,000 words. However, the average length of the 38 disputed TSZ articles is slightly below 1,000 words (see Figure 1).
Figure 1. TSZ articles’ lengths
As a possible solution for this issue, we developed a combinatory design that analyzes longer chunks composed by the juxtaposition of single texts. To reduce the number of combinations, we excluded the 9 shortest texts (below 500 words), together with the only text (Excl_2 in Table 1) that has been solidly attributed to Musil on philological grounds (Corino, 1973). This leaves us with a corpus of 28 texts, already digitized by (Amann et al., 2009). The optimal configuration was obtained by combining groups of 6 texts. This permutation generated 376,740 text chunks with an average length of N=6,963 words and a standard deviation of 909 words.
As for the composition of the training set, we combined the stylometric “impostors method” (Koppel and Winter, 2014) with historiographical research. Following (Juola, 2015), we thus fixed the number of “impostors” to a minimum of three, identifying as likely candidates Franz Blei, Franz Kafka, and Stefan Zweig. In addition, we selected three possible TSZ collaborators: Marie delle Grazie, Hugo Salus, and Albert Ritter (cf. Urbaner, 2001). While all others were digitally available, we manually retrodigitized Ritter’s texts. The training set was completed by a selection of articles authored by Musil in various journals between 1911 and 1919. For each author, the retrieved material was subdivided in three text chunks (length ranging 6,000–8,000 words): the training set was thus composed of 21 text chunks.
2. The Experiment
Validation and experimentation were carried out using the R package Stylo (see Eder et al., 2016). A 20-fold stratified validation had the following results: (1) distance measures (with the exception of Cosine) work slightly better than machine learning algorithms; (2) word-based analysis outperforms 10-character n-grams (cf. Halvani et al., 2016: 39) ; (3) Fig. 2 shows that accuracy levels fluctuate substantially between 10 and 400 MFWs.
|Mean accuracy||(with 10-char. n-grams)|
Table 2. Stratified validation results
Figure 2. Stratified validation results
For these reasons, we limited feature selection to altogether 16 combinations of: (1) the distance measures Burrows’s Delta, Eder’s Delta, Cosine Delta, and Canberra; (2) the frequency strata 10–100 MFWs, 20–200 MFWs, 50–500 MFWs, and 100–1,000 MFWs.
For each iteration, the distances between test set and training set were saved into a matrix. At the end of the process, mean values were calculated. In all 16 configurations, Ritter and Musil are the only candidates for authorship of the TSZ articles. This evidence has been corroborated by the discovery of a document in the Kriegsarchiv in Wien, which confirms that Ritter was part of the TSZ editorial team (see Fig. 3).
Figure 3. Ritter in the TSZ team. Source: Kriegsarchiv, Wien
The stylometric results are synthetized by Fig. 4, which represents only the distances between Musil’s and Ritter’s signals. For highlighting the distinctions, measures were normalized to a range between –1 and +1. A general trend is evident: while, for the articles published in 1916 (articles 1–14), figures point quite clearly to Musil’s authorship, the picture is less clear for the articles published in 1917 (articles 15–28). In no case, however, Ritter’s signal is clearly dominant. Musil thus appears as the most likely author, with the following caveats: First, the combinatory design, while having shown the dominance of Musil’s signal, may have suppressed different, minor signals. Second, Musil, in the role of chief editor, may have altered many articles in the journal, thus intermixing his authorial signal with those of others. By consequence, results that question Musil’s authorship are as a tendency more substantial than those that corroborate it.
Figure 4. Experiment results (full test set)
In a second experiment, we split the corpus in two, applying the same experimental set-up. The first sample just contained those texts that were not attributed to Musil by at least two distance measures (N=14). Here, Ritter appeared as dominant throughout the whole selection. The second sample contained the texts that had been relatively clearly attributed to Musil in the first round (N=14): results show that here, Musil’s signal was even more dominant, with all values close to +1 (see Fig. 5).
Figure 5. Experiment results (split test set)
When further reducing the selection to 9 texts (those for which all classifiers scored less than –0.5 in the previous round, see Fig. 5), all texts were attributed to Ritter with a stronger probability, while the graph generated by the remaining 19 texts was still confined to Musil’s region (see Fig. 6). In answer to our research question, results suggest that Musil attribution may be disproved with a high level of confidence for texts No. 15, 16, 18, 19, 22, 23, 25, 26, and 27 (see Table 1 for details). At the same time, our analysis proposes that Ritter may be the author of these 9 articles.
Figure 6. Experiment results (split test set)
3. Future research
Future expansion of our research should define new training sets to validate the results and increase the test set. Both, however, will require an extensive digitization effort: most of the useful texts (i.e., propagandistic WWI writings) are not available in a clean plain-text format. In addition, other software should be tested on the already defined corpus, e.g., JGAAP (Juola et al., 2008) and CLEF/PAN (Stamatatos et al., 2015), as these consider features and methodologies excluded from the present experiment, such as character n-grams and machine learning.
With our study, we hope to have laid the groundwork for a research that can have long-lasting consequences on the historiography of German literature, evidencing at the same time how quantitative methods are not in opposition, but complementary to the qualitative strands (Herrmann, 2017) of literary history.
- Arntzen, H. (1980). Musil-Kommentar sämtlicher zu Lebzeiten erschienener Schriften außer dem Roman “Der Mann ohne Eigenschaften”. München: Winkler.
- Corino, K. (1973). Robert Musil, Aus der Geschichte eines Regiments. Studi Germanici, 11: 109–15.
- Corino, K. (2003). Robert Musil: eine Biographie. Reinbek bei Hamburg: Rowohlt.
- Corino, K. (2010). Klaviersonnen über Schluchten des Gemüts. Robert Musil und die Musik. Das Plateau, 120: 4–21.
- Dinklage, K. (1960). Robert Musil. Leben, Werk, Wirkung. Zürich: Amalthea Verlag.
- Eder, M., Kestemont, M. and Rybicki, J. (2016). Stylometry with R: a package for computational text analysis. R Journal, 8(1): 107–21.
- Eder, M. (2015). Does size matter? Authorship attribution, small samples, big problem. Digital Scholarship in the Humanities, 30(2): 167–82.
- Fontanari, A. and Libardi, M. (1987). La guerra parallela. Trento: Reverdito.
- Halvani, O., Winter, C. and Pflug, A. (2016). Authorship verification for different languages, genres and topics. Digital Investigation, 16: 33–43.
- Herrmann, J. B. (2017). In test bed with Kafka. Introducing a mixed-method approach to digital stylistics. Digital Humanities Quarterly, [in press].
- Juola, P., Noecker, J., Ryan, M. and Zhao, M. (2008). JGAAP3.0 – authorship attribution for the rest of us. Digital Humanities 2008: Book of Abstracts. Oulu: University of Oulu, pp. 250–51.
- Juola, P. (2015). The Rowling case: a proposed standard analytic protocol for authorship questions. Digital Scholarship in the Humanities, 30: 100–13.
- Koppel, M. and Winter, Y. (2014). Determining if two documents are by the same author. JASIST, 65(1): 178–87.
- Roth, M.-L. (1972). Robert Musil. Ethik und Ästhetik. München: List.
- Schaunig, R. (2014). Der Dichter im Dienst des Generals. Robert Musils Propagandaschriften im Ersten Weltkrieg. Klagenfurt: Kitab.
- Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M. and Stein, B. (2015). Overview of the Author Identification Task at PAN 2015. CLEF 2015: Working Notes. http://ceur-ws.org/Vol-1391/inv-pap3-CR.pdf [accessed 26.11.2017].
- Urbaner, R. (2001). “... daran zugrunde gegangen, daß sie Tagespolitik treiben wollte”? Die “(Tiroler) Soldaten-Zeitung” 1915-1917. eForum zeitGeschichte, 3/4. www.eforum-zeitgeschichte.at [accessed 26.11.2017].