The CLiGS Textbox

José​ Calvo Tello (jose.calvo@uni-wuerzburg.de), Universität Würzburg, Germany and Ulrike Henny-Krahmer (ulrike.henny@uni-wuerzburg.de), Universität Würzburg, Germany and Christof Schöch (schoech@uni-trier.de), Universität Trier, Germany and Katrin Betz (katrin.betz@uni-wuerzburg.de), Universität Würzburg, Germany

1. Introduction

This poster presents the textbox 1 published by the CLiGS 2 group both from the perspective of creating the textbox and of using it for research. The poster will highlight key aspects of the manner in which the collections of literary texts included in the textbox have been compiled, annotated and published. Furthermore it suggests several ways in which the text collections can be used for research in literary studies. This poster aims to showcase the unique XML-TEI-based collections we make available and to encourage their re-use by others.

2. What is the textbox?

The CLiGS textbox is dedicated to making collections of literary texts in Romance languages freely available. It currently contains novels, novellas and short stories published between 1830 and 1940 in France 3 , Spain, Italy, Portugal, and Spanish-America as well as plays published between 1640 and 1680 in France with a total of 357 texts or about 14 million words. The texts are published in XML-TEI as well as in plain text versions and include detailed document-level metadata. All texts are in the public domain and the XML-TEI markup including the metadata is published with a Creative Commons Attribution license (CC-BY) or in case of the Italian novels with a NC-SA-BY. The text collections are curated and published using a public GitHub repository. In addition, main releases are automatically archived on Zenodo.org, a long-term data and publications archiving service for researchers across Europe managed by OpenAire and supported by CERN (see Nielsen, 2013). Each release receives a DOI (Digital Object Identifier), providing the unambiguous identification and long-term availability of the resource.

3. Text selection

The individual text collections were created with various usage scenarios in mind, and each collection has been compiled in a slightly different manner. For example, the two collections of Spanish novels, the Corpus of Spanish Novels (1880-1940) and the Collection of 19th century Spanish-American Novels (1880-1916), have been prepared to be used for authorship attribution. Accordingly, the two collections have been balanced with regard to the number of texts from different authors. The poster will give an overview of the sub-collections of the textbox and also about the principles guiding their compilations.

4. File Formats

Independently of their original source format (e.g. html or EPUB), the texts are prepared (with Python scripts or XSLT) according to a common TEI schema established by the CLiGS group 4 . In addition to that reference format, each collection is made available in a simple plain text format automatically derived from the XML-TEI version, containing only the text included in the body of the novels and plays (in particular, excluding prefaces, other paratext, or notes) and with external metadata provided in tabular format.

Moreover, the collections of French, Spanish, Spanish-American, and Portuguese novels as well as the Italian short stories are made available in a version combining basic structural markup (chapter and sentence divisions) with token-level linguistic annotation, including lemma, part-of-speech, morphology, and basic semantic annotation using Freeling (cf. Padró and Stanislovsky, 2012) and WordNet (see Figure 1). Finally, the collection of French plays is not only available in XML-TEI, but also in the “Zwischenformat” developed by the DLINA group (Kampkaspar et al., 2015).

 Linguistic annotations in an XML format that is a minimal departure from the TEI standard to allow multiple token-level annotations
Figure 1. Linguistic annotations in an XML format that is a minimal departure from the TEI standard to allow multiple token-level annotations

5. Metadata

Besides the administrative metadata like license, responsibility etc. the collections focus on descriptive metadata. There are four main areas about which information is documented: metadata concerning the authorship (VIAF, name, country, gender), metadata concerning the literary work and editions (VIAF or other identifier, extent of the texts, print and the digital source), and finally metadata concerning the genre: Since the main focus of the project is literary genre, a considerable part of the metadata is directly connected to it. Any reference to genre in the title of the work is collected as a genre label. Besides that, a hierarchical system is used, comprising supergenre (e.g. “narrative” or “drama”), genre (that is, novels or novellas), subgenre (the subtype of the novel, for example “adventure novel” or “political novel”) and subsubgenre (optional, used for further differentiations like “war novel”).

6. Usage Scenarios

There are many possible use cases for the textbox collections. The poster will demonstrate some results of these methods from the areas of authorship attribution (using the stylo package for R; Eder et al., 2016), network analysis (using NetworkX in Python), and topic modeling (using MALLET with “tmw” for Python). These scenarios are intended not only as examples of analyses conducted within the CLiGS group, but also as suggestions for potential users of the CliGS textbox, Figure 2 and 3 demonstrate some results for authorship attribution and network analysis.

 Authorship attribution, results of cosine delta on the Corpus of Spanish Novels (cf. Smith and Alridge, 2011; Evert et al., 2017)
Figure 2. Authorship attribution, results of cosine delta on the Corpus of Spanish Novels (cf. Smith and Alridge, 2011; Evert et al., 2017)
 Character network based on number of words spoken in mutual presence (represented by the thickness of the lines), for Jean Racine's tragedy Britannicus (1669)
Figure 3. Character network based on number of words spoken in mutual presence (represented by the thickness of the lines), for Jean Racine's tragedy Britannicus (1669)

Appendix A

Bibliography
  1. E der, M., Kestemont, M. and Rybicki, J. (2016). Stylometry in R: A package for computational text analysis. In The R Journal, 16 (1): 1-15.
  2. E vert, S., Proisl, T., Jannidis, F., Reger, I., Pielström, S., Schöch, C. and Vitt, T. (2017). Understanding and explaining Delta measures for authorship attribution. In Digital Scholarship in the Humanities, 32 (suppl_2): ii4-ii16. doi: 10.1093/llc/fqx023 https://academic.oup.com/dsh/article/doi/10.1093/llc/fqx023/3865676/Understanding-and-explaining-Delta-measures-for (accessed April 26 2018).
  3. K ampkaspar, D., Fischer, F. and Trilcke, P. (2015). Introducing Our ‘Zwischenformat’. In Network Analysis of Dramatic Texts. https://dlina.github.io/Introducing-Our-Zwischenformat/ (accessed April 26 2018).
  4. Nielsen, L. H. (2013). ZENODO – An innovative service for sharing all research outputs. In Zenodo. doi: 10.5281/zenodo.6815 http://doi.org/10.5281/zenodo.6815 (accessed April 26 2018).
  5. Padró, L. and Stanislovsky, E. (2012). FreeLing 3.0: Towards Wider Multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012). Istanbul, Turkey: ELRA. http://nlp.lsi.upc.edu/publications/papers/padro12.pdf (accessed April 26 2018): 2473-2479.
  6. Smith, P. W. H. and Alridge, W. (2011). Improving Authorship Attribution: Optimizing Burrows’ Delta Method. In Journal of Quantitative Linguistics, 18(1): 63-88. doi: 10.1080/09296174.2011.533591.
Notes
2.
CliGS (Computational Literary Genre Stylistics) is an early career research group funded by the German Federal Ministry for Research and Education (BMBF). http://www.cligs.hypotheses.org.
3.
The French collection was partly compiled and annotated by Stefanie Popp.