The re-creation of Harry Potter: Tracing style and content across novels, movie scripts and fanfiction

Marco Büchler (University of Göttingen, Germany), Greta Franzini (University of Göttingen, Germany), Mike Kestemont (University of Antwerp, Belgium) and Enrique Manjavacas (University of Antwerp, Belgium)

1. The tutors

This one-day tutorial will be given by Marco Büchler, Greta Franzini, Mike Kestemont and Enrique Manjavacas.

Endorsement: This workshop is formally endorsed by the Special Interest Group on Digital Literary Stylistics (SIG-DLS).

Mike Kestemont is assistant research professor in the department of Literature at the University of Antwerp. He specializes in computational text analysis for the Digital Humanities, in particular stylometry and machine learning, topics on which he has given dozens of hands-on courses. Whereas his work has a strong focus on historical literature, his present research projects cover a wide range of topics in literary history, including classical, medieval, early modern and modernist texts. Mike currently takes a strong interest in representation learning via neural networks.

Marco Büchler is a computer scientist and leader of the Electronic Text Reuse Acquisition Project (eTRAP) research group at the University of Göttingen. Marco’s research interests concern the processing of natural languages with a specialization in the detection of historical text reuse. Furthermore, he is interested in the mining process and the systematization of changes of text reuse. He has worked in this field for over eight years. Together with his eTRAP team, in the past three years he has organized ten text reuse tutorials.

Greta Franzini is a Classicist and member of the Electronic Text Reuse Acquisition Project (eTRAP) research group at the University of Göttingen. Greta’s research interests concern the production of digital editions of texts as well as the combination of quantitative and qualitative methods to advance computational analyses and linguistic resources for Classical literature. Together with her team, Greta has already given eight text reuse tutorials.

Enrique Manjavacas is a PhD student at the University of Antwerp. He is associated with the Antwerp Centre for Digital Humanities and Literary Criticism. His current research focuses on sequential methods based on recurrent neural networks to develop semantically-infused models for Stylometry and text reuse detection. He is also interested in Natural Language Generation and has been involved in various projects around the concept of Synthetic Literature.

2. Description

Computer-assisted text analysis is a core research area in the Digital Humanities. It embraces a wide variety of applications (stylometry, text reuse detection, topic modelling, etc.) and can assist researchers in complex tasks, particularly when it comes to processing large amounts of text. This tutorial brings together two popular and complementary text analysis tasks, stylometry (the quantitative study of writing style) and text reuse detection. While stylometry typically focuses on stylistic similarities between texts (i.e. how texts are written), text reuse studies are geared towards the reuse of elements across works (i.e. what texts are written about). As such, both methodologies tie into the theoretical notion of intertextuality (Orr 2003), albeit in complementary ways.
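To make the notion of "how texts are written" more concrete, the following minimal sketch represents a text by the relative frequencies of a few function words and compares two such profiles by cosine distance. It is an illustration only, not the tutorial's own material: the word list `FUNCTION_WORDS` and the helper names `tokenize`, `style_profile` and `cosine_distance` are hypothetical, and a realistic analysis would use a much longer word list and far longer texts.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of English function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "was", "he"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocabulary]


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

A smaller distance between two profiles suggests a more similar use of function words, which stylometric studies commonly treat as a proxy for style rather than content.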

Creativity and individuality are at stake in both fields: are writers at liberty to escape their own ‘stylome’, or unique stylistic fingerprint, and to what extent can they free themselves from the many predecessors to whom they are intertextually indebted? (Harold Bloom (1973) famously spoke of the ‘Anxiety of Influence’ in this respect.) This leads to interesting theoretical tensions: if authors are stylistically close to one another, does that imply that we can also expect a more elevated level of text reuse between them (and vice versa)? Or can authors frequently reuse textual elements while developing an independent stylistic profile? To what extent is it theoretically possible to oppose style and content?

In this workshop we offer a hands-on introduction to these topics using the case study of Rowling’s Harry Potter novels. The vast body of academic scholarship on these writings attests to the relevance of this series, including the highly mediatized stylometric study by Patrick Juola (2013) unmasking Rowling as “Robert Galbraith”, the pseudonym under which she temporarily managed to escape her own fame. Intertextuality is also a major concern of Rowling scholarship, and scholars such as Karin Westman (2007) have meticulously analyzed Rowling’s nuanced indebtedness to British authors such as Jane Austen. Rowling herself has invited much intertextual offspring by now too, not least in the form of so-called fanfiction (Milli & Bamman 2016), the global phenomenon where (typically non-professional) writers read, reinterpret and expand, in their own writings (or fanfics), the literary universes (fandoms) originally created by acclaimed authors.

The workshop’s tutorial will focus on offering scholars the practical tools and skills to begin to tackle such complex issues. For text reuse detection, participants will learn how to operate TRACER, a language-independent suite of state-of-the-art Natural Language Processing (NLP) algorithms aimed at discovering text reuse in both historical and modern texts, helping users to identify different types of text reuse ranging from verbatim quotations to paraphrase. For the stylometric analyses and visualizations, participants will mainly use custom scripts that exploit the numerous possibilities of the popular Python library scikit-learn for Machine Learning. Stylometry with R (Eder et al. 2016), a software package for text analysis in R, is another tool that will be used in the introductory sessions.
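As a rough illustration of what detecting verbatim reuse involves, the sketch below breaks two texts into overlapping word n-grams (shingles) and measures how many of the source's shingles recur in the target. This is a generic shingling technique, not TRACER's actual pipeline, and the function names `word_ngrams` and `containment` are hypothetical. It only captures near-verbatim overlap; paraphrase needs more than exact n-gram matching.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source, target, n=3):
    """Return the share of the source's n-grams that also occur in the target."""
    source_grams = word_ngrams(source, n)
    if not source_grams:
        return 0.0  # source shorter than n words: nothing to compare
    shared = source_grams & word_ngrams(target, n)
    return len(shared) / len(source_grams)


# Invented example sentences, for illustration only.
original = "the old owl flew over the castle"
retelling = "at dusk the old owl flew over the castle walls"
unrelated = "the train left the station at noon"

print(containment(original, retelling))  # every shingle of the original recurs
print(containment(original, unrelated))  # no shingle of the original recurs
```

The score is asymmetric by design: it asks how much of one text survives in another, which is a different question from how similar the two texts are overall.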

3. Data

Participants will practise with data provided by the organizers to better familiarize themselves with the software. The texts under analysis will be the seven English-language Harry Potter novels by J. K. Rowling (the so-called core canon of the fandom), a large selection of Harry Potter fanfiction (harvested from Archive of Our Own) as well as the Harry Potter movie subtitles.

4. Objectives

The first objective of the tutorial is to introduce participants to two popular applications of text analysis that tie in closely with intertextuality studies, providing them with an understanding of some of the challenges, methods and strategies proper to this area of research. To this end, we use the illustrative Rowling case study to identify what proportion of the original novels, and how much of their style, the movies and the fanfiction each retain. Additionally, the tutorial seeks to equip participants with the necessary knowledge to independently use the demonstrated software at home (and on their own corpora). Finally, it introduces visualization techniques to display results in an intuitive fashion, provoking new hermeneutic questions.

Appendix A

  1. Bloom, H. (1973). The Anxiety of Influence: A Theory of Poetry. Oxford, New York: Oxford University Press.
  2. Eder, M., Rybicki, J., Kestemont, M. (2016). Stylometry with R: A Package for Computational Text Analysis. The R Journal, 8: 107–121.
  3. Juola, P. (2013). Rowling and “Galbraith”: an authorial analysis. Language Log. (accessed 2 May 2018).
  4. Milli, S., Bamman, D. (2016). Beyond Canonical Texts: A Computational Analysis of Fanfiction. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2048–2053.
  5. Orr, M. (2003). Intertextuality: Debates and Contexts. Polity.
  6. Westman, K.E. (2007). Perspective, Memory, and Moral Authority: The Legacy of Jane Austen in J. K. Rowling’s Harry Potter. Children’s Literature, 35: 145–165.