Spotting the Character: How to Collect Elements of Characterisation in Literary Texts?
What is a literary character made of? A pragmatic answer is that a character exists as the result of a chain of different linguistic elements scattered throughout the text. The aim of this paper is to propose a digital method for collecting these elements, so as to analyse their nature, to observe their distribution in texts and, ultimately, to contribute to a deeper understanding of the functions that the literary device called “character” assumes in a text.
Projects dedicated to named-entity recognition put a great deal of effort into using Natural Language Processing (NLP) techniques to identify the names of people, places and organisations mentioned in various types of discourse, especially political ones, as well as the co-referential chains built on the basis of these names. However, in spite of important advances in the field, much remains to be done before a computer can correctly link the various phrases referring to the same entity, as well as the pronouns pointing to it (see Schnedecker and Landragin, 2014). In our case, we are interested in those elements of a co-referential chain that bear characterisation features, which inevitably adds a further complication. In addition, we are interested in certain elements (e.g. “his brother” in the phrase “John is his brother”) that are often left aside in named-entity recognition because they perform functions other than strictly pointing towards an entity. NLP techniques therefore did not appear suited to our needs.
We therefore resort to “crowd-reading”, another means of making sense of texts that the expansion of the digital sphere has made available. Very similar to crowdsourcing, crowd-reading asks volunteer contributors to annotate a document by bringing in their own view and understanding, rather than by transcribing it or adding information based on a (sometimes external) form of authority. Given the nature of the work to be done, crowd-reading appeared to be a valid technique in our case.
In a first stage, we submitted a short text (Julio Cortázar, “Continuidad de los parques”) to manual annotation by a hundred students from our universities. This brought to the fore the sheer variety of elements considered to contribute to the characterisation of a literary “person” (nouns and adjectives, of course, but verbs and adverbs too). It also showed the need for a controlled vocabulary that makes it possible to understand what kind of characterisation each respondent attached to the various strings of characters selected as fulfilling this function.
In a second phase, we decided to build an interface that offers our respondents a more ergonomic experience and allows us to extract the selected linguistic elements automatically, as well as to group them by category. Built with XMLmind, this interface is based on a text lightly encoded with TEI tags. Every time respondents select a string of characters, they add an element carrying two attributes:
- a @key attribute, defined by each respondent every time he or she encounters a new character. The keys are subsequently available for reuse in the rest of the text. We expect the number of keys to vary considerably from one reader to another.
- an @ana attribute, with a set of constrained values. Based on another character-analysis project, these values were defined in Galleron, 2017, and cover aspects such as the ontological type of a character, its sex, age, family position, nationality, occupation, and so on.
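As a rough illustration of the extraction step, the following Python sketch reads a lightly annotated fragment and groups the selected strings by character. It is not the interface itself: the element name `rs`, the attribute values, the sample sentence and the function names are all assumptions made for the example, and a real TEI file would additionally require namespace handling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical annotated fragment. The element name <rs> and the values of
# @key and @ana are assumptions for illustration, not the project's schema.
SAMPLE = (
    '<p><rs key="john" ana="sex">John</rs> is '
    '<rs key="john" ana="family-position">his brother</rs>.</p>'
)


def collect_annotations(xml_text, tag="rs"):
    """Return (key, ana, selected string) tuples in document order."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("key"), el.get("ana"), "".join(el.itertext()).strip())
        for el in root.iter(tag)
    ]


def group_by_character(annotations):
    """Group (ana, selected string) pairs under the @key chosen by the reader."""
    grouped = defaultdict(list)
    for key, ana, text in annotations:
        grouped[key].append((ana, text))
    return dict(grouped)


annotations = collect_annotations(SAMPLE)
by_character = group_by_character(annotations)
```

Keeping the selected string alongside its @ana value makes it possible to look both at what was selected and at how each respondent categorised it.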
The text submitted for annotation has been changed for this second experiment: it is now the “Jardin aux sentiers qui bifurquent” (“Jardín de los senderos que se bifurcan”) by Jorge Luis Borges. At the date of this proposal, the second crowd-reading campaign has not yet started, but we expect to have a significant number of answers before the DH conference. Our respondents will again be recruited among the students enrolled in literary studies at our universities. While they have a certain level of training in linguistics, literature and poetics, and are therefore able to recognise the type of linguistic elements we are looking for, their reading still remains close to the “non-informed”, “amateur” reading of the “man in the street” (see Dufays, 2005).
The results will be analysed so as to observe which kinds of linguistic units have been identified most often, and which values of the @ana attribute have been used most often. We will further discuss the divergences between the elements actually selected and those we expected to be selected. This will allow us, on the one hand, to suggest a possible use of our interface as a remediation tool in literary studies, for students who have difficulty extracting pertinent information from a text in order to complete a specific task. On the other hand, we will advance a hypothesis about the observed distribution of the most frequent elements of characterisation. As shown by our first crowd-reading campaign and by our own annotation endeavours, these elements are far from appearing where one would intuitively expect them to be grouped together (so as to “introduce” the character).
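The planned analysis could be sketched along the following lines in Python. The records, respondent identifiers, character offsets and function names below are invented for the example and do not report any result of the campaigns. The sketch simply counts @ana values, counts distinct keys per reader, and bins the selections by their position in the text.

```python
from collections import Counter

# Hypothetical records: (respondent, @key, @ana, character offset of selection).
# All values are invented for illustration.
RECORDS = [
    ("r1", "john", "sex", 0),
    ("r1", "john", "family-position", 8),
    ("r2", "john", "family-position", 8),
    ("r2", "john", "occupation", 30),
    ("r3", "brother", "family-position", 8),
    ("r3", "john", "sex", 0),
]
TEXT_LENGTH = 40  # assumed length of the annotated text, in characters


def ana_frequencies(records):
    """Count how often each @ana value was used across all respondents."""
    return Counter(ana for _, _, ana, _ in records)


def keys_per_respondent(records):
    """Count the distinct @key values defined by each respondent."""
    keys = {}
    for respondent, key, _, _ in records:
        keys.setdefault(respondent, set()).add(key)
    return {respondent: len(found) for respondent, found in keys.items()}


def position_histogram(records, text_length, bins=4):
    """Count selections falling in each equal-width slice of the text."""
    histogram = [0] * bins
    for *_, offset in records:
        index = min(offset * bins // text_length, bins - 1)
        histogram[index] += 1
    return histogram


frequencies = ana_frequencies(RECORDS)
key_counts = keys_per_respondent(RECORDS)
histogram = position_histogram(RECORDS, TEXT_LENGTH)
```

A histogram of this kind is one simple way to check whether characterising elements cluster near a character's first appearance or are spread across the text.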
- Dufays, Jean-Louis; Gemenne, Louis; Ledur, Dominique (2005). Pour une lecture littéraire. Histoire, théories, pistes pour la classe. Bruxelles: De Boeck – Duculot.
- Galleron, Ioana (2017). Conceptualisation of theatrical characters in the digital paradigm: needs, problems and foreseen solutions. Human and Social Studies, 6:1. De Gruyter. Published online 2017-04-18. DOI: https://doi.org/10.1515/hssr-2017-0007.
- Schnedecker, Catherine; Landragin, Frédéric (2014). Les chaînes de référence: présentation. Langages, 3:145, 3-22.