Rules against the Machine: Building Bridges from Text to Metadata
1. Introduction
Digital literary studies are advancing in their research and require more specific metadata about literary phenomena: narrator (Hoover 2004), characters (Karsdorp et al. 2015), place and period, etc. This metadata can be used to explain results in tasks like authorship attribution or genre detection, or to evaluate digital methods (Calvo Tello 2017). What could be the most efficient way to start annotating this information in corpora of thousands of texts, in languages, genres and historical periods for which many NLP tools are not trained? In this proposal, the aim is to identify specific literary metadata about entire texts with methods that are either language-independent or easily adaptable by humanists.
2. Two Ways from Text to Metadata
The two approaches to classifying unlabeled samples applied here are rule-based classification and supervised machine learning. In rule-based classification (Witten et al. 2011), domain experts define formalised rules that correctly classify the samples. For example, to predict whether a text is written in third person (83% of the corpus) or first person, a rule can be defined over a single token per class value, here the Spanish words dije (‘I said’) and dijo (‘he said’):
- if dijo appears at least 90% more often than dije, the novel is written in third person
- if dijo appears less, the novel is written in first person
The results of applying this rule can be presented as a confusion matrix:
Fig 1. Confusion Matrix of rule-based results about narrator
For supervised methods (Müller and Guido 2016; VanderPlas 2016), labeled samples are needed to train and evaluate the classifier. The following table shows the accuracy achieved by different classifiers and document representations:
|     | raw  | relative | tfidf | zscores |
|-----|------|----------|-------|---------|
| SVC | 0.90 | 0.83     | 0.83  | 0.88    |
| KNN | 0.83 | 0.88     | 0.81  | 0.81    |
| RF  | 0.88 | 0.88     | 0.86  | 0.90    |
| DT  | 0.84 | 0.83     | 0.84  | 0.82    |
| LR  | 0.88 | 0.83     | 0.83  | 0.17    |
| BN  | 0.72 | 0.72     | 0.72  | 0.82    |
| GN  | 0.72 | 0.80     | 0.80  | 0.81    |
Fig 2. Accuracy (F1-score) for narrator
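A minimal sketch of this comparison with scikit-learn, assuming `texts` holds the novels as plain strings and `labels` the narrator annotations (both names are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

representations = {
    "raw": CountVectorizer(max_features=1000),
    "tfidf": TfidfVectorizer(max_features=1000),
}
classifiers = {
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
}

for r_name, vectorizer in representations.items():
    X = vectorizer.fit_transform(texts)  # documents x MFW matrix
    for c_name, clf in classifiers.items():
        scores = cross_val_score(clf, X, labels, cv=10)  # 10-fold CV
        print(r_name, c_name, round(scores.mean(), 2))
```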
3. Corpus and Metadata
The data is part of the Corpus of Spanish Novels of the Silver Age (1880-1939) (used in Calvo Tello et al. 2017), with 350 novels in XML-TEI by 58 authors. Each text has been annotated manually with metadata, and each annotation has been assigned a degree of certainty. The 262 texts with high or medium certainty have been used to create a gold standard with the following classes:
- protagonist.gender
- protagonist.age
- protagonist.socLevel
- setting.type
- setting.continent
- setting.country
- setting.name
- narrator
- representation
- time.period
- end
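As an illustration, such a gold standard can be assembled from a tabular export of the manual annotations; the file name, column names and certainty labels below are hypothetical, since the actual metadata lives in the TEI headers:

```python
import pandas as pd

# Hypothetical metadata table: one row per novel, one column per class,
# plus the annotator's degree of certainty.
meta = pd.read_csv("metadata.csv")
gold = meta[meta["certainty"].isin(["high", "medium"])]
print(len(gold), "texts in the gold standard")  # 262 in this corpus
```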
4. Modelling and Methods
The scripts have been written in Python (available on GitHub). The features have been represented as different document models (Kestemont et al. 2016), sketched in code after the following list:
- raw frequencies
- relative frequencies
- tf-idf
- z-scores
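A minimal sketch of the four representations, assuming `counts` is a documents × MFW NumPy array of raw frequencies; note that tf-idf comes in several variants, and the formula below is only one common choice:

```python
import numpy as np

raw = counts.astype(float)                       # raw frequencies
relative = raw / raw.sum(axis=1, keepdims=True)  # relative frequencies

df = (raw > 0).sum(axis=0)                       # document frequency per word
tfidf = relative * np.log(len(raw) / df)         # one common tf-idf variant

# z-scores: standardise each word's relative frequency across documents
zscores = (relative - relative.mean(axis=0)) / relative.std(axis=0)
```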
Different classification algorithms (evaluated with 10-fold cross-validation) and different numbers of Most Frequent Words (MFW) have been tested. For the rules, a single token was used to represent each class value, and a ratio was assigned for the default class value (see the GitHub repository for the full set of rules). Both approaches were compared to a “most populated class” baseline, which is quite high in many cases.
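This baseline can be reproduced, for instance, with scikit-learn's DummyClassifier (`X` and `labels` as in the sketch above):

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always predicts the most frequent class seen in the training folds
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X, labels, cv=10)
print(round(baseline_scores.mean(), 2))
```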
5. Results
The results of both approaches are as follows:
| Class | F1 baseline | F1 Rule | F1 Cross Mean | F1 Cross Std | Algorithm | Model | MFW | Winner |
|---|---|---|---|---|---|---|---|---|
| end | 0.60 | 0.54 | 0.60 | 0.02 | LR | tfidf | 100 | Baseline |
| narrator | 0.83 | 0.80 | 0.91 | 0.04 | RF | tfidf | 1000 | ML |
| protagonist.age | 0.55 | 0.25 | 0.55 | 0.01 | LR | tfidf | 100 | Baseline |
| protagonist.gender | 0.80 | 0.68 | 0.80 | 0.01 | BN | tfidf | 100 | Baseline |
| protagonist.socLevel | 0.63 | 0.49 | 0.64 | 0.07 | SVC | zscores | 5000 | Baseline |
| representation | 0.88 | 0.80 | 0.88 | 0.01 | LR | tfidf | 100 | Baseline |
| setting.continent | 0.95 | 0.94 | 0.96 | 0.01 | SVC | zscores | 5000 | Baseline |
| setting.continent.binar | 0.95 | 0.95 | 0.95 | 0.19 | LR | zscores | 500 | Baseline |
| setting.country | 0.93 | 0.38 | 0.94 | 0.01 | SVC | zscores | 1000 | Baseline |
| setting.country.binar | 0.87 | 0.47 | 0.88 | 0.03 | SVC | zscores | 1000 | Baseline |
| setting.name | 0.64 | 0.85 | 0.71 | 0.02 | SVC | zscores | 1000 | Rule |
| setting.type | 0.48 | 0.46 | 0.71 | 0.05 | SVC | zscores | 5000 | ML |
| time.period | 0.95 | 0.95 | 0.97 | 0.01 | BN | zscores | 5000 | Baseline |
Fig 3. Results
In many cases the baselines are higher than the results of both approaches. The rule outperformed the baseline for the name of the setting, with very good results. In two cases (narrator and type of setting), machine learning is the most successful approach and its F1 is statistically higher than the baseline (one-sample t-test, α = 5%). The algorithms Support Vector Machines, Logistic Regression and Random Forest are the most successful, while tf-idf and especially z-scores achieved the best results, the latter a data representation “highly uncommon in other applications” outside stylometry (Kestemont et al. 2016).
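The significance test can be sketched as follows; whether the original test was one- or two-sided is an assumption here:

```python
from scipy import stats

def beats_baseline(fold_scores, baseline, alpha=0.05):
    """One-sample t-test: are the per-fold F1 scores significantly
    above the baseline? (one-sided variant, assumed)."""
    t, p = stats.ttest_1samp(fold_scores, baseline)
    return t > 0 and p / 2 < alpha

# Usage with the ten cross-validation F1 scores of a class, e.g. narrator:
# beats_baseline(cv_scores, 0.83)
```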
6. Conclusions
In this proposal I have used simple rules and simple features to detect relatively complex literary metadata, in many cases against high baselines. While machine learning showed a statistically significant improvement for two classes (type of setting and narrator), rules worked better for the name of the setting. This is a promising starting point for further research towards annotating the rest of the corpus.
Appendix A
- Calvo Tello, J. (2017). What does Delta see inside the Author? Evaluating Stylometric Clusters with Literary Metadata. III Congreso de la Sociedad Internacional Humanidades Digitales Hispánicas: Sociedades, Políticas, Saberes. Málaga: HDH, pp. 153–61 <http://hdh2017.es/wp-content/uploads/2017/10/Actas-HDH2017.pdf>.
- Calvo Tello, J., Schlör, D., Henny-Krahmer, U. and Schöch, C. (2017). Neutralising the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels. Digital Humanities 2017: Conference Abstracts. Montréal: ADHO, pp. 181–83 <https://dh2017.adho.org/abstracts/037/037.pdf>.
- Hoover, D. L. (2004). Testing Burrows’s Delta. Literary and Linguistic Computing, 19(4): 453–75.
- Karsdorp, F., Kestemont, M., Schöch, C. and Bosch, A. van den (2015). The Love Equation: Computational Modeling of Romantic Relationships in French Classical Drama. Sixth International Workshop on Computational Models of Narrative. Atlanta, GA, USA <https://zenodo.org/record/18343>.
- Kestemont, M., Stover, J., Koppel, M., Karsdorp, F. and Daelemans, W. (2016). Authenticating the writings of Julius Caesar. Expert Systems with Applications, 63: 86–96 <http://dx.doi.org/10.1016/j.eswa.2016.06.029>.
- Müller, A. C. and Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. Beijing: O’Reilly.
- VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. First edition. Beijing: O’Reilly.
- Witten, I., Frank, E. and Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition. San Francisco: Morgan Kaufmann.