Using Topic Modelling to Explore Authors’ Research Fields in a Corpus of Historical Scientific English

Stefan Fischer (stefan.fischer@uni-saarland.de), Universität des Saarlandes, Germany und Jörg Knappen (j.knappen@mx.uni-saarland.de), Universität des Saarlandes, Germany und Elke Teich (e.teich@mx.uni-saarland.de), Universität des Saarlandes, Germany

1. Introduction

In the digital humanities, topic models are a widely applied text mining method (Meeks and Weingart, 2012). While their use for mining literary texts is not entirely straightforward (Schmidt, 2012), there is ample evidence for their use on factual text (e.g. Au Yeung and Jatowt, 2011; Thompson et al., 2016). We present an approach for exploring the research fields of selected authors in a corpus of late modern scientific English by topic modelling, looking at the topics assigned to an author’s texts over the author’s lifetime. Areas of applications we target are history of science, where we may be interested in the evolution of scientific disciplines over time (Thompson et al., 2016; Fankhauser et al., 2016), or diachronic linguistics, where we may be interested in the formation of languages for specific purposes (LSP) or specific scientific “styles” (cf. Bazerman, 1988; Degaetano-Ortlieb and Teich, 2016).

We use the Royal Society Corpus (RSC, Kermes et al., 2016), which is based on the first two centuries (1665–1869) of the Philosophical Transactions and the Proceedings of the Royal Society of London. The corpus contains 9,779 texts (32 million tokens) and is available at https://fedora.clarin-d.uni-saarland.de/rsc/ . As we are interested in the development of individual authors, we focus on the single-author texts (81%) of the corpus. In total, 2,752 names are annotated in the single-author papers, but the activity of authors varies. Figure 1 shows that a small group of authors wrote a large portion of the texts. In fact, the twelve authors used for our analysis wrote 11% of the single-author articles.

 Productivity of writers of single-author papers
Figure 1. Productivity of writers of single-author papers

2. Approach

The topic modelling approach we take as a basis is Latent Dirichlet Allocation (LDA, Blei et al., 2003). LDA assumes that corpora contain a number of recurring topics and it treats texts as bags of words. Topics, which can be regarded as groups of semantically related words, are represented as probability distributions over words and each text is treated as a mixture of topics. Typically, topics are displayed as lists of the most probable words and labels are assigned manually. We also considered author-topic models (Rosen-Zvi et al., 2010) but their author-topic matrix implies that authors’ topics are fixed over time.

As disciplines were not part of the original metadata of the RSC, we applied topic modelling to approximate disciplines. Using MALLET (McCallum, 2002), we built a model with 24 topics, which are shown along with their characteristic words in Figure 2.

 Topic labels and top words
Figure 2. Topic labels and top words

Following the approach of Fankhauser et al. (2016), we clustered the topics using Jensen–Shannon divergence. Figure 3 shows the resulting topic hierarchy. Based on this clustering, we identified broader research areas, which we marked on the branches of the dendrogram.

 Hierarchical clustering of the 24 topics
Figure 3. Hierarchical clustering of the 24 topics

3. Results

Using these broader categories, we explore whether individual authors stayed in the same area or shifted their focus during their time of scientific production. For this purpose, we selected the most prolific authors (29–198 articles) in the RSC and tracked their topics over time (see Figures 4 and 5). We excluded names if we could not identify the author in the Virtual International Authority File or if publication years did not match the author’s lifetime.

 Comparison of topics of most prolific authors
Figure 4. Comparison of topics of most prolific authors

Figure 4 shows the topics used by twelve authors during their career. We can see two groups of authors. Authors like Arthur Cayley dedicated their life to a single research area whereas Humphry Davy worked on two topics or in an interdisciplinary area. Figure 5 shows the development of the same authors over time. Overall, the authors’ interests did not change dramatically over their professional life. However, one can identify a peak of productivity for most authors.

 Development of individual authors over time
Figure 5. Development of individual authors over time

4. Conclusion

We proposed to use topic modelling as a method of exploring the development of the scientific orientation of individual authors over time. Taking topic as an approximation of discipline, our approach can be used to explore the contribution of a particular author to a given discipline over time or find authors with potentially interesting production profiles (e.g. authors shifting topics). In our future work, we will improve our models (e.g. avoid potential confusion of namesakes) by better metadata on the authors which we will obtain from the Royal Society.

5. Acknowledgement

We acknowledge the support of DFG (Deutsche Forschungsgemeinschaft) through the Cluster of Excellence Multimodal Computing and Interaction (MMCI).


Appendix A

Bibliography
  1. Au Yeung, C. and Jatowt, A. (2011). Studying How the Past is Remembered: Towards Computational History Through Large Scale Text Mining. Proceedings of the 20th ACM International Conference on Information and Knowledge Management. (CIKM ’11). Glasgow, Scotland, UK: ACM, pp. 1231–1240.
  2. Bazerman, C. (1988). Shaping Written Knowledge: The Genre and Activity of the Experimental Article in Science. Madison: University of Wisconsin Press.
  3. Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 993–1022.
  4. Degaetano-Ortlieb, S. and Teich, E. (2016). Information-based Modeling of Diachronic Linguistic Change: from Typicality to Productivity. Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Berlin, Germany: Association for Computational Linguistics, pp. 165–173.
  5. Fankhauser, P., Knappen, J. and Teich, E. (2016). Topical Diversification over Time in the Royal Society Corpus. Digital Humanities 2016: Conference Abstracts. Kraków: Jagiellonian University & Pedagogical University, pp. 496–500.
  6. Kermes, H., Degaetano-Ortlieb, S., Khamis, A., Knappen, J. and Teich, E. (2016). The Royal Society Corpus: From Uncharted Data to Corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA).
  7. McCallum, A. K. (2002). MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu (accessed 1 April 2018).
  8. Meeks, E. and Weingart, S. B. (2012). The Digital Humanities Contribution to Topic Modeling. Journal of Digital Humanities, 2(1): 1–6.
  9. Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P. and Steyvers, M. (2010). Learning Author-topic Models from Text Corpora. ACM Transactions on Information Systems (TOIS), 28(1): 4:1–4:38.
  10. Schmidt, B. M. (2012). Words Alone: Dismantling Topic Models in the Humanities. Journal of Digital Humanities, 2(1): 49–65.
  11. Thompson, P., Batista-Navarro, R. T., Kontonatsios, G., Carter, J., Toon, E., McNaught, J., Timmermann, C., Worboys, M. and Ananiadou, S. (2016). Text Mining the History of Medicine. PLOS ONE, 11(1): 1–33.