Handwritten Text Recognition, Keyword Indexing, and Plain Text Search in Medieval Manuscripts

Dominique Stutzmann (dominique.stutzmann@irht.cnrs.fr), Institut de Recherche et d'Histoire des Textes (CNRS), France and Christopher Kermorvant (kermorvant@teklia.com), Teklia, France and Enrique Vidal (evidal@prhlt.upv.es), Universitat Politècnica de València, Spain and Sukalpa Chanda (s.chanda@rug.nl), Rijksuniversiteit Groningen, The Netherlands and Sébastien Hamel (sebastien.hamel@irht.cnrs.fr), Institut de Recherche et d'Histoire des Textes (CNRS), France and Joan Puigcerver Pérez (joapuipe@upv.es), Universitat Politècnica de València, Spain and Lambert Schomaker (l.r.b.schomaker@rug.nl), Rijksuniversiteit Groningen, The Netherlands and Alejandro H. Toselli (ahector@prhlt.upv.es), Universitat Politècnica de València, Spain

1. Introduction

Artificial intelligence has opened up access to the text of medieval manuscripts. The partners of the European research project HIMANIS implemented, for the first time, the indexing and plain text querying of thousands of pages of medieval manuscripts. The large scale of the corpus and the possibility of searching handwritten sources in plain text are unprecedented in medieval studies, so that the results represent a major shift for historians. The challenges of multilingualism, script variation and abbreviations, which are crucial for HTR on medieval sources, have been successfully met.

2. Context

Millions of medieval manuscripts, charters and archival documents are preserved worldwide, and centuries of scholarship and text editions have naturally not exhausted the wealth of these resources. Digital libraries (BVMM, Gallica, e-Codices, Manuscripta Mediaevalia, etc.) and archives (Monasterium) are amassing reproductions of medieval manuscripts and archives, often with scarce metadata. However, while Optical Character Recognition technologies make it easy to “distant read” several million books (Moretti, 2013; Crane, 2006; Clement et al., 2008; GDELT Project, 2015), medieval manuscripts and archives remain difficult to access, read and understand. Handwritten Text Recognition (HTR) systems cannot yet offer sufficiently accurate transcripts of historical documents, and continuous efforts are therefore being made in Europe. After tranScriptorium (McNicholl and Miles-Board, 2015), the EU has funded HIMANIS and also funds Recognition and Enrichment of Archival Documents (READ, 2016-2019) under the H2020 program, with few medieval sources. MONK, developed by the University of Groningen, is also a well-known infrastructure, which includes some medieval resources such as those from the Stadsarchief Leuven.

3. HIMANIS: consortium, corpus, method, and evaluation

Handwritten Text Recognition (HTR) is the focus of the European cross-disciplinary research project HIMANIS (Historical MANuscript Indexing for user-controlled Search), funded by the JPI Cultural Heritage. The partners applied HTR technologies to multilingual medieval manuscripts and demonstrated the feasibility of accurate and meaningful automated text indexing of large collections of hitherto untranscribed text images.

The partners form a cross-disciplinary consortium. The principal investigator is a researcher in the Humanities (Institut de Recherche et d’Histoire des Textes, CNRS) and the project gathered several research teams in engineering sciences, both in the private and public sectors: A2iA (France), University of Groningen (The Netherlands) and Polytechnic University of Valencia (UPVLC, Spain). Cultural Heritage institutions provided support and datasets: Archives Nationales (France) and Bibliothèque nationale de France. UPVLC and the University of Groningen are partners in, or host institutions for, the above-mentioned READ and MONK developments.

As a challenging and particularly interesting corpus, the partners chose the large collection of registers and formularies produced by the French royal chancery in the 14th and 15th c., encompassing 199 volumes, representing 83,000 pages, with 64,830 royal charters in 175 registers, and 24 formularies and related resources. This large and iconic collection bears witness to the rationalization of late medieval administration and is a key source for our understanding of medieval Europe and the rise of centralized nation states on the continent as a consequence of the long-lasting wars between France and England. While HTR on medieval sources is notoriously difficult given the greatly variable handwriting styles, this corpus is even more challenging because of its multilingual content and the large number of abbreviated words.

A first work package consisted in creating the corpus and formatting the available metadata and authority data. The Archives Nationales digitized the corpus in several batches during the project. The metadata on French chancery registers are diverse. In increasing order of information quality, there are: (1.) medieval tables of contents copied in autonomous inventories in the 18th and 19th c.; (2.) index cards with reference to shelfmarks (and rarely folio numbers) containing person and place names; (3a.) printed systematic inventories, including some already converted to EAD without their indexes, as well as (3b.) handwritten systematic inventories which were only accessible in situ, and (3c.) printed geographic or thematic inventories; (4.) partial, rarely scholarly, editions. The partners devised an integrated TEI format to accommodate all four types of metadata, and converted all metadata to this format, including the handwritten inventories, on which the partners applied HTR technologies to recognize the text abstracts and the index entries (Stutzmann et al., 2017). In total, after more than 150 years of systematic research, inventories only covered 28,000 charters, that is ca. 43% of the register corpus, and only one formulary was edited (Odart Morchesne, 2005; Guyotjeannin and Lusignan, 2011). Authority data encompass linguistic dictionaries and gazetteers, which were used to produce a lemmatized search engine.

A second work package consisted in training and applying a robust “optical model”, capable of dealing with the variability and abbreviations of medieval, multilingual handwriting. The existing editions were first “aligned” with the available “text images” at line level, applying techniques developed by the partners in the Oriflamms project (Leydier et al., 2014; Stutzmann et al., 2015; Bluche et al., 2016; Oriflamms, 2017). Based on this alignment and using deep neural networks (CNN/RNN), machines could learn to “read” (Bluche et al., 2017). Trained on the monumental, modernizing, and heavily regularizing edition by P. Guérin (normalized punctuation, expanded abbreviations) (Guérin and Celier, 1881), the system created so-called “character lattices” which included abbreviations, so that it was also able to read and expand abbreviations (fig. 1).

Figure 1: Text indexing. The query “[Saint Omer]” retrieves both abbreviated and unabbreviated strings in different volumes and different handwritings. Images: Paris, Archives Nationales, JJ 35 and JJ 164.

The decoding process produces different “hypotheses” for each spot on the image (typically from one to ten variant readings) and rates them by confidence level, according to internal statistical models and linguistic ones. The index has been “pruned” (i.e. reduced by removing the most unlikely readings), but still contains more than 28 bn index entries, 3 bn lines, 44 bn “words” and “pseudo-words”.
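The pruning step can be illustrated with a minimal sketch. It is not the HIMANIS implementation: the `Hypothesis` record, the `prune` function, the threshold values and the sample readings are all hypothetical, chosen only to show how unlikely readings might be dropped while several variants per spot are kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One candidate reading for a spot on a page image (hypothetical structure)."""
    page: str          # page identifier (invented for this example)
    spot: int          # index of the image spot on that page
    word: str          # candidate reading
    confidence: float  # score in [0, 1]


def prune(hypotheses, min_confidence=0.05, max_per_spot=10):
    """Keep, per spot, at most `max_per_spot` readings scoring at least `min_confidence`."""
    by_spot = {}
    for h in hypotheses:
        if h.confidence >= min_confidence:
            by_spot.setdefault((h.page, h.spot), []).append(h)
    pruned = []
    for key in sorted(by_spot):
        # Rank the surviving readings of one spot from most to least confident.
        ranked = sorted(by_spot[key], key=lambda h: h.confidence, reverse=True)
        pruned.extend(ranked[:max_per_spot])
    return pruned


# Invented sample data: two spots with competing readings.
raw_index = [
    Hypothesis("JJ35_p012", 0, "saint", 0.91),
    Hypothesis("JJ35_p012", 0, "sainte", 0.06),
    Hypothesis("JJ35_p012", 0, "samt", 0.01),
    Hypothesis("JJ35_p012", 1, "omer", 0.72),
    Hypothesis("JJ35_p012", 1, "onier", 0.02),
]
pruned_index = prune(raw_index)
for h in pruned_index:
    print(h.page, h.spot, h.word, h.confidence)
```

In this sketch a spot may keep more than one reading, which is what lets a later query match a less likely but correct variant.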

The search engine was published online as a beta version (http://prhlt-kws.prhlt.upv.es/himanis/) and is being transferred to https://himanis.org/, where images and texts are accessible through the IIIF protocol and as IIIF annotations (fig. 2).

Figure 2: Indexed words as IIIF annotation in the IIIF compliant viewer Mirador (image: Paris, Archives Nationales, JJ 137, page 14)
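A word-level annotation of the kind shown in Figure 2 could be serialized along the following lines. This is only a sketch following the W3C Web Annotation Data Model; the exact annotation format served by HIMANIS is not specified here, and the function name, the `example.org` identifiers, the chosen motivation and the coordinates are assumptions made for illustration.

```python
import json


def word_annotation(anno_id, canvas_id, word, x, y, w, h):
    """Build a Web Annotation attaching `word` to a rectangle on a canvas."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        # Assumption: the motivation actually used by the project is not documented here.
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": word, "format": "text/plain"},
        # The media fragment #xywh=x,y,width,height selects a region of the canvas.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


# Hypothetical identifiers and coordinates, for illustration only.
annotation = word_annotation(
    anno_id="https://example.org/himanis/anno/JJ137-p14-w1",
    canvas_id="https://example.org/himanis/canvas/JJ137-p14",
    word="pelerinage",
    x=120, y=340, w=210, h=48,
)
print(json.dumps(annotation, indent=2, ensure_ascii=False))
```

Targeting a region of the canvas rather than the image file keeps the annotation independent of the image resolution delivered by the server.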

As in the tranScriptorium model, users can set the confidence level for the search to reduce noise or to maximize the number of hits. This functionality corresponds to the trade-off between the two standard performance measures of information retrieval: “precision” (the proportion of correct occurrences in the hit list) and “recall” (the proportion of occurrences retrieved, compared to an error-free edition).
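The effect of the confidence setting on precision and recall can be sketched as follows. The function `precision_recall`, the spot identifiers and the scores are hypothetical and are not drawn from the HIMANIS evaluation. Returning 1.0 when nothing is retrieved is a convention adopted for this example.

```python
def precision_recall(hits, relevant, threshold):
    """Compute precision and recall for the hits scoring at least `threshold`.

    hits:     list of (spot_id, confidence) pairs returned by the index
    relevant: set of spot_ids that truly contain the queried word
    """
    retrieved = {spot for spot, confidence in hits if confidence >= threshold}
    correct = retrieved & relevant
    precision = len(correct) / len(retrieved) if retrieved else 1.0
    recall = len(correct) / len(relevant) if relevant else 1.0
    return precision, recall


# Invented example: five candidate spots, four true occurrences (one never proposed).
hits = [("a", 0.95), ("b", 0.80), ("c", 0.40), ("d", 0.30), ("e", 0.10)]
relevant = {"a", "b", "d", "f"}

strict = precision_recall(hits, relevant, threshold=0.5)   # (1.0, 0.5)
loose = precision_recall(hits, relevant, threshold=0.2)    # (0.75, 0.75)
```

With the strict threshold every hit is correct but half of the occurrences are missed; lowering the threshold recovers one more occurrence at the cost of one false hit.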

Qualitative and quantitative measures demonstrate that the HIMANIS system achieves a very high level of precision, more than 85%, increasing to 99% through lemmatization (Stutzmann, 2017a-c) (fig. 3).

Figure 3: Measuring the performance of text indexing (Recall/Precision Evaluation)

4. Additional challenges: writer identification, granularity, crowdsourcing…

In parallel to HTR, another focus was automated writer identification. It allows a preliminary, nonetheless novel, analysis of the organization of the French chancery. These are among the first measured and convincing results produced for medieval handwriting, which is not represented in international competitions (Fiel et al., 2017; Andreu Sánchez et al., 2017). Based on the Quill feature (Brink et al., 2012) and validated on a partial ground-truth established by a paleographer, the system clustered hitherto unstudied page images, attributing them to 204 hypothetical writers. Fig. 4 illustrates the calculated presence of writers in each volume, giving a first insight into possible collaborations between scribes within the chancery across time.

Figure 4: Writer Identification: Visualization of the different “hands” and their likeliness in the different volumes (representation of Gaussian standard deviation for style clustering, with 204 clusters)
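How a per-volume view of writers can be derived from page-level cluster assignments is sketched below. This is a deliberate simplification: it assumes one hard cluster label per page, whereas Figure 4 represents likelihoods. The function `writer_presence`, the cluster numbers and the page assignments are invented for illustration.

```python
from collections import Counter, defaultdict


def writer_presence(page_assignments):
    """Return {volume: {cluster: share of that volume's pages}}.

    page_assignments: iterable of (volume, writer_cluster) pairs, one per page.
    """
    counts = defaultdict(Counter)
    for volume, cluster in page_assignments:
        counts[volume][cluster] += 1
    return {
        volume: {cluster: n / sum(c.values()) for cluster, n in c.items()}
        for volume, c in counts.items()
    }


# Invented assignments: (volume shelfmark, hypothetical writer cluster).
pages = [
    ("JJ 35", 12), ("JJ 35", 12), ("JJ 35", 47), ("JJ 35", 12),
    ("JJ 164", 47), ("JJ 164", 103),
]
presence = writer_presence(pages)

# Clusters found in both volumes hint at a scribe active across them.
shared = set(presence["JJ 35"]) & set(presence["JJ 164"])
```

A cluster shared by several volumes is only a hypothesis about a common scribe and would still need validation by a paleographer.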

The partners are tackling additional challenges. The text and structure of registers containing multiple charters require combining different granularities and intertwining both physical (page) and intellectual levels (one or several charters on one or several pages). The integration of authority data and gazetteers allows new forms of access; but since errors are possible in text indexing and in the identification of named entities, measuring the applicability and usefulness of text indexing is an important new methodological task. Crowdsourcing results are currently negatively biased because of the implemented ergonomics and users’ strategies: users tend to “suggest corrections” in order to improve their future search experience, rather than to validate correct spots (fig. 5). Nevertheless, crowdsourcing helps to measure impact, adequacy, precision, and usefulness.

Figure 5: Crowdsourcing as biased feedback

5. Perspectives

The results of HTR are setting a new standard and make the digitized images a new source. Yet they must obviously not be mistaken for a scholarly edition. HIMANIS takes part in the current trend of Digital Humanities and uses images as data, both for text and for writer identification (Kestemont et al., 2017). Deep indexing represents a new challenge for “distant reading” addressing topics and charter contents, because the number of hypotheses is not the same for all image spots. In the newly funded projects HOME (History Of Medieval Europe) and HORAE (Hours: Recognition, Analysis, Editions), the partners are working on a methodology to create truthful and trustworthy results from uncertain and uneven, automatically generated data.


Appendix A

Bibliography
  1. Andreu Sánchez, J., Romero, V., Toselli, A. H., Villegas, M. and Vidal, E. (2017). ICDAR2017 Competition on Handwritten Text Recognition on the READ Dataset. 14th IAPR International Conference on Document Analysis and Recognition. ICDAR 2017. Kyoto: CPS, pp. 1383–88 doi:10.1109/ICDAR.2017.226.
  2. Bluche, T., Hamel, S., Kermorvant, C., Puigcerver, J., Stutzmann, D., Toselli, A. H. and Vidal, E. (2017). Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection in the HIMANIS Project. 14th IAPR International Conference on Document Analysis and Recognition. ICDAR 2017. Kyoto: CPS, pp. 312–17 doi:10.1109/ICDAR.2017.59.
  3. Bluche, T., Stutzmann, D. and Kermorvant, C. (2016). Automatic Handwritten Character Segmentation for Paleographical Character Shape Analysis. 2016 12th IAPR Workshop on Document Analysis Systems (DAS). pp. 42–47 doi:10.1109/DAS.2016.74.
  4. Brink, A. A., Smit, J., Bulacu, M. L. and Schomaker, L. R. B. (2012). Writer identification using directional ink-trace width measurements. Pattern Recognition, 45(1): 162–71 doi:10.1016/j.patcog.2011.07.005.
  5. Clement, T., Steger, S., Unsworth, J. and Uszkalo, K. (2008). How Not To Read A Million Books Harvard University, Cambridge, MA http://www.people.virginia.edu/~jmu2m/hownot2read.html (accessed 25 March 2017).
  6. Crane, G. (2006). What Do You Do with a Million Books?. D-Lib Magazine, 12(3) doi:10.1045/march2006-crane. http://www.dlib.org/dlib/march06/crane/03crane.html (accessed 25 March 2017).
  7. European Commission (2016). CORDIS : Projects & Results Service : Recognition and Enrichment of Archival Documents CORDIS Community Research and Development Information Service http://cordis.europa.eu/project/rcn/198756_en.html (accessed 14 October 2016).
  8. Fiel, S., Kleber, F., Diem, M., Christlein, V., Louloudis, G., Stamatopoulos, N. and Gatos, B. (2017). ICDAR2017 Competition on Historical Document Writer Identification (Historical-WI). 14th IAPR International Conference on Document Analysis and Recognition. ICDAR 2017. Kyoto: CPS, pp. 1377–82 doi:10.1109/ICDAR.2017.225.
  9. GDELT Project (2015). Google BigQuery + 3.5M Books: Sample Queries GDELT Blog https://blog.gdeltproject.org/google-bigquery-3-5m-books-sample-queries/ (accessed 23 November 2017).
  10. Guérin, P. and Celier, L. (1881). Recueil des documents concernant le Poitou contenus dans les registres de la chancellerie de France. 14 vols. (Archives historiques du Poitou). Poitiers: Société des archives historiques du Poitou http://gallica.bnf.fr/ark:/12148/bpt6k209478j (accessed 25 April 2014).
  11. Guyotjeannin, O. and Lusignan, S. (2011). Introduction au formulaire d’Odart Morchesne Le Formulaire d’Odart de Morchesne http://elec.delisle.enc.sorbonne.fr/morchesne/ (accessed 23 November 2017).
  12. Kestemont, M., Christlein, V. and Stutzmann, D. (2017). Artificial Paleography: Computational Approaches to Identifying Script Types in Medieval Manuscripts. Speculum, 92(S1): S86–109 doi:10.1086/694112.
  13. Leydier, Y., Eglin, V., Bres, S. and Stutzmann, D. (2014). Learning-Free Text-Image Alignment for Medieval Manuscripts. 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR). pp. 363–68 doi:10.1109/ICFHR.2014.67.
  14. McNicholl, R. and Miles-Board, T. (2015). tranScriptorium : Computer-Aided, Crowd-Sourced Transcription of Handwritten Text (for Repositories). 10th International Conference on Open Repositories (OR2015).
  15. Moretti, F. (2013). Distant Reading. London ; New York: Verso.
  16. Odart Morchesne (2005). Le Formulaire d’Odart Morchesne : Dans La Version Du Ms BnF Fr. 5024. (Ed.) Guyotjeannin, O. & Lusignan, S. (Mémoires et Documents de l’École Des Chartes 80). Paris: École des chartes.
  17. Oriflamms, C. (2017). Compte-rendu final du projet ORIFLAMMS / ORIFLAMMS Final report Billet Écriture Médiévale & Numérique | Écritures Médiévales et Lecture Numérique. Carnet Du Projet ORIFLAMMS (Ontology Research, Image Features, Letterform Analysis on Multilingual Medieval Scripts) http://oriflamms.hypotheses.org/1592 (accessed 25 May 2017).
  18. Stutzmann, D. (2017a). 99 pilgrimages / 99 ‘pelerinage(s)’: accuracy and historical research Billet Himanis https://himanis.hypotheses.org/171 (accessed 23 November 2017).
  19. Stutzmann, D. (2017b). Saint-Omer in the registers of the French royal chancery (99.6% precision!) Billet Himanis https://himanis.hypotheses.org/195 (accessed 23 November 2017).
  20. Stutzmann, D. (2017c). The Royal Highness / L’altesse royale / regalis celsitudo : une thématique de préambule pour mesurer l’utilité d’HIMANIS Billet Himanis https://himanis.hypotheses.org/246 (accessed 23 November 2017).
  21. Stutzmann, D., Bluche, T., Lavrentiev, A., Leydier, Y. and Kermorvant, C. (2015). From Text and Image to Historical Resource: Text-Image Alignment for Digital Humanists. Sydney http://dh2015.org/abstracts/xml/STUTZMANN_Dominique_From_Text_and_Image_to_Histor/STUTZMANN_Dominique_From_Text_and_Image_to_Historical_R.html (accessed 29 June 2015).
  22. Stutzmann, D., Moufflet, J.-F. and Hamel, S. (2017). La recherche en plein texte dans les sources manuscrites médiévales : enjeux et perspectives du projet HIMANIS pour l’édition électronique = Full Text Search in Medieval Manuscripts : Issues and Perspectives of the HIMANIS Project for Electronic Publishing. Médiévales : Langue, Textes, Histoire, 73.