OCR’ing and classifying Jean Desmet’s business archive: methodological implications and new directions for media historical research
This paper discusses the endeavours of the research project MIMEHIST: Annotating EYE’s Jean Desmet Collection (2017-2018), funded by the Netherlands Scientific Research Organisation, to apply optical character recognition (OCR) and various computer vision techniques to the business archive of film distributor and exhibitor Jean Desmet (1875-1956).
The Desmet collection consists of approximately 950 films produced between 1907 and 1916, a business archive containing around 127,000 documents, some 1050 posters and around 1500 photos. The collection is unique because of its large number of rare films from the transitional years of silent cinema, and because of the richness of its business archive, which holds extensive documentation of early film exhibition and distribution practices in the 1910s. These features contribute to its immense historical value, which was one of the main reasons why it was inscribed on UNESCO’s Memory of the World Register in 2011.
By OCRing and classifying Jean Desmet’s business archive, MIMEHIST will allow scholars to browse and annotate its documents, all scanned in high resolution, in the new ‘Media Suite’ of the Dutch national research infrastructure (CLARIAH). The results will be integrated in a search interface enabling media historians to identify word frequencies and topics as a basis for research on early film distribution and exhibition. This, the paper argues, opens up avenues for media historical research that productively builds on and expands the collection’s use in previous scholarship.
Throughout the past decades, Desmet’s business documents have offered a rich source for socio-economic cinema history. Media historians such as Karel Dibbets and Rixt Jonkman have studied parts of the collection’s (related) data by manually transcribing and organising it into databases (Jonkman, 2007; Dibbets, 2010). This work produced an empirical, quantitative foundation for network analysis of Dutch film distribution and exhibition in cinema’s earliest years. However, this research also made evident that the archive is too large and diverse to organise and transcribe manually. A particular challenge is that the collection contains many different kinds of documents: personal letters, business letters, records of film rentals, postcards, newspaper clippings, telegrams, scraps of paper with notes, photographs, etc. Furthermore, some documents are typewritten, others handwritten.
To allow scholars to research and annotate larger amounts of the archival documents’ data in CLARIAH’s Media Suite, automated information extraction from the documents seemed challenging yet promising. MIMEHIST took up this challenge by applying OCR, document classification, topic modelling, named entity recognition and other visual and linguistic tools to the set of scans in order to extract as much data and metadata from the individual documents as possible. Different document types required different treatment. For instance, we quickly determined that it did not make much sense to run OCR on a tiny handwritten note, whereas handwriting detection would be possible and could yield productive results on such an item.
Experiments were conducted in visual document classification, visual document analysis and distant reading. Visual document classification was performed by clustering a combination of color and texture histograms derived from the scans. This step was taken mostly because the existing index of the archive is incomplete: it has information on the folders in the archive, which contain the documents, but not on the documents themselves. The Media Suite works with individual documents, not folders, so it became necessary to, for instance, distinguish sub-folder covers from the documents inside.
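The clustering step described above can be illustrated with a minimal sketch. It is not the project's implementation: the bin counts, the gradient-based texture descriptor, the plain k-means routine and the tiny synthetic "scans" are all assumptions chosen to keep the example self-contained.

```python
import math

def color_histogram(pixels, bins=4):
    """Normalised joint RGB histogram; pixels are (r, g, b) tuples in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    return [h / (len(pixels) or 1) for h in hist]

def texture_histogram(rows, bins=4):
    """Histogram of grey-level differences between horizontally adjacent pixels.
    This is an assumed, very crude stand-in for a texture descriptor."""
    hist, count = [0.0] * bins, 0
    for row in rows:
        gray = [sum(p) / 3 for p in row]
        for a, b in zip(gray, gray[1:]):
            hist[min(int(abs(a - b) * bins / 256), bins - 1)] += 1
            count += 1
    return [h / (count or 1) for h in hist]

def features(rows):
    """Concatenate the color and texture histograms into one feature vector."""
    pixels = [p for row in rows for p in row]
    return color_histogram(pixels) + texture_histogram(rows)

def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors serve as initial centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def flat_scan(color, size=8):
    """Synthetic stand-in for a uniform folder cover."""
    return [[color] * size for _ in range(size)]

def striped_scan(paper, ink, size=8):
    """Synthetic stand-in for a densely written page: alternating paper and ink."""
    return [[ink if x % 2 else paper for x in range(size)] for _ in range(size)]

# Invented toy data: two "covers" and two "letters", interleaved.
scans = [
    flat_scan((150, 110, 70)),
    striped_scan((240, 240, 230), (20, 20, 20)),
    flat_scan((160, 120, 80)),
    striped_scan((250, 250, 240), (30, 30, 30)),
]
labels = kmeans([features(s) for s in scans], k=2)
print(labels)
```

On real scans the histograms would be computed with an image library and the number of clusters would have to be chosen empirically; the sketch only shows how combined color and texture features can separate visually distinct document types without any existing index.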
A second reason to do classification was that each type of document needs a different kind of processing: typed letters can be OCR-ed, but not photos, while handwritten letters could be classified by comparing handwriting styles. By separating different document types it became possible to employ the most effective information extraction tools on them. This procedure also allowed for finding visually similar documents, making it possible for researchers to look for similarities in, for instance, texture or color.
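The idea of routing each document type to its own processing steps can be sketched as a simple lookup table. The type labels, step names and the fallback below are hypothetical placeholders for illustration, not the project's actual categories or tools.

```python
# Hypothetical mapping from a document type (as assigned by the visual
# classification) to the processing steps that make sense for it.
PIPELINES = {
    "typed_letter": ["ocr", "text_classification", "named_entity_recognition"],
    "handwritten_letter": ["handwriting_style_comparison"],
    "photo": ["visual_similarity_indexing"],
    "folder_cover": ["link_to_folder_index"],
}

# Assumed fallback for types that have no automated treatment.
DEFAULT_PIPELINE = ["manual_review"]

def plan_processing(doc_type):
    """Return the list of processing steps for a document type."""
    return list(PIPELINES.get(doc_type, DEFAULT_PIPELINE))

for doc_type in ("typed_letter", "photo", "telegram"):
    print(doc_type, "->", plan_processing(doc_type))
```

Keeping the routing in a table rather than in branching code makes it easy to revise as experiments show which tools work on which kind of document.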
The typewritten documents were OCR-ed, then classified on the basis of the recognized text in order to differentiate, for example, personal letters from business correspondence. Named entity recognition on the texts provided us with a network of people and places, with links to the letters. Attempts at handwriting recognition on the basis of ‘image texture’ histogram comparisons provided mixed results. For the instances where larger samples of a single person’s handwriting were available it worked reasonably well, but for handwriting types occurring only a few times the confidence of the classifier was too low, and such documents were classified as one of the more frequently occurring types. The results of these steps, in combination with the existing index’s metadata, provided a rich enough metadata structure for the use of individual documents in the tool.
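How recognised entities can be turned into a network that links back to the letters may be sketched as follows. The input format, the document identifiers and the placeholder entity names are invented for illustration; the co-occurrence rule (two entities are linked when they appear in the same letter) is an assumption, not necessarily the project's approach.

```python
from collections import defaultdict
from itertools import combinations

# Invented placeholder output of a named entity recogniser:
# document id -> list of (entity name, entity type).
letters = {
    "doc_001": [("Person A", "PERSON"), ("City X", "PLACE")],
    "doc_002": [("Person A", "PERSON"), ("Person B", "PERSON"), ("City X", "PLACE")],
    "doc_003": [("Person B", "PERSON")],
}

def build_network(letters):
    """Link every pair of entities that co-occur in a letter.
    Each edge keeps the set of letters that support it."""
    edges = defaultdict(set)
    for letter_id, entities in letters.items():
        names = sorted({name for name, _ in entities})
        for a, b in combinations(names, 2):
            edges[(a, b)].add(letter_id)
    return edges

network = build_network(letters)
for (a, b), sources in sorted(network.items()):
    print(f"{a} -- {b}: {sorted(sources)}")
```

Storing the supporting letters on each edge means that a researcher browsing the network can always return to the underlying documents.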
In addition to a discussion of these steps, our paper reflects on the results’ epistemological implications for future research, by discussing them in relation to previous quantitative approaches to the Desmet Collection. From this vantage point, our paper argues that while previous quantitative studies of Desmet’s business documents were premised on the coding and transcription procedures of Cliometrics and Annales historiography, MIMEHIST’s results nurture exploratory and qualitative research coupled with serendipitous search and annotation procedures that also focus on visual features. Consequently, the paper argues, researchers may to a greater degree than hitherto highlight data contingencies and the multiplicity of viewpoints in the Desmet business archive.