Building a Community Driven Corpus of Historical Newspapers

Claudia Resch (claudia.resch@oeaw.ac.at), Austrian Academy of Sciences, Austria; Dario Kampkaspar (dario.kampkaspar@oeaw.ac.at), Austrian Academy of Sciences, Austria; Daniela Fasching (daniela.fasching@oeaw.ac.at), Austrian Academy of Sciences, Austria; Vanessa Hannesschläger (vanessa.hannesschlaeger@oeaw.ac.at), Austrian Academy of Sciences, Austria; Daniel Schopper (daniel.schopper@oeaw.ac.at), Austrian Academy of Sciences, Austria

Faced with the challenge of organizing the digital processing and publication of a large collection of historical newspaper data from the 18th-century publication known as the Wien[n]erisches Diarium, a small project located at the Austrian Centre for Digital Humanities (ACDH) in Vienna has opted for a user-centred, participatory approach. It employs methods of community involvement to tackle the specific challenges that arise from the particular qualities of the historical source material.

Founded in 1703, the newspaper under investigation is among the oldest periodicals still published today, and was regarded as the most important newspaper of the Habsburg Monarchy for a considerable part of the 18th century. The value and significance of the newspaper as a source is undeniable, not only due to the density of the information it contains, but also because of the virtually gapless preservation of its run from its foundation in 1703 up until today and the full availability of these original sources. So far, no computer-based processing of this historical data cache has been undertaken. The ACDH project aims to facilitate the use of the source in a digital environment and to create a cornerstone resource, making the Diarium freely and easily available to researchers everywhere.

The more than 10,000 issues from the 18th century constitute a mass of text and data. As resources are limited, the project had to select a number of issues that is manageable within its run. For now, the project will thoroughly edit a corpus of approximately 500 issues from all decades of the 18th century. The priority is the quality of the data and the creation of a reliable handwritten text recognition (HTR) model that will improve automatic processing and pave the way for expanding or completing the existing corpus at a later point.

As not all queries and research questions that may be posed to the sources can be anticipated, it is the project’s primary aim to secure and process the full text of the newspaper in a way that does not disregard or omit any of the relevant information – regardless of the querying researcher’s field or discipline. In order to determine which aspects are of particular relevance, where the interests of different disciplinary fields overlap, and how the issues should be prepared and presented to make them useful for the largest number of (academic) users, the digitization project has devised a way to work closely with researchers from various backgrounds.

The project’s community-driven approach invites and relies on participation on several levels, effectively allowing future users to follow, accompany and shape the project throughout its duration. The following three methods of user involvement were or are being employed in the digitization and annotation process:

1) In spring 2017, a call for nominations promoted via digital channels and the print version of the newspaper provided an opportunity for prospective users to nominate specific issues or sets of issues for digitization.

2) While the text recognition process does not involve users, the project team nevertheless upholds the principle of transparency by allowing users to track the progress of the procedure: a reporting tool developed for this purpose, accessible via the project website, provides a current list of the issues selected for processing and allows users to follow the daily progress in real time.

3) A series of community-driven annotate-a-thons allows the project team to survey and adapt to the user community’s needs. Consulted as experts and prospective users, (peer) researchers are involved in the annotation process early on and contribute specialised knowledge to the enrichment of the data.

To ensure users’ ongoing engagement with the texts even beyond the initial phase and to provide a way to preserve and publicize the results, the platform has been designed with continuous annotation activities in mind. Any user will be able to make annotations and contribute to the encoding source via the web-app, which will support four basic types of annotations: 1) full text, 2) named entity identification, 3) text or layout corrections, and 4) semantic or structural annotations.
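As an illustration only, the four annotation types named above could be modelled along the following lines. This is a minimal sketch under stated assumptions, not the project's actual data model: the class names, the field names, the character-offset addressing and the `issue-0001` style identifier are all hypothetical and do not come from the project.

```python
from dataclasses import dataclass
from enum import Enum


class AnnotationType(Enum):
    """The four basic annotation types listed in the text."""
    FULL_TEXT = "full text"
    NAMED_ENTITY = "named entity identification"
    CORRECTION = "text or layout correction"
    SEMANTIC_STRUCTURAL = "semantic or structural annotation"


@dataclass(frozen=True)
class Annotation:
    """One user contribution attached to a span of an issue's text.

    All fields are assumptions made for this sketch.
    """
    issue_id: str          # hypothetical identifier of a newspaper issue
    start: int             # character offset where the span begins (inclusive)
    end: int               # character offset where the span ends (exclusive)
    kind: AnnotationType   # which of the four types this annotation is
    body: str              # the annotation content, e.g. an entity label
    author: str            # hypothetical user name of the contributor

    def __post_init__(self):
        # Reject empty or negative spans early.
        if not 0 <= self.start < self.end:
            raise ValueError("annotation span must be non-empty and non-negative")


def group_by_type(annotations):
    """Return a dict mapping every AnnotationType to its annotations."""
    grouped = {kind: [] for kind in AnnotationType}
    for annotation in annotations:
        grouped[annotation.kind].append(annotation)
    return grouped
```

Grouping contributions by type in this way would, for example, let a reviewer look at all proposed text or layout corrections separately from named entity annotations.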

In pioneering a user-centred approach in the development of a digital newspaper resource, the Diarium project generates new insights into the potential of community involvement for similar projects. It road-tests methods for motivating both digital and ‘traditional’ humanities researchers to contribute to a collaborative resource and for creating highly sustainable and re-usable resources that will meet the needs of diverse user communities and encourage ongoing engagement.