READ Workbench – Corpus Collaboration and TextBase Avatars

Ian McCrabb (ian@prakas.org), University Of Sydney, Australia

The Research Environment for Ancient Documents (READ) project commenced in 2013 with development support from a consortium of institutions (University of Washington in Seattle, Ludwig Maximillian University in Munich, University of Lausanne, University of Sydney and Prakaś Foundation) involved in the study and publication of ancient Buddhist documents preserved in Gāndhārī.

READ has been developed as a comprehensive multi-user platform for the transcription, translation and analysis of ancient Sanskrit, Gāndhārī, Pali and other Prakrit texts: manuscripts, inscriptions, coins and other documents. It is based on open source software (Postgres, PHP and JQuery), supports the TEI standard and provides an API for integration with related systems. READ is positioned as a research environment, complementary to existing textual repositories and integrated with existing dictionaries. Existing transcriptions can be imported, elaborated upon, analyzed, and then published as research output in standards-based formats. The defining innovation of READ is atomization to a semantically linked network of objects; a paradigm shift in data structure from strings of marked-up text to sequences of linked objects.

The underlying design and entity model was presented at both the Digital Humanities conference in 2015 and the 2016 Australian Digital Humanities Conference. READ has been publicly released and is supporting a wide range of corpus development projects. Whilst this presentation will follow on to briefly precis the range of research projects currently supported by READ, the focus will be on a related platform. READ Workbench (Workbench) is a server portal hosted at the University of Sydney since 2016 to ‘harness’ READ. Developed using the same technology stack as READ, it is a comprehensive management framework to support the integration of people and processes in the collaborative development, maintenance and publishing of textual corpora.

The design of Workbench evolved organically as the implementation requirements of READ expanded from a single researcher working on a single text, to the capacity to support multiple projects, each with a team of researchers collaborating on the development of an integrated corpora, with widely divergent research objectives. The fundamental objective was to design a supporting framework with which manage the balance between autonomy and collaboration in large scale projects. The approach adopted was to implement strategies, models and workflow patterns consistent with those applied in the IT consulting industry to digital content design, development and migration projects.

Workbench is a software as a service (SaaS) platform managing multiple READ installations, each with project and language specific configurations, supporting researcher collaborations across multiple institutions. It provides a self-service portal for researchers to develop, maintain, manage and publish texts without requiring technical support or the mediation of a database administrator; critical to the longer-term sustainability of corpora projects. Workbench’s three facets (configuration management, database management and corpus workflows) provide a scalable implementation architecture for READ and instantiate a comprehensive corpus development methodology.

Whilst the configuration management services might be bracketed as conventional for a SaaS platform, database management is predicated upon an architectural innovation that enables researchers to build, share, manage, maintain and publish their own texts. The adoption of a single text/single database (TextBase) as the fundamental object of development, collaboration and portability is quite a departure from conventional models where a centralized administrator manages a single monolithic corpus database.

This TextBase architecture underpins a corpus development, analysis and publishing methodology that provides significant flexibility in terms of the iteration and synchronization of three fundamental workflows: text alignment, analysis registration and text aggregation.

The text alignment process integrates image, text and model configuration data to automatically generate a database. This approach allows for the distribution of responsibility to specialists and integration of their research output to align the image and the transcription at their most atomized to generate a ‘substrate’. Rather than requiring researchers to command exacting mark-up schemas, substrate databases can be automatically generated from Word processing and Spreadsheet inputs. Workbench enables each of the specialist roles to work independently and their contributions be separately managed and integrated, ameliorating project risk by minimizing dependencies and bottlenecks.

The analysis registration process synchronizes analysis data with an existing substrate. This approach allows a researcher to work independently and externally to READ in developing their own analysis ‘strata’ and then register that strata on a substrate. Grammatical analysis, translation, semantic, syntactic and structural analysis can all be independently developed and iteratively registered. Researchers from other disciplines can develop and register their own analysis (archaeological, historical etc.). Each stratum is registered on a particular substrate (an edition) of the text within a TextBase, is separately owned and attributed, and its visibility is controlled by the researcher registering it.

The text aggregation process allows individual researchers to work and innovate in private to the point where they elect to participate in research collaborations or their text is ready for publishing. A TextBase might be aggregated with others to form a ‘sequenced’ collection, a ‘mapped’ collection or a ‘merged’ corpus; a continuum expressing an increasing degree of synthesis and harmonization of analysis ontologies and methodological standards. Researchers may contribute their TextBase to any number of aggregates. This approach allows a researcher to align a TextBase with the analysis standards of an established corpus as a predicate to participation as a constituent of that merged aggregate. In parallel, that same TextBase might be mapped to the analysis ontology of an entirely different collection. The potential exists for the same TextBase substrate to manifest as a constituent of separate aggregates, with alternative configurations of registered analysis strata, supporting widely divergent (aggregate specific) research objectives; the emanation of multiple TextBase avatars.

The strategy adopted with Workbench was to design a solution architecture within which to reframe some of the ubiquitous issues in the conventional corpus development model; ownership, control, confidentiality, innovation, standardization, portability, resourcing and support. The critical innovation in maximizing development flexibility and in balancing autonomy and collaboration across the range of individual, collection and corpora development projects is the TextBase; the target of text alignment, the substrate for registration of analysis and the object aggregated.