Expanding the Research Environment for Ancient Documents (READ) to Any Writing System

Andrew Glass (asg@uw.edu), Microsoft Corp., University of Washington

The Research Environment for Ancient Documents (READ) is an integrated open-source web platform for epigraphical and manuscript research. The original goal of the READ platform was to support scholars in researching and presenting studies of handwritten documents and inscriptions preserved in the Gāndhārī language using the Kharoṣṭhī script. Since many of the workflows supported by READ are common to epigraphic and manuscript studies in other textual traditions, we wanted to investigate how READ could be generalized to support other writing systems. This presentation will share the results of that investigation with examples from English, Aramaic, Chinese, and Mayan.

Three core components of the READ data model depend on the writing system used by the source material:

  1. The link between physical and textual data
  2. The constraint mechanism that allows a user to edit text without disrupting links
  3. The sort weight API that allows data in the model to be displayed in an expected sort order

Part One. The database model underlying READ was designed to reflect the separate components and layers of interpretation that manuscript scholars and epigraphers typically use in their research (letter forms => paleography; graphemic units => phonology; inflectional forms => morphology, etc.). Furthermore, the model recognizes a continuum of factual confidence, beginning from statements of fact (e.g., the name of a collection in which an item is kept) and extending to data which may have multiple or variant interpretations (e.g., the transcription of a sample of writing). Such variant data is linked back through the model to original facts. At the crux of this system of links are the references between segments on an image, each containing an orthographic unit in the writing system, and the transcription of that unit. Because READ was originally developed for Kharoṣṭhī, an alphasyllabary or abugida-type writing system, this link maps image segments to syllable clusters. Other writing systems can be supported by mapping the syllable cluster to the appropriate orthographic units. This has been tested by mapping the syllable cluster to the following units: English letters, Aramaic syllables, Chinese logographs, and Mayan syllables and logographs.
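The link described in Part One can be pictured with a minimal sketch. The class and field names below (Segment, OrthographicUnit, kind, units_on_image) are hypothetical and chosen only for illustration; they are not taken from the READ schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """A region of an image that contains one orthographic unit (hypothetical shape)."""
    image_id: str
    polygon: Tuple[Tuple[int, int], ...]


@dataclass
class OrthographicUnit:
    """Generalized 'syllable cluster': a letter, a syllable, or a logograph."""
    unit_id: int
    kind: str           # e.g. "letter", "syllable", "logograph"
    transcription: str  # one interpretation of what the segment shows
    segments: List[Segment] = field(default_factory=list)


def link(unit: OrthographicUnit, segment: Segment) -> OrthographicUnit:
    """Record that the given image segment is transcribed by this unit."""
    unit.segments.append(segment)
    return unit


def units_on_image(units: List[OrthographicUnit], image_id: str) -> List[OrthographicUnit]:
    """Return the units that have at least one segment on the given image."""
    return [u for u in units if any(s.image_id == image_id for s in u.segments)]
```

Under this assumption, only the kind of unit changes between writing systems; the segment-to-unit link itself stays the same.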

Part Two. READ is intended to be a working environment for born-digital text editions. A critical feature of the model is that links created within the system must be preserved during repeated editing. The editing interface allows users to modify linked syllable clusters. By constraining edits to valid transcriptions of a syllable cluster defined for the language, READ can keep track of user edits and prevent links from being broken. Other writing systems can be supported by defining the valid transcription forms for the orthographic units. In most cases this is less complex than for akṣara-based writing systems. This has been tested by defining valid orthographic units as follows: English – Consonants, Vowels; Aramaic – Consonants, Vowels, Consonant with modifier; Chinese – Logograph; Mayan – Logograph, CV syllable. All systems also permit orthographic units to be Digits and Punctuation signs.
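The constraint idea in Part Two can be sketched for the English case as follows. This is an illustrative assumption about how such a check might look, not READ's actual editor logic; classify_english and apply_edit are hypothetical names, and units are modeled simply as (unit_id, transcription) pairs.

```python
import string
from typing import Callable, List, Optional, Tuple

VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_lowercase) - VOWELS

Unit = Tuple[int, str]  # (unit_id, transcription); the id stands in for the link


def classify_english(text: str) -> Optional[str]:
    """Return the unit type for a valid English orthographic unit, else None."""
    if len(text) != 1:
        return None
    ch = text.lower()
    if ch in VOWELS:
        return "Vowel"
    if ch in CONSONANTS:
        return "Consonant"
    if ch.isdigit():
        return "Digit"
    if ch in string.punctuation:
        return "Punctuation"
    return None


def apply_edit(units: List[Unit], index: int, new_text: str,
               classify: Callable[[str], Optional[str]] = classify_english) -> List[Unit]:
    """Replace one transcription while keeping its unit id, so links survive.

    Edits that are not a single valid unit are rejected rather than applied.
    """
    if classify(new_text) is None:
        raise ValueError(f"{new_text!r} is not a valid orthographic unit")
    unit_id, _ = units[index]
    edited = list(units)
    edited[index] = (unit_id, new_text)
    return edited
```

Another writing system would, under the same assumption, supply its own classify function (for example, one that accepts a logograph or a CV syllable) while apply_edit stays unchanged.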

Part Three. READ uses custom sort tables to weight the orthographic units and subunits used by the model. Custom sort tables allow Romanized transcription to be sorted correctly when the expected sort order differs from standard ‘ABC’ order. Other writing systems represented in Romanized transcription with non-standard sorts require dedicated sort tables. Alternatively, writing systems represented in native script via Unicode may be sorted via their Unicode weights. This has been tested using standard ABC weights for English, custom weights for Mayan transcription, Unicode weights for Hebrew transcription of Aramaic, and Pinyin sort weights for Chinese logographs.
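A custom sort table of the kind described in Part Three can be sketched as a weight lookup with longest-match tokenization. The ORDER list below is a made-up fragment for illustration only; it is not READ's actual sort table, and sort_key is a hypothetical helper.

```python
import unicodedata
from typing import Tuple

# Hypothetical ordering of transcription units (illustration only).
ORDER = ["a", "\u0101", "i", "u", "k", "kh", "g", "gh", "t", "th"]  # "\u0101" is a with macron
WEIGHTS = {unicodedata.normalize("NFC", unit): weight for weight, unit in enumerate(ORDER)}
MAX_LEN = max(len(unit) for unit in WEIGHTS)


def sort_key(word: str) -> Tuple[int, ...]:
    """Turn a Romanized word into a tuple of weights, matching the longest unit first."""
    word = unicodedata.normalize("NFC", word)
    key = []
    i = 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in WEIGHTS:
                key.append(WEIGHTS[chunk])
                i += len(chunk)
                break
        else:
            # Characters missing from the table sort after all known units.
            key.append(len(WEIGHTS) + ord(word[i]))
            i += 1
    return tuple(key)


words = ["kha", "ka", "\u0101k", "ga", "ak"]
print(sorted(words, key=sort_key))
```

With this table, "kh" is weighted as a single unit that follows "k", and the macron vowel follows plain "a", which a plain code-point sort would not do.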

These investigations indicate that the READ architecture is generalizable, and that the READ platform could be employed by projects with a focus on documents in any writing system.