Creating and Implementing an Ontology of Documents and Texts

Peter Robinson (peter.robinson@usask.ca), University of Saskatchewan, Canada

Outline

The application of computing methods to scholarly editing was one of the first major areas to be explored in the nascent discipline of “Humanities Computing”, the direct ancestor of what we now know as “Digital Humanities”. The highly structured nature of scholarly editions, with their formal links between text and carefully crafted apparatus, and the promise that complex patterns in historical texts might be usefully explored by computer, made them obvious targets for the early application of computing to humanities materials (Hockey; see also early work by Dearing and by Petty and Gibson). The first formal attempt at systematic computer representation of texts, the Text Encoding Initiative, analyzed these structures into a formal encoding scheme, itself building on the principles set out in DeRose et al.’s landmark 1990 publication, “What is Text, Really?”.

While the TEI encodings created from 1990 onward proved a solid foundation for many scholarly editions in digital form, scholars recognized from the very beginning a fundamental problem when these encodings are applied to scholarly editions. At its most basic: one wants to see the text of a manuscript on screen as it appears in the manuscript, page by page and line by line. But one also wants to see that text according to its logical structure as an act of communication: that is, as composed of segments (acts and scenes; or stanzas and lines; or chapters, paragraphs and sentences). Because these two views almost never correspond, we have what is usually termed the problem of “overlapping hierarchies”: paragraphs cross page boundaries; manuscripts contain multiple works, distributed in complex ways across the pages.
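The difficulty can be made concrete with a minimal sketch. It assumes that each page and each paragraph is described by a half-open span of character offsets; the offsets and the function names below are invented for illustration. Two sets of spans fit into a single tree only if every pair is either disjoint or nested, and a paragraph that crosses a page boundary breaks that condition.

```python
from itertools import product

def nests_or_disjoint(a, b):
    """True if half-open spans a and b are disjoint or one contains the other."""
    (a0, a1), (b0, b1) = a, b
    disjoint = a1 <= b0 or b1 <= a0
    nested = (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1)
    return disjoint or nested

def fits_one_tree(spans_a, spans_b):
    """True if the two sets of spans could be held in one hierarchy."""
    return all(nests_or_disjoint(a, b) for a, b in product(spans_a, spans_b))

# Invented offsets: two pages, and three paragraphs, the second of
# which runs across the page boundary at offset 100.
pages = [(0, 100), (100, 200)]
paragraphs = [(0, 60), (60, 140), (140, 200)]

# Lines, by contrast, sit neatly inside pages.
lines = [(0, 50), (50, 100), (100, 200)]

pages_and_paragraphs = fits_one_tree(pages, paragraphs)  # False
pages_and_lines = fits_one_tree(pages, lines)            # True
```

Pages and lines nest, so one tree can hold both; pages and paragraphs do not, which is the situation the term “overlapping hierarchies” describes.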

Many papers have addressed the issue of “overlapping hierarchies” (DeRose; Sperberg-McQueen and Huitfeldt), and this author has wrestled with it across multiple editions and operating systems. In 2010, the author commenced work on a new system for collaborative online scholarly editing, “Textual Communities”. A key aim was that this system would seek a robust and fundamental solution to the problem described as “overlapping hierarchies”. Accordingly, the first task was to rethink exactly what we mean by the terms “document”, “work” and “text”. For this, the author went to textual scholarship, which has been considering the meaning of these terms for centuries. In a series of articles (2013a, 2013b, 2017) the author has explored their meaning, with the 2013a article most clearly anchoring his perceptions in the traditions of textual scholarship. In summary, these terms are defined as follows:

  1. A text is an act of communication instanced in a document
  2. The act of communication is composed of an ordered hierarchy of objects (Acts and scenes; or stanzas and lines; or chapters, paragraphs and sentences): hence, a tree
  3. The document is composed of an ordered hierarchy of objects: the volume, divided into quires, divided into leaves, divided into recto and verso pages, divided into columns, divided into lines (or, surfaces, divided into zones, etc): hence, a tree

In this analysis, every text is composed of two distinct and independent hierarchies: one tree for the document, and one for the act of communication. Both trees are essential. An act of communication cannot exist unless it is physically instantiated in a document. If the document does not present an act of communication, then it is simply marks on paper, without lexical meaning.

Textual Communities formalizes these definitions in an ontology. The naming system used by this ontology is based on the well-known Kahn-Wilensky system (1995): each name commences with a naming authority (in this example, TC:CTP) and then uses a sequence of property/value pairs to specify each object. The example below describes that part of paragraph 291 of the Parson’s Tale (“PA”) which appears in line 40 of folio 232v of the Hengwrt manuscript of the Canterbury Tales:

  1. The document hierarchy: TC:CTP/Document=Hg:Page=232v:Line=40
  2. The act of communication hierarchy: TC:CTP/Entity=CTP:Part=PA:ab=291
  3. The text, combining both hierarchies: TC:CTP/Entity=CTP:Part=PA:ab=291:Document=Hg:Page=232v:Line=40
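To show how such names decompose, here is a minimal sketch of a parser. It assumes, from the three examples above only, that the naming authority is separated from the rest by “/”, that property/value pairs are separated by “:”, and that the document part of a combined name begins at the “Document” property. The function and class names are hypothetical and are not taken from Textual Communities.

```python
from typing import NamedTuple, Tuple

class Identifier(NamedTuple):
    authority: str                      # naming authority, e.g. "TC:CTP"
    pairs: Tuple[Tuple[str, str], ...]  # ordered property/value pairs

def parse_identifier(name: str) -> Identifier:
    """Split 'AUTHORITY/Prop=Val:Prop=Val' into authority and pairs."""
    # Split on "/" first, because the authority itself contains a colon.
    authority, _, path = name.partition("/")
    pairs = []
    for segment in path.split(":"):
        prop, sep, value = segment.strip().partition("=")
        if not sep:
            raise ValueError(f"expected property=value, got {segment!r}")
        pairs.append((prop, value))
    return Identifier(authority, tuple(pairs))

def split_hierarchies(ident: Identifier):
    """Separate entity pairs from document pairs in a combined name.

    Assumption: the document part starts at the 'Document' property.
    """
    props = [prop for prop, _ in ident.pairs]
    cut = props.index("Document") if "Document" in props else len(props)
    return ident.pairs[:cut], ident.pairs[cut:]

text = parse_identifier(
    "TC:CTP/Entity=CTP:Part=PA:ab=291:Document=Hg:Page=232v:Line=40"
)
entity_path, document_path = split_hierarchies(text)
```

Under these assumptions, `entity_path` holds the three pairs that locate paragraph 291 in the act of communication, and `document_path` holds the three pairs that locate line 40 in the manuscript.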

In this formulation, every text is composed of a sequence of leaves, with every leaf shared by two distinct trees. Thus the “leaf” of text contained in line 40 of folio 232v occupies TC:CTP/Document=Hg:Page=232v:Line=40; that same text is part of TC:CTP/Entity=CTP:Part=PA:ab=291. The power of the system can be readily appreciated. First, one may travel through the document hierarchy to show the text page by page, line by line. Second, one may travel through the act of communication (“entity” in our system) hierarchy to find the different versions of paragraph 291 in multiple manuscripts and compare them. In this analysis, what we term “overlapping hierarchies” is a symptom, a result of the underlying system of distinct trees sharing leaves.
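The idea of leaves shared by two trees can be sketched in a few lines. This is an illustration only, not the data model of Textual Communities: the `Leaf` class and the `under` function are hypothetical, the leaf texts are placeholders rather than transcriptions, and every location other than Hg 232v line 40 / paragraph 291 (including the second manuscript “MsB”) is invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Leaf:
    text: str                   # placeholder, not a real transcription
    document: Tuple[str, ...]   # path in the document tree
    entity: Tuple[str, ...]     # path in the act-of-communication tree

LEAVES = [
    Leaf("<Hg: part of ab 291>", ("Hg", "232v", "40"), ("CTP", "PA", "291")),
    # The two leaves below are invented for illustration.
    Leaf("<Hg: start of ab 292>", ("Hg", "232v", "41"), ("CTP", "PA", "292")),
    Leaf("<MsB: ab 291>", ("MsB", "1r", "1"), ("CTP", "PA", "291")),
]

def under(leaves, tree, prefix):
    """Return the leaves whose path in the named tree starts with prefix."""
    prefix = tuple(prefix)
    return [leaf for leaf in leaves
            if getattr(leaf, tree)[:len(prefix)] == prefix]

# First traversal: the document tree, to show one page of one manuscript.
page_view = under(LEAVES, "document", ("Hg", "232v"))

# Second traversal: the entity tree, to gather every version of ab 291,
# grouped by the manuscript in which each version appears.
versions = defaultdict(list)
for leaf in under(LEAVES, "entity", ("CTP", "PA", "291")):
    versions[leaf.document[0]].append(leaf.text)
```

Each leaf is stored once and carries both paths, so neither traversal needs to know anything about the other tree.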

Theory is one thing; implementation is another. We wanted a system that could be updated in real time. (Here, I speak of “we” as the progressing work became more and more a collaborative project.) That is: a manuscript page could be transcribed, the order and structure of the text on the page rearranged, deleted or replaced, and the results written near-instantly to a storage system and made available immediately to others. Over a long text (20,000 lines of the Canterbury Tales) in many manuscripts (88 for the Tales, some 30,000 pages in total) this is rather challenging. One may compare this with removing leaves from the trees, rebuilding the branches to which they were attached, and then reattaching the leaves, all in a howling gale. A brief attempt to use an XML database (in this case, XML DB, now maintained by Oracle) revealed substantial performance problems. For several years, we used a MySQL relational database. But the tables linking the distinct trees rapidly became so complex, and the queries required to manipulate them so unwieldy, that we abandoned it. Finally we moved to representing all documents in JSON form, and then storing and retrieving them through a JSON document store (MongoDB). This has proved complex, but very fast and effective. We are able to represent the two hierarchies precisely within the JSON store, in a manner which permits real-time updating and retrieval. Indeed, one could extend the model we apply beyond two hierarchies: a text could be composed of as many hierarchies as one likes.
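A rough sketch may help to show why JSON records suit this model. It is not the schema used by Textual Communities: the field names (`_id`, `doc`, `entity`, `text`), the helper functions and the sample records are all assumptions made for illustration, and a plain Python list stands in for the document store. Each leaf record carries its path in both hierarchies, so retranscribing a page means replacing only the records under that page.

```python
import json

# Hypothetical leaf records; texts are placeholders, and every location
# other than Hg 232v line 40 / ab 291 is invented.
store = [
    {"_id": 1, "doc": ["Hg", "232v", "40"], "entity": ["CTP", "PA", "291"], "text": "<text A>"},
    {"_id": 2, "doc": ["Hg", "232v", "41"], "entity": ["CTP", "PA", "292"], "text": "<text B>"},
    {"_id": 3, "doc": ["Hg", "233r", "1"], "entity": ["CTP", "PA", "292"], "text": "<text C>"},
]

def find(store, field, prefix):
    """Return records whose path in the given hierarchy starts with prefix."""
    n = len(prefix)
    return [rec for rec in store if rec[field][:n] == list(prefix)]

def replace_page(store, page_path, new_leaves):
    """Swap in a fresh transcription of one page; other records are kept."""
    n = len(page_path)
    kept = [rec for rec in store if rec["doc"][:n] != list(page_path)]
    next_id = max((rec["_id"] for rec in store), default=0) + 1
    added = [dict(leaf, _id=next_id + i) for i, leaf in enumerate(new_leaves)]
    return kept + added

# Retranscribe folio 232v: the page now holds a single revised leaf.
store = replace_page(store, ["Hg", "232v"], [
    {"doc": ["Hg", "232v", "40"], "entity": ["CTP", "PA", "291"], "text": "<revised text A>"},
])

page = find(store, "doc", ["Hg", "232v"])                # document hierarchy
paragraph = find(store, "entity", ["CTP", "PA", "292"])  # entity hierarchy
serialized = json.dumps(store)                           # records are plain JSON
```

In this sketch a third hierarchy would simply be another path field on each record, queried with the same `find` function.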

The first public version of Textual Communities (after seven years of work) will be released in the first half of 2018, and the author will propose a workshop on the system at the conference. This paper will show the full system briefly. It is designed to be easy to use, to the point that a textual scholar with no special computer expertise will be able to use it to create an edition. Further, the implementation of the underlying database in JSON, and the use of JavaScript throughout the system, should make it possible for programmers expert in JavaScript (and with no expertise in XML) to make complex critical editions. The system also contains tools for managing a large collaborative edition, with transcription managed page by page. The sophisticated Collation Editor, developed by the New Testament Greek edition projects in Birmingham, England and Münster, Germany, and itself built on CollateX, is also integrated.

This work raises many questions. XML is currently used for basic document input, and for transcription page by page. However, the inability of XML to fully represent more than a single hierarchy in a single document is a serious impediment to Textual Communities. In essence, the textual model we implement in Textual Communities is more powerful than XML can express. Our hope is that others will take up this challenge and find ways to move past this weakness in XML. Indeed, we do not offer Textual Communities as, in any sense, a definitive system. It is a first attempt to implement the ontology of text and document upon which it is built. We hope and expect that others will do better than this system.


Appendix A

Bibliography
  1. Dearing, Vinton A. 1962. “Methods of Textual Editing”. Clarke Library Lecture. University of California.
  2. DeRose, S., David Durand, Elli Mylonas, and Allen Renear. 1990. “What is Text, Really?”. Journal of Computing in Higher Education, pp. 3-26.
  3. DeRose, S. 2004. “Markup Overlap: A Review and a Horse”. Proceedings of Extreme Markup Languages 2004. Rockville: Mulberry Technologies.
  4. Hockey, Susan. 1980. Guide to Computer Applications in the Humanities. Duckworth and Johns Hopkins.
  5. Kahn, Robert E., and Robert Wilensky. 1995. “A Framework for Distributed Digital Object Services”. Available at http://www.cnri.reston.va.us/home/cstr/arch/k-w.html.
  6. Petty, George R., and William M. Gibson. 1970. Project OCCULT: The Ordered Computer Collation of Unprepared Literary Text. New York: New York University Press.
  7. Robinson, P. 2013a. “The Concept of the Work in the Digital Age”. In Barbara Bordalejo (ed.), Work, Text and Document in the Digital Age. Ecdotica 10, 13-41.
  8. Robinson, P. 2013b. “Towards a Theory of Digital Editions”. Variants 10, 105-132.
  9. Robinson, P. 2017. “Some Principles for Making Collaborative Scholarly Editions in Digital Form”. Digital Humanities Quarterly 11.2.
  10. Sperberg-McQueen, C. M., and Claus Huitfeldt. 2004. “GODDAG: A Data Structure for Overlapping Hierarchies”. In DDEP/PODDP, pp. 139-160.