Interpreting Difference among Transcripts

Michael Sperberg-McQueen (cmsmcq@acm.org), Black Mesa Technologies LLC, United States of America and Claus Huitfeldt (Claus.Huitfeldt@uib.no), University of Bergen

1. Introduction

The semantics and logic of transcription have received attention from a number of digital humanists, some starting from the practice of digital editions [Pierazzo 2011, 2015], some from a consideration of markup languages [Robinson 1994, Huitfeldt 1995], some from a critical examination of the foundations of digital humanities [Caton 2009, 2013a, 2013b, 2014].

Attempts to describe how transcripts provide information about their exemplars [Sperberg-McQueen et al. 2014, Huitfeldt/Sperberg-McQueen 2017] have focused on individual transcripts, not on multiple divergent transcripts of the same exemplar. Here we describe ways in which transcripts of the same exemplar can differ, and we sketch a model of transcription that accounts for such differences.

2. Examples

Our catalog of ways in which transcripts differ and disagree takes the form of examples, many illustrating exceptions to the general rule that a transcript reflects “the exemplar, the whole exemplar, and nothing but the exemplar” and that competent transcribers will agree on the reading of the exemplar [Sperberg-McQueen et al. 2014].

For brevity, examples often consider only single words; discussions refer to the first of several transcripts as A, the second as B, an arbitrary transcript as T, and the exemplar as E. Some transcripts were constructed for this paper.

2.1. Transcripts which differ and disagree

2.1.1. Example LW: what type does this token instantiate?

E is a word from Ludwig Wittgenstein’s notebook 117 (p. 269), written in a simple substitution code [Wittgenstein, n.d.]. A and B, ignorant of the cipher, transcribe it as “munonyqi” and “wunouyqi”, respectively. C, better informed, has “muuvnyzi” (“offenbar”).

The transcripts reflect contradictory readings of the token in E; at most one can be correct.

Here all transcripts agree on which marks in E are tokens, but disagree on the types they instantiate. We infer that a transcript’s mapping from tokens in E to types is a salient feature for modeling.
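The disagreement in example LW can be located token by token. The following is a minimal sketch, assuming that the transcripts recognise the same tokens in the same order so that characters can be paired off by position; the function name is invented for illustration.

```python
def type_disagreements(a, b):
    """List (position, type_in_a, type_in_b) wherever two readings diverge.

    Assumes both transcripts identify the same tokens in the same order,
    so that their characters can be paired off by position.
    """
    if len(a) != len(b):
        raise ValueError("transcripts identify different numbers of tokens")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


A, B, C = "munonyqi", "wunouyqi", "muuvnyzi"
print(type_disagreements(A, B))  # [(0, 'm', 'w'), (4, 'n', 'u')]
print(type_disagreements(A, C))  # three positions differ
```

Every position reported is a token that both transcripts recognise but map to different types.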

2.1.2. Example MCN: which marks are tokens?

E is a tombstone from northwestern Britain [Collingwood/Wright 1965-1990, no. 932].

A [Lafleur 2010 p. 28f.] reads the mark between some word pairs as a punctum:

DIS

MANIB · M · COCCEI

NONNI · ANNOR · VI

HIC SITVS EST

B is similar except for the last line: ‘HIC · SITVS · EST’.

Here A and B do not disagree over the reading of any tokens; they disagree on what marks in E are tokens. A formal account must distinguish the identification of tokens from the mapping of tokens to types.

2.1.3. Example TE: what is the structure of this text?

At the eastern end of Magdeburg cathedral lies the Tumba Edithae (tomb of Edith), with an inscription, part of which is shown here.

A [Neugebauer/Brandl 2012] begins reading on the south:

B begins reading on the north with “INPVLSV” but otherwise resembles A.

A and B agree in their readings of each individual character and word, and also in identifying which marks in E are writing; they differ only in the higher-level structure(s) compounded from words and characters. A model of transcription must therefore treat such higher-level structural organization as a substantive part of transcription; [Huitfeldt et al. 2010] argue similarly.

2.2. Transcripts which differ without disagreeing

That some differences between transcripts do not signal disagreements about E goes (almost) without saying. A and B can differ in pagination and running heads without disagreeing on how to read E: page furniture is normally an exception to the general rule that everything in T transcribes something in E.

2.2.1. Example JA: literal transcription and marked corrections

E is a word from a letter of Jane Addams [Hajo et al. 2015-].

A writes “altho”, B “altho[ugh]” (bracketed italics mark editorial additions), and C “[although]” (brackets mark editorial interventions). A and B thus differ in content but agree that E has “altho”. B and C provide the same normalized spelling but provide different (albeit compatible) information about E.

B and C assign special meaning to brackets and bracketed material: unlike other characters, they transcribe nothing in E.

An account of transcription must specify which tokens in the transcript are to be interpreted as transcribing tokens in E and which not.
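As a rough illustration of the point, the sketch below assumes a single convention under which square brackets and everything inside them are special, that is, transcribe nothing in E. The function name and the regular expression are assumptions of the sketch, not part of any edition's stated practice.

```python
import re

# Assumed convention (cf. transcripts B and C): square brackets and whatever
# they enclose are special tokens of T and transcribe nothing in E.
BRACKETED = re.compile(r"\[[^\]]*\]")


def transcribed_text(transcript):
    """Return only those characters of the transcript that transcribe E."""
    return BRACKETED.sub("", transcript)


for label, t in [("A", "altho"), ("B", "altho[ugh]"), ("C", "[although]")]:
    print(label, repr(transcribed_text(t)))
```

Under this assumption A and B yield the same string, while C yields the empty string: its bracketed word records an editorial intervention but makes no literal claim about the characters in E.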

2.2.2. Example SJ: long and short s

E is one word from a sonnet by Sor Juana Inés de la Cruz [Sor Juana 1700 p. 163].

A (a web site presenting Sor Juana’s work) writes “vista”, B “viſta”.

A and B differ but do not disagree. Both identify the third character of this word as a lower-case S; B further specifies a long S. If we take A to be ambiguous (the S could be long or short), then A subsumes B: B provides additional but not contradictory information.

In E, however, long and short S are allographs in complementary distribution; typographic context determines which appears. In vi_ta, S will always be long, not short. So in reality A provides the same information as B, not less.

Many differences without disagreement arise where one transcript preserves allographic differences and the other preserves graphemes. Arguments on the topic involve no disagreements about E, only about the choice of type system. A model should clarify the role of type systems in transcription.
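The relation between an allographic and a graphemic type system can be sketched as a mapping from the finer types to the coarser ones. The table below is hypothetical and covers only the long s needed for example SJ.

```python
# Hypothetical allograph-to-grapheme table; only the long s is needed here.
GRAPHEME_OF = {"ſ": "s"}


def to_graphemes(text):
    """Replace each allograph by its grapheme, leaving other characters alone."""
    return "".join(GRAPHEME_OF.get(ch, ch) for ch in text)


A = "vista"
B = "viſta"
print(A == B)                              # False: the transcripts differ
print(to_graphemes(A) == to_graphemes(B))  # True: they agree at grapheme level
```

Once both transcripts are expressed in the coarser type system the difference disappears, which is one way of saying that A and B differ without disagreeing.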

2.2.3. Example TJ: word-level and character-level fidelity

E is from Thomas Jefferson’s draft of the U.S. Declaration of Independence.

A: the laws of nature & of nature’s god

B: the laws of nature and of nature’s God

A [Boyd 1950- p. 1:423] preserves and reinstantiates the type of each character, while B [Harrison 1993 p. 39] preserves and reinstantiates the type of each word but not each character. (That is, it normalizes the spelling of words.)

Here the type systems of A and B diverge even more sharply than above. A formal account must be able to describe type-system differences of this kind.

2.2.4. Example FK: typographic rendering of inscription details

E is the words “para siempre” in a letter from Frida Kahlo to Diego Rivera [Kahlo 1940]. A and B differ only in how they render the underscoring in E: one uses italics, the other underscoring.

Typographic features of T often convey information about E, but different transcribers use different conventions. A formalization must account for such conventions.

2.2.5. Example FE: completeness and incompleteness

E is a grave marker (Naples, fourth/fifth century CE) now in the Jewish Museum, New York (JM3-50).

A:	B:

B transcribes all of E, A [Lafleur 2010 p. 144] only the Latin text. Any model must describe how we know which material in E is transcribed and which (if any) is not.

2.2.6. JLM: Transcripts which disagree without differing

It is hard to find plausible examples of this class of phenomena. But an imaginary example may illustrate it. If A uses italics to mark editorial insertions, and B to represent underlining in E, then

John loves Mary.

will mean different and contradictory things in A and B.

Differences in typographic conventions and type system can lead to conflicting interpretations of T. A model must describe how such conflicts arise.
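A small sketch may make the conflict concrete. It represents T as a list of (word, italic) pairs and reads it under each of the two conventions. Which word is italic is an assumption made purely for illustration, as are the function names and the labels "plain" and "underlined".

```python
# T as (word, is_italic) pairs. ASSUMPTION: the second word is the italic one;
# this choice is made only so that the sketch has something to work on.
T = [("John", False), ("loves", True), ("Mary.", False)]


def read_italics_as_insertion(t):
    """Convention of A: italic words are editorial insertions, absent from E."""
    return [(word, "plain") for word, italic in t if not italic]


def read_italics_as_underlining(t):
    """Convention of B: italic words stand in E, where they are underlined."""
    return [(word, "underlined" if italic else "plain") for word, italic in t]


print(read_italics_as_insertion(T))
print(read_italics_as_underlining(T))
```

The same T thus yields two readings of E that cannot both be right: under the first convention the italic word is not in E at all, under the second it is present and underlined.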

3. Formal model

Space constraints limit us to a sketch.

We assume the concepts type, token, and document. Types and tokens are not limited to graphemes or words but include larger structures. The document itself is typically a compound token, and its text a compound type.

A set of mutually exclusive types we call a type inventory. Tokens instantiate exactly one type in an inventory: a letter is an I or a J, but not both. Transcriptions commonly involve not one type inventory but several. (“I” is both a letter and a word.) A set of type inventories is a type system.

A reading of a token k with respect to a type inventory I maps k to a type p in I; we write (k, I, p) for such a reading.

A reading of a document D identifies a set K of tokens in D and maps them to types. We write R = (D, K, P, M), where P is a type system and M a set of triples (k, I, p) where k ∈ K, I ∈ P, p ∈ I. Every k in K maps to at least one type; none maps to two types in the same inventory.
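The constraints on a reading can be checked mechanically. The sketch below is one possible encoding, not a definitive one: the class and function names are invented, tokens and documents are represented by plain identifiers, and the toy inventories follow the "I" example given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inventory:
    """A type inventory I: a named set of mutually exclusive types."""
    name: str
    types: frozenset


@dataclass(frozen=True)
class Reading:
    """A reading R = (D, K, P, M) of a document."""
    document: str           # D, here just an identifier
    tokens: frozenset       # K, identifiers of the tokens recognised in D
    type_system: frozenset  # P, a set of Inventory objects
    mapping: frozenset      # M, a set of triples (k, I, p)


def violations(reading):
    """Return a description of every constraint the reading breaks."""
    problems = []
    types_seen = {}  # (token, inventory) -> set of types assigned to it
    for k, inv, p in reading.mapping:
        if k not in reading.tokens:
            problems.append(f"{k} is not in K")
        if inv not in reading.type_system:
            problems.append(f"inventory {inv.name} is not in P")
        if p not in inv.types:
            problems.append(f"type {p} is not in inventory {inv.name}")
        types_seen.setdefault((k, inv), set()).add(p)
    for (k, inv), types in types_seen.items():
        if len(types) > 1:
            problems.append(f"{k} maps to several types in {inv.name}")
    mapped = {k for k, _, _ in reading.mapping}
    for k in sorted(reading.tokens - mapped):
        problems.append(f"{k} maps to no type")
    return problems


letters = Inventory("letters", frozenset({"I", "J"}))
words = Inventory("words", frozenset({"I", "A"}))
P = frozenset({letters, words})

good = Reading("D", frozenset({"k1"}), P,
               frozenset({("k1", letters, "I"), ("k1", words, "I")}))
bad = Reading("D", frozenset({"k1", "k2"}), P,
              frozenset({("k1", letters, "I"), ("k1", letters, "J")}))

print(violations(good))  # []
print(violations(bad))
```

The first reading maps one token to a type in each of two inventories, which is allowed; the second maps a token to two types in the same inventory and leaves another token unmapped, and both faults are reported.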

Examples MCN and TE illustrate differences in K, examples SJ and TJ differences in P, example LW differences in M.
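To show how two readings might be compared component by component, the following sketch encodes readings as plain dictionaries with keys "K", "P" and "M" and reports which components differ. The encoding, the mark identifiers and the inventory names are all assumptions made for the illustration.

```python
def differs_in(a, b):
    """Return the subset of {"K", "P", "M"} in which readings a and b differ.

    M is compared only over tokens and inventories that both readings share,
    so that a difference in K or P is not reported a second time as one in M.
    """
    found = set()
    if a["K"] != b["K"]:
        found.add("K")
    if a["P"] != b["P"]:
        found.add("P")
    shared_k = a["K"] & b["K"]
    shared_p = a["P"] & b["P"]

    def restricted(r):
        return {(k, i, p) for (k, i, p) in r["M"]
                if k in shared_k and i in shared_p}

    if restricted(a) != restricted(b):
        found.add("M")
    return found


# Last line of example MCN; the identifiers m1..m5 are invented for the sketch.
words = {("m1", "word", "HIC"), ("m3", "word", "SITVS"), ("m5", "word", "EST")}
puncta = {("m2", "punctuation", "punctum"), ("m4", "punctuation", "punctum")}
mcn_a = {"K": {"m1", "m3", "m5"}, "P": {"word", "punctuation"}, "M": words}
mcn_b = {"K": {"m1", "m2", "m3", "m4", "m5"}, "P": {"word", "punctuation"},
         "M": words | puncta}

# First character of example LW, read as "m" by A and as "w" by B.
lw_a = {"K": {"c1"}, "P": {"letter"}, "M": {("c1", "letter", "m")}}
lw_b = {"K": {"c1"}, "P": {"letter"}, "M": {("c1", "letter", "w")}}

print(differs_in(mcn_a, mcn_b))  # {'K'}
print(differs_in(lw_a, lw_b))    # {'M'}
```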

Transcription policies determine which tokens in E are transcribed (normal) and which not (special); similarly which tokens in T transcribe E (normal) and which do not (special). They also constrain the type system by distinguishing some types and equating some with each other. A transcription policy is thus a triple (SE, ST, Q), where SE and ST are predicates true of all and only the special tokens in E and T respectively, and Q is a set of type equivalences.
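One way to picture such a triple is as a small record holding two predicates and a set of equivalence classes. The sketch below is illustrative only: the class name, the representation of tokens as strings, and the sample policy (which treats only the bracket characters themselves as special, and equates the types "italic" and "underscored") are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Policy:
    """A transcription policy (SE, ST, Q)."""
    special_in_e: Callable[[str], bool]      # SE: true of special tokens in E
    special_in_t: Callable[[str], bool]      # ST: true of special tokens in T
    equivalences: FrozenSet[FrozenSet[str]]  # Q: sets of types equated with each other

    def same_type(self, p, q):
        """True if p and q are identical or equated by some class in Q."""
        return p == q or any(p in cls and q in cls for cls in self.equivalences)


# A hypothetical policy, simplified for the sketch.
policy = Policy(
    special_in_e=lambda token: False,               # every token in E is transcribed
    special_in_t=lambda token: token in {"[", "]"},  # brackets transcribe nothing
    equivalences=frozenset({frozenset({"italic", "underscored"})}),
)

print(policy.same_type("italic", "underscored"))  # True
print(policy.same_type("italic", "bold"))         # False
print(policy.special_in_t("["))                   # True
```

Storing Q as a set of classes rather than a set of pairs keeps the equivalence test simple; the sketch does not merge overlapping classes, so a policy that needs transitivity across classes would have to state it explicitly.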

Examples FE and JLM illustrate differences in SE, example JA a difference in ST, and example FK a difference in Q.

From a reading of T we can reconstruct a reading of E based on an assumed transcription policy; this allows readers of T to have information about E without examining E directly.

Appendix A

Bibliography

Boyd, J. P., et al. (eds). (1950-). The Papers of Thomas Jefferson. Princeton: Princeton University Press. Quoted from (Stevens/Burg 1997), p. 81.
Caton, P. (2009). Lost in Transcription: Types, Tokens, and Modality in Document Representation. Paper given at DH 2009, held June 2009 at College Park, University of Maryland.
Caton, P. (2013a). Pure transcriptional markup. Paper given at DH 2013, held July 2013 at the University of Nebraska in Lincoln.
Caton, P. (2013b). On ‘text’ in Digital Humanities. Literary & Linguistic Computing 28.1 (2013): 209-220.
Caton, P. (2014). Six terms fundamental to modelling transcription. Paper given at DH 2014, held July 2014 at the University of Lausanne. Short version on the Web at http://dharchive.org/paper/DH2014/Paper-780.xml.
Collingwood, R. G., Wright, R. P. (eds). (1965-1990). The Roman inscriptions of Britain (RIB). Vol. 1 Oxford: Oxford Univ. Press; Vol. 2 Gloucester: Alan Sutton. Image and transcript of RIB 932 reproduced from (Lafleur 2010), pp. 28f.
Driscoll, M. J. (2006). Levels of transcription. In (Unsworth 2006). On the web at http://www.tei-c.org/About/Archive_new/ETE/Preview/driscoll.xml.
Hajo, C. M., et al. (eds). (2015-). Jane Addams Digital Edition. Mahwah, NJ: Ramapo College of New Jersey. https://digital.janeaddams.ramapo.edu. The letter cited is from Jane Addams to Florence Kelley, February 1, 1901. https://digital.janeaddams.ramapo.edu/items/show/64.
Harrison, M., and Gilbert, S. (eds). (1993). Thomas Jefferson Word for Word. La Jolla: Excellent Books. Quoted from (Stevens/Burg 1997), p. 82.
Hayford, H., Parker, H., and Tanselle, G. T. (eds). (1988). Moby Dick, or, The Whale. Vol. 7 of The Writings of Herman Melville. The Northwestern–Newberry Edition. Evanston [Ill.]: Northwestern University Press; Chicago: Newberry Library. Rpt. 1994, 1997.
[Huitfeldt, C.] (1993). MECS-WIT, A registration standard for the Wittgenstein Archives at the University of Bergen. [Bergen]: Wittgenstein Archives, 1993. Currently on the Web at http://folk.uib.no/fafch/oldstuff/mecswit.html.
Huitfeldt, C. (1995). Multi-dimensional texts in a one dimensional medium. Computers and the Humanities 28: 235-241.
Huitfeldt, C. (2006). Philosophy case study. In (Unsworth 2006). On the web at http://www.tei-c.org/About/Archive_new/ETE/Preview/huitfeldt.xml.
Huitfeldt, C., Marcoux, Y., and Sperberg-McQueen, C. M. (2010). Extension of the type/token distinction to document structure. Paper presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 – 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Huitfeldt01. On the Web at http://www.balisage.net/Proceedings/vol5/html/Huitfeldt01/BalisageVol5-Huitfeldt01.html.
Huitfeldt, C., and Sperberg-McQueen, C. M. (2017). Transcriptional implicature: Using a transcript to reason about an exemplar. Paper given at DH 2017, held August 2017 at McGill University, Montréal. Short version on the web at https://dh2017.adho.org/abstracts/235/235.pdf.
Kahlo, F. (1940). Letter to Diego Rivera, 1940. Emmy Lou Packard Papers 1900-1990, Archives of American Art, Smithsonian Institution. Facsimile of letter on the Web at https://www.aaa.si.edu/collections/items/detail/frida-kahlo-letter-to-diego-rivera-739.
Kline, M.-J. (1987). A guide to documentary editing. Baltimore and London: Johns Hopkins University Press; second edition 1998.
Lafleur, R. A. (2010). Scribblers, sculptors, and scribes: A companion to Wheelock’s Latin and other introductory textbooks. [New York]: Collins Reference.
Mommsen, T., et al. (eds). (1863-). Corpus Inscriptionum Latinarum. Berlin: Georg Reimer. Image and transcript of CIL 12 498 reproduced from (Lafleur 2010), p. 14.
Neugebauer, A., and Brandl, H. (2012). ubi sancta requiescit Aedith. Das Grabmal der Königin Editha im Magdeburger Dom. In Meller, H., et al. (eds), Königin Editha und ihre Grablegen in Magdeburg. (Archäologie in Sachsen-Anhalt, Sonderband 18.) Halle, pp. 33-54.
Pierazzo, E. (2011). A rationale of digital documentary editions. Literary & Linguistic Computing 26.4: 463-477.
Pierazzo, E. (2015). Digital scholarly editing: Theories, models and methods. Aldershot: Ashgate, 2015.
Robinson, P. (1994). The transcription of primary textual sources using SGML. Office for Humanities Communication Publications, Number 6. [Oxford: OHC].
Robinson, P., and Solopova, E. (1993). Guidelines for the transcription of the manuscripts of the Wife of Bath’s Prologue. In Blake, N., and Robinson, P. (eds), The Canterbury Tales Project Occasional Papers Volume I. Office for Humanities Communication Publications, Number 5. [Oxford: OHC], 1993.
Sanger, M. (1914). The Woman Rebel. Vol. 1 No. 1. From Katz, E., Hajo, C. M., and Engelman, P. C. (eds). The Margaret Sanger Papers. Sample from the MSP in the Model Editions Partnership at http://modeleditions.blackmesatech.com/mep/.
Sor Juana Ines de la Cruz. (1700). Fama y obras posthumas del fenix de Mexico, Decima musa, poetisa americana Sor Juana Ines de la Cruz, Religiosa profesa, [ed.] Don Juan Ignacio de Castorena y Visua. Madrid: Manuel Ruiz de Murga. Digitized version by the University of Bielefeld on Web at http://ds.ub.uni-bielefeld.de/viewer/image/1592397/1/; page 163 is at http://ds.ub.uni-bielefeld.de/viewer/image/1592397/153/as.
Sperberg-McQueen, C. M., Marcoux, Y., and Huitfeldt, C. (2014). Transcriptional implicature: A contribution to markup semantics. Paper given at DH 2014, held July 2014 at the University of Lausanne. Short version on the Web at http://dharchive.org/paper/DH2014/Paper-61.xml.
Stevens, M. E., and Burg, S. B. (1997). Editing historical documents: A handbook of practice. Walnut Creek, Ca.: Altamira Press, published in cooperation with the American Association for State and Local History, the Association for Documentary Editing, and the State Historical Society of Wisconsin.
Unsworth, J., O’Brien O’Keeffe, K., and Burnard, L. (eds) (2006). Electronic textual editing. New York: MLA.
Vander Meulen, D. L., and Tanselle, G. T. (1999). A system of manuscript transcription. Studies in Bibliography 52: 201-212.
Wittgenstein, L. (n.d.). Wittgenstein Source. Curator: Alois Pichler, Wittgenstein Archives at the University of Bergen (WAB). http://wittgensteinsource.org/. Includes material from the Bergen Electronic Edition (BEE) of Wittgenstein’s Nachlaß.
