Ontological Challenges in Editing Historic Editions of the Encyclopedia Britannica

Peter M Logan (peter.logan@temple.edu), Temple University, United States of America

First published in 1771, Encyclopedia Britannica continues in publication today and is the only encyclopedia in any language to survive that 250-year period. Historical editions of the Encyclopedia offer scholars a unique means of examining the evolution of ideas and beliefs about sensitive cultural topics – such as suicide, race, and hysteria – by studying their treatment in different editions. But what can this curated dataset as a whole tell us about larger patterns in the social construction of knowledge in the nineteenth-century English-speaking world?

We are creating a data set of all text from these historic editions for use in text mining. The corpus will include over 100,000 entries, all of which need to be tagged with essential metadata fields. How do we identify the different subject areas in this body of knowledge? This article briefly discusses the use of an automatic-metadata-generating algorithm, HIVE, created by the Metadata Research Center at Drexel University. But the central issue it addresses is the theoretical problem encountered in defining a subject vocabulary for this corpus.

The Encyclopedia claims to represent the “Sum of Human Knowledge,” and while we can dispute this claim, it nonetheless represents the existence of older knowledge taxonomies used in its creation. How do we construct a subject vocabulary without distorting this older organizational scheme for subject categories? Those older vocabularies were clearly biased. For example, the decisions to include or exclude entries, as well as the size assigned to each entry, were all based on assumptions about what mattered as “legitimate” knowledge. Many of these are assumptions we no longer share; the editors excluded forms of knowledge rooted in folk and tribal cultures, and female authors were wholly absent until 1889. Racism and the perspective of British Imperialism are evident in many entries. These prejudices reflect the social beliefs of the writers and editors, of course, and as such, they illustrate the degree to which knowledge in the nineteenth century was socially constructed. The invented nature of that taxonomy needs to be captured accurately. The value of the curated content of Britannica to researchers today is that it is the most comprehensive representation we have of that older knowledge system in its totality, and so it makes it possible to study that system as a structure and to observe how it changed over time.

The problems in tagging this biased dataset take three forms. First is the danger of historical anachronism. Applying a twenty-first-century ontology, like Library of Congress Subject Areas, to eighteenth- and nineteenth-century editions makes them accessible to modern researchers, but it also misrepresents the older system of knowledge. For example, the entries on “History” from the important 3rd (1797) and 7th (1842) editions present authoritative accounts of human prehistory. While we might tag them under “anthropology,” that field of knowledge was not recognized by the Royal Academy of Sciences until the 1880s (as a subset of Biology) and does not appear in the Encyclopedia itself until 1889. In fact, the older references cite the Book of Genesis as their authority, and a tag on applications of scripture to the interpretation of external reality might better represent the entry than an anachronistic “Anthropology, history of” tag could do.

The second difficulty arises when trying to reconstruct the older ontology used by the Encyclopedia, because it was a moving target. Subject categories changed over the first 150 years, with new categories added, others (like human prehistory) moving from one field to another, and still others disappearing. While we might construct a stable ontology for one edition, any historically accurate ontology will have to become a system of multiple ontologies, whose relationships with one another need, at the very least, to be explained.
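One way to picture such a system of multiple ontologies is as a set of per-edition vocabularies joined by explicit, labeled links. The following Python sketch is purely illustrative and is not the project's implementation: the names `Term`, `OntologySet`, `build_example`, and the relation label `"human prehistory moved to"` are hypothetical, and the single link shown is only our reading of the human-prehistory example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Term:
    """A subject term as it exists in one edition (keyed here by year)."""
    year: int
    label: str


@dataclass
class OntologySet:
    """Hypothetical container: one vocabulary per edition plus labeled links."""
    terms: Dict[int, Set[str]] = field(default_factory=dict)
    links: List[Tuple[Term, str, Term]] = field(default_factory=list)

    def add_term(self, year: int, label: str) -> Term:
        # Register the label in that edition's own vocabulary only.
        self.terms.setdefault(year, set()).add(label)
        return Term(year, label)

    def link(self, earlier: Term, relation: str, later: Term) -> None:
        # Record how a term in one edition relates to a term in a later one.
        self.links.append((earlier, relation, later))

    def successors(self, term: Term) -> List[Tuple[str, Term]]:
        # Follow links forward in time from the given term.
        return [(rel, later) for earlier, rel, later in self.links
                if earlier == term]


def build_example() -> OntologySet:
    """Illustrative data only, based on the prehistory example in the text."""
    onto = OntologySet()
    history_1797 = onto.add_term(1797, "History")
    onto.add_term(1842, "History")
    anthropology_1889 = onto.add_term(1889, "Anthropology")
    onto.link(history_1797, "human prehistory moved to", anthropology_1889)
    return onto


if __name__ == "__main__":
    onto = build_example()
    for relation, later in onto.successors(Term(1797, "History")):
        print(f"History (1797) -> {relation} -> {later.label} ({later.year})")
```

Keeping each edition's vocabulary separate means that a term such as "Anthropology" is never projected back onto an edition that lacked it, while the links carry the explanation of how categories relate across editions.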

Third is the question of how to treat the built-in biases within the corpus. Older ontologies of knowledge are rife with bias, often through omission. Historically accurate subject terms duplicate that bias. Information on attitudes toward women and national minorities, for example, exists within multiple entries, but there are no subject terms for minorities and no entries for women as such, making that data largely invisible without some form of intervention.

We are in the process of creating this new dataset and by summer of 2018 we will be completing preliminary tests on tagging systems, so the final paper will share preliminary results.