A New Methodology for Error Detection and Data Completion in a Large Historical Catalogue Based on an Event Ontology and Network Analysis
The catalogue of Historical Hebrew Manuscripts, curated by the National Library of Israel, represents the largest collection in the world: over 130,000 Hebrew manuscripts that have survived through the last millennium and are currently scattered across a variety of institutions all over the globe. The catalogue was created by many different classifiers over a period of some 70 years. As a result, many of the fields are inconsistent and unorganized (Zhitomirsky-Geffet and Prebor, 2016). Moreover, a deeper examination of the data reveals missing and incorrect information (e.g. manuscripts with unknown date and place of writing). This missing and incorrect information poses a serious obstacle for researchers, who need reliable data to base their research on (Hric et al., 2016).
In this paper we present a novel approach for completion and correction of historical data from a large manuscript catalogue, based on an event-based ontology and network community analysis. To resolve data inconsistencies in the catalogue, we proposed an event-based ontology model in a previous study (Zhitomirsky-Geffet and Prebor, 2016). The ontology model is shown in Figure 1.
Figure 1: Ontology model of the manuscript data.
The proposed methodology comprises the following stages:
- Extraction of ontological entities from the catalogue data and ontology construction;
- Building networks of ontological entities based on direct and indirect ontological relationships between these entities, e.g. a network of censors who participated in the same Manuscript Censorship Events, or a bipartite network of manuscripts and the people related to them through some events;
- Automatic community identification in the constructed networks (Blondel et al., 2008);
- Outlier detection among the related events in the network or in the closest community, e.g. a manuscript whose creation event is dated later than its censorship event;
- Semi-automatic inference of missing data based on the ontological relationships and communities in the network, e.g. inferring the missing time and place in which a censor or author lived from the corresponding data of his peers in the community.
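The last two stages can be illustrated with a minimal Python sketch. The record layout, the field names (`created`, `censored`), the identifiers, the community assignment and the majority-vote rule are all assumptions made for illustration; none of them is taken from the catalogue or the ontology.

```python
from collections import Counter

# Hypothetical manuscript records: year of creation and year of censorship.
manuscripts = {
    "ms_a": {"created": 1450, "censored": 1595},
    "ms_b": {"created": 1610, "censored": 1597},  # creation dated after censorship
    "ms_c": {"created": None, "censored": 1600},  # unknown creation date
}


def find_date_outliers(records):
    """Return ids of manuscripts whose creation year is later than their censorship year."""
    return sorted(
        ms_id
        for ms_id, rec in records.items()
        if rec["created"] is not None
        and rec["censored"] is not None
        and rec["created"] > rec["censored"]
    )


# Hypothetical community assignment and known places of activity.
community_of = {"censor_1": 0, "censor_2": 0, "censor_3": 0, "censor_4": 1}
place_of = {
    "censor_1": "Mantua",
    "censor_2": "Mantua",
    "censor_3": None,  # missing in this toy data
    "censor_4": "Ferrara",
}


def suggest_place(censor, community_of, place_of):
    """Suggest the most common known place among the censor's community peers.

    Assumption: a simple majority vote; returns None when no peer has a known place.
    """
    peers = [
        c for c, com in community_of.items()
        if com == community_of[censor] and c != censor
    ]
    known = Counter(place_of[c] for c in peers if place_of[c] is not None)
    return known.most_common(1)[0][0] if known else None


outliers = find_date_outliers(manuscripts)
suggestion = suggest_place("censor_3", community_of, place_of)
```

A suggestion produced this way is only a candidate for expert review, in keeping with the semi-automatic nature of the inference stage.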
Here we present preliminary results of the proposed approach applied to the case of Censorship Events of Hebrew manuscripts in medieval Italy. In the context of the Counter-Reformation, during the 16th-18th centuries, the Catholic Church closely supervised written and printed literature. The Church appointed censors (most of them apostates and experts in the Hebrew language) to censor and approve Hebrew books.
Figure 2: A chord diagram of censors related by common manuscripts.
The diagram in Figure 2 highlights the most influential censors and shows the strength of their collaborations.
Figure 3: A network of censors in Italy with two relationship types – censors who worked on the same manuscript at the same time and censors who worked on the same manuscripts in different time periods (represented by red and grey links, respectively). Line thickness represents the number of joint manuscripts for a pair of censors. The censors were divided into seven communities by the Louvain algorithm (Blondel et al., 2008) (represented by different colours of nodes). Clicking on a node shows the time map of the corresponding censor.
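A network with the two link types described in the caption could be assembled roughly as follows. This is a sketch under stated assumptions: the event tuples and identifiers are hypothetical, and "same time" is interpreted here as sharing at least one censorship year, which the source does not specify.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical censorship events: (censor, manuscript, year).
events = [
    ("censor_1", "ms_a", 1595),
    ("censor_2", "ms_a", 1595),
    ("censor_3", "ms_a", 1610),
    ("censor_1", "ms_b", 1597),
    ("censor_2", "ms_b", 1597),
]


def build_censor_network(events):
    """Return {(censor_x, censor_y, link_type): number of joint manuscripts}."""
    # For every manuscript, collect the years in which each censor worked on it.
    years_by_ms = defaultdict(lambda: defaultdict(set))
    for censor, ms, year in events:
        years_by_ms[ms][censor].add(year)

    edges = Counter()
    for years_of in years_by_ms.values():
        for a, b in combinations(sorted(years_of), 2):
            # Assumption: "same time" means at least one shared year.
            link = "same_time" if years_of[a] & years_of[b] else "different_time"
            edges[(a, b, link)] += 1  # the weight corresponds to line thickness
    return edges


network = build_censor_network(events)
```

The resulting weighted edge list is the kind of input on which a community detection method such as the Louvain algorithm operates.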
The network depicted in Figure 3 has 62 nodes (representing Italian censors) and 2,037 relationships. For 37% of the censors we observe discrepancies between the period of activity stated explicitly in the catalogue and the period inferred automatically from the dates of individual censorship events in the ontology. In 5% of the cases the dates of activity recorded in the catalogue were incorrect (e.g. where the recorded activity period spans 50 years or more). In addition, given the entities related to a Censorship Event, such as the censor's name and date, the manuscripts he censored, and their related dates and locations, we could infer the event's location. As a result, we obtained places of censorship for 53.9% of all the Censorship Events in the corpus. Eight of the inferred places did not appear in the original list of 12 places recorded in the catalogue.
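The comparison between recorded and inferred activity periods can be sketched as below. The data are hypothetical, and so are the two checks: a discrepancy is assumed to mean that the inferred period is not contained in the recorded one, and the 50-year threshold is borrowed from the example in the text rather than from a stated rule.

```python
# Hypothetical catalogue periods (start year, end year) and censorship event years.
catalogue_period = {
    "censor_1": (1590, 1600),
    "censor_2": (1550, 1620),
    "censor_3": (1600, 1605),
}
event_years = {
    "censor_1": [1595, 1597],
    "censor_2": [1595, 1597],
    "censor_3": [1610],
}

MAX_PLAUSIBLE_SPAN = 50  # years; assumed threshold based on the example in the text


def check_censor(censor, catalogue_period, event_years):
    """Compare the recorded activity period with the one inferred from event dates."""
    years = event_years[censor]
    inferred = (min(years), max(years))
    start, end = catalogue_period[censor]
    return {
        "inferred": inferred,
        # Assumption: a discrepancy means the inferred period lies (partly)
        # outside the period recorded in the catalogue.
        "discrepancy": not (start <= inferred[0] and inferred[1] <= end),
        "implausible_span": (end - start) >= MAX_PLAUSIBLE_SPAN,
    }


report = {c: check_censor(c, catalogue_period, event_years) for c in catalogue_period}
```

In this toy data `censor_3` is flagged because the only dated event falls after the recorded period, and `censor_2` is flagged because the recorded period spans 70 years.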
Grey links in Figure 3 show that the most influential censors censored manuscripts one after another. Such collaborations can probably be regarded as waves of censorship. Two timelines under the map display which locations were inferred and which were specified explicitly in the catalogue. Comparing these two timelines allows researchers to identify dating errors and suggests candidates for unknown locations.
To conclude, our preliminary findings show that ontology-based network analysis and community detection provide an effective tool for correcting incorrect historical data and completing missing data.
- Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
- Hric, D., Peixoto, T. P., & Fortunato, S. (2016). Network structure, metadata, and the prediction of missing nodes and annotations. Physical Review X, 6(3), 031038.
- Zhitomirsky-Geffet, M., & Prebor, G. (2016). Towards an ontopedia for historical Hebrew manuscripts. Frontiers in Digital Humanities, section of Digital Paleography and Book History, 3, 3. http://dx.doi.org/10.3389/fdigh.2016.00003.