Linked Books: Towards a collaborative citation index for the Arts and Humanities

Giovanni Colavizza (giovanni.colavizza@epfl.ch), École Polytechnique Fédérale de Lausanne, Digital Humanities Laboratory; The Alan Turing Institute; Odoma Sàrl and Matteo Romanello (matteo.romanello@epfl.ch), École Polytechnique Fédérale de Lausanne, Digital Humanities Laboratory; Odoma Sàrl and Martina Babetto (babetto.martina@gmail.com), Tate Britain and Vincent Barbay (vincent.barbay@gmail.com), École Polytechnique Fédérale de Lausanne, Digital Humanities Laboratory and Laurent Bolli (laurent.bolli@odoma.ch), École Polytechnique Fédérale de Lausanne, Digital Humanities Laboratory; Odoma Sàrl and Silvia Ferronato (silvia.isotta@gmail.com), École Polytechnique Fédérale de Lausanne, Digital Humanities Laboratory and Frédéric Kaplan (frederic.kaplan@epfl.ch), École Polytechnique Fédérale de Lausanne, Digital Humanities Laboratory

We present the Scholar Index: a platform to index the literature and primary sources of the arts and humanities through citations. These resources are becoming increasingly digital, thanks in part to digitization campaigns and a shift towards digital publishing. Nevertheless, the coverage of commercial citation indexes is still poor and mostly limited to publications in English (Mongeon and Paul-Hus, 2015). This situation results in an untapped opportunity, as the literature refers to a wealth of primary sources from institutions such as libraries, archives and museums (Knievel and Kellsey, 2005). As a consequence, a comprehensive indexing of its citations would constitute a unique opportunity to greatly enhance the search capacities of scholars and interconnect collections currently set apart.

The Scholar Index integrates a pipeline to extract citations from scholarly literature in the arts and humanities, along with two interfaces: a digital library (Scholar Library) and a citation index (Scholar Index proper). We present the prototype Venice Scholar, which covers the literature on the history of Venice and currently indexes nearly 3000 volumes of scholarship from the mid-19th century to 2013, from which some 4 million references were extracted. Full citation indexing allowed us, for the first time, to highlight trends in historians' large-scale use of archival evidence and scholarly literature over such a substantial span of time (Colavizza, 2017a, b). Finally, we argue that a collaborative approach to indexing the literature and primary sources of humanists is feasible and would greatly broaden and enrich access to the documentary cultural heritage at large.

Approach

The process of mining citations from digital or digitized scholarly publications entails a set of steps, as sketched below. First, a corpus needs to be selected and digitally acquired (including its full text via OCR). Second, the literature needs to be analysed in order to establish where references are located (typically in footnotes) and whether there are trends in reference style. These insights can inform the selection of publications to be manually annotated in order to improve the quality of the subsequent automated extraction.

Citation mining can be divided into two tasks: parsing and extraction of references – that is, the identification of text segments containing a reference to a source – and their disambiguation – that is, the association of a reference with the unique identifier of the referred source. Once this is done, a citation is represented as a relation between a citing publication and a cited source. During parsing, pre-trained text classifiers are used, possibly with adaptation to the domain at hand. During disambiguation, external repositories such as catalogues are optionally queried to establish interlinks. Lastly, citation data can be exposed for a variety of purposes, including search and browsing in a dedicated interface. For more details and evaluations, see Colavizza and Romanello (2017) and Colavizza et al. (2017).
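
The representation of a citation as a relation between a citing publication and a cited source can be illustrated with a minimal sketch. It is not the pipeline described in the cited papers: the identifiers, the catalogue entries and the naive title-matching rule are all assumptions made for illustration, whereas the actual system relies on trained classifiers and external repositories.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reference:
    """A text segment identified as a reference during parsing."""
    citing_id: str  # identifier of the publication containing the reference
    text: str       # raw reference string, e.g. taken from a footnote


@dataclass(frozen=True)
class Citation:
    """A relation between a citing publication and a cited source."""
    citing_id: str
    cited_id: str


def normalise(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace for naive matching."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def disambiguate(ref: Reference, catalogue: dict[str, str]) -> Optional[Citation]:
    """Link a reference to a catalogue identifier if a known title occurs in it.

    `catalogue` maps hypothetical source identifiers to titles. This substring
    test only stands in for the much more robust matching a real system needs.
    """
    haystack = normalise(ref.text)
    for source_id, title in catalogue.items():
        if normalise(title) in haystack:
            return Citation(ref.citing_id, source_id)
    return None


# Made-up catalogue and reference, for illustration only.
catalogue = {
    "src:001": "Example Survey of Lagoon Settlements",
    "src:002": "Example Monograph on Venetian Trade",
}
ref = Reference("pub:42", "Cf. A. Author, Example monograph on Venetian trade, 1900, p. 12.")
citation = disambiguate(ref, catalogue)
print(citation)
```

Keeping the raw reference separate from the resolved citation means that references which cannot be disambiguated are still retained and remain searchable as text.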

The Venice Scholar

This approach has been applied to create a prototype on the historiography on Venice. There is a vast amount of literature on Venice, even just considering modern historiography from the 19th century onwards (Dursteler, 2013). We selected the corpus using a variety of means available in research libraries (Colavizza et al., 2017). Once a first seed of literature had been digitized, we proceeded to further expand it with highly cited, usually old sources, as well as very recent ones. The corpus currently counts over 3000 volumes, circa 20% journal issues and 80% books. This effort has been made possible thanks to the support of several research libraries in Venice. 1 The resulting data has been published in open access: circa 40,000 annotated references used to train reference parsers (Colavizza and Romanello, 2017), while citation data from nearly 4 million extracted references is gradually being ingested into OpenCitations (Peroni et al., 2016), a repository of open citation data. 2

The platform

The digital library and the citation index are connected through citations. The interfaces are accessible online. 3 The digital library provides access to the digitized materials, and points to the index through disambiguated references, as shown below.

Viewer (above): allows the user to read a publication with image and text side by side. This is particularly important for assessing the quality of the OCR.

Text view (left): allows the user to search within the full text of a publication, highlighting all extracted references and their links to the corresponding entries in the index.

The citation index, by contrast, provides no access to full contents (for copyright reasons), but allows for the exploration of the network of citations.

Search results (left): citation data is aggregated per author, publication or primary source, with full-text access to the text of extracted references. Search results are conveyed along with their relevant citation information (citations made and received, publications for an author). Authors are linked to the Virtual International Authority File (VIAF) repository, whenever possible.

Citation timeline (left): every aggregated entity has a dedicated page with a timeline of citations (made and received), and a list of relevant sources.
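
A timeline of this kind amounts to counting, per year, the citations an entity makes and receives. The sketch below shows one way to compute it, assuming (hypothetically) that each citation record carries the year of the citing publication; the identifiers and records are invented.

```python
from collections import Counter

# Hypothetical citation records: (citing_id, cited_id, year of the citing publication).
citations = [
    ("pub:A", "pub:B", 1950),
    ("pub:C", "pub:B", 1987),
    ("pub:C", "pub:A", 1987),
    ("pub:B", "pub:D", 1921),
]


def timeline(entity_id, records):
    """Return per-year counts of citations made and received by one entity."""
    made = Counter(year for citing, _, year in records if citing == entity_id)
    received = Counter(year for _, cited, year in records if cited == entity_id)
    return {
        "made": dict(sorted(made.items())),
        "received": dict(sorted(received.items())),
    }


result = timeline("pub:B", citations)
print(result)  # {'made': {1921: 1}, 'received': {1950: 1, 1987: 1}}
```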

Citations to primary sources (left): the index also links to external collections of primary sources, in this case documentation at the Archive of Venice. Citations to any level of the archival hierarchy are provided, following its structure. The user can easily move from a publication to a document series and see all publications which referred to it, over time.
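
One plausible way to expose citations at every level of an archival hierarchy is to roll the counts up from the cited unit to all of its ancestors. The sketch below assumes a simple child-to-parent mapping and invented identifiers and counts; it does not describe the actual data model of the Archive's information system.

```python
from collections import Counter

# Hypothetical archival hierarchy: child identifier -> parent identifier.
parent = {
    "group/series-1": "group",
    "group/series-1/file-7": "group/series-1",
    "group/series-2": "group",
}

# Hypothetical direct citation counts, at whatever level a reference points to.
direct = {
    "group": 3,
    "group/series-1": 2,
    "group/series-1/file-7": 5,
    "group/series-2": 1,
}


def roll_up(direct_counts, parent_of):
    """Add each unit's direct citations to the unit itself and to all its ancestors."""
    totals = Counter()
    for unit, count in direct_counts.items():
        node = unit
        while node is not None:
            totals[node] += count
            node = parent_of.get(node)  # None once the top level is reached
    return dict(totals)


totals = roll_up(direct, parent)
print(totals["group"])  # 11
```

With counts aggregated in this way, a user looking at a record group sees every citation made to any series or document within it.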

The platform is thus able to aggregate citation data from many library collections into a single system, allowing users not only to navigate the resulting network, but also to have improved access to collections of primary sources such as archives.

Citation coverage of the Archive of Venice

The Venice Scholar made it possible, for the first time, to analyse the use made by historians of the vast Archive of Venice over almost 200 years of scholarship. The Archive hosts an estimated 80 linear km of documents and is the main reference for the history of the city. We extracted 157,575 citations to 600 distinct record groups (counting both record groups and smaller series of documents within them). Two patterns readily emerge: first, the use of documentary records is highly skewed, with a few record groups accruing most citations; second, the archival indexing of the records through metadata is key for their discoverability by historians.

The 600 cited record groups range from one with almost 10,000 citations received to many cited only once. The cumulative distribution of citations suggests the presence of a power law (above: the inset plot shows the log-log cumulative sum, which results in an almost straight line).
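
A rough version of this diagnostic is sketched below on synthetic data. It computes the complementary cumulative distribution (the fraction of record groups with at least k citations) and fits a straight line in log-log space. Both the Zipf-like counts and the choice of this particular cumulative view are assumptions; the plot in the original may have been built differently, and a least-squares fit on log-log axes is only an informal check, not a rigorous test for a power law.

```python
import math


def ccdf(counts):
    """Return (k, fraction of items with at least k citations) for each distinct k."""
    n = len(counts)
    ordered = sorted(counts)
    points = []
    for i, k in enumerate(ordered):
        if i == 0 or k != ordered[i - 1]:
            # All items from position i onwards have at least k citations.
            points.append((k, (n - i) / n))
    return points


def loglog_slope(points):
    """Ordinary least-squares slope of log(y) against log(x)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic, Zipf-like counts for 600 hypothetical record groups (not the real data).
counts = [max(1, round(10000 / rank)) for rank in range(1, 601)]
points = ccdf(counts)
slope = loglog_slope(points)
print(f"{len(points)} distinct values, log-log slope {slope:.2f}")
```

An almost straight line in this view, that is, a stable negative slope over several orders of magnitude, is what suggests heavy-tailed, power-law-like behaviour.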

The Archive possesses 14,233 record groups indexed in its information system, meaning that only 4.2% of these have been cited in our corpus. Of the indexed record groups, only 25.7% have information on their size in linear meters, which is usually missing for records not yet properly inventoried. Yet 78.5% of cited record groups have size information: a strong indication of the importance of archival indexing for access. A total of only 18.37 linear km is known to the information system, and of these 64.3% are cited at least once at the aggregated group level. In conclusion, between ~4% (by archival identifier) and ~15% (by size, (18.37 × 0.643)/80) of the record groups of the Archive of Venice have been cited in our corpus, with a few aggregates receiving most citations: we might conclude that there is still much to explore at the Archive of Venice.
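
The two coverage estimates can be reproduced from the figures quoted above. The variable names are ours; the only inputs are numbers stated in the text.

```python
cited_groups = 600            # distinct record groups cited in the corpus
indexed_groups = 14233        # record groups in the Archive's information system
known_km = 18.37              # linear km with size information in the system
cited_share_of_known = 0.643  # share of the known km cited at least once
estimated_total_km = 80       # estimated total extent of the Archive

coverage_by_id = cited_groups / indexed_groups
cited_km = known_km * cited_share_of_known
coverage_by_size = cited_km / estimated_total_km

print(f"{coverage_by_id:.1%}")    # 4.2%
print(f"{cited_km:.2f} km")       # 11.81 km
print(f"{coverage_by_size:.1%}")  # 14.8%
```

The size-based figure is a lower bound of sorts: it only counts cited material whose extent is recorded, measured against the estimated total of 80 km.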

Towards a global citation index for the Arts and Humanities

We believe that the approach used for Venice could be applied to many more collections of scholarly literature, from a variety of libraries. Much in the same way as national or international library catalogues are collaboratively created, every library that is part of the system could take responsibility for an area of scholarship of interest to it. This would entail the library being in charge of digitization. Once this is done, the platform would proceed to OCR the materials and mine citations, with a view to federating them into a single citation index (left). The library could also be responsible for the quality of the citation data thus provided, by running regular evaluation and correction campaigns according to its resources. A daunting volume of work would thus be divided into more manageable chunks, and possibly distributed among several players.

The expansion of digital informational ecosystems promises to greatly impact the work of humanists. Besides providing for more rapid access, digital indexing might also make information retrieval a richer experience. Towards this end, we proposed the Scholar Index: an approach to use citations contained in the scholarly literature to index and interlink collections of primary and secondary sources. We believe that our approach has the merit of being able to scale, by catalysing the joint efforts of knowledge institutions towards a common goal, as was shown for the Venice Scholar prototype.


Appendix A

Bibliography
  1. Colavizza, G. (2017a). “The Core Literature of the Historians of Venice.” Frontiers in Digital Humanities.
  2. Colavizza, G. (2017b). “The Structural Role of the Core Literature in History.” Scientometrics.
  3. Colavizza, G., Romanello, M. and Kaplan, F. (2017). “The references of references: a method to enrich humanities library catalogs with citation data.” International Journal on Digital Libraries.
  4. Colavizza, G. and Romanello, M. (2017). “Annotated references in the historiography on Venice: 19th–21st centuries.” Journal of Open Humanities Data.
  5. Dursteler, E. R. (2013). “A brief survey of histories of Venice.” In: A Companion to Venetian History, 1400–1797, edited by Eric R. Dursteler. Leiden: Brill.
  6. Knievel, J. E. and Kellsey, C. (2005). “Citation analysis for collection development: a comparative study of eight humanities fields.” Library Quarterly.
  7. Mongeon, P. and Paul-Hus, A. (2015). “The journal coverage of Web of Science and Scopus: a comparative analysis.” Scientometrics.
  8. Peroni, S., Shotton, D. and Vitali, F. (2016). “Freedom for bibliographic references: OpenCitations arise.” In Proceedings of 2016 International Workshop on Linked Data for Information Extraction.
Notes
  1. For the list of partners see: https://dhlab.epfl.ch/page-127959-en.html
  3. The (Venice) Scholar Index can be accessed at www.venicescholar.eu, and the (Venice) Scholar Library at www.venicescholar.eu/library. Try searching for the historian “Patricia Fortini Brown”, for example. The project’s website is at www.scholarindex.eu.