Using Linked Open Data To Enrich Concept Searching In Large Text Corpora

Christine Fernsebner Eslao, Harvard Library, United States of America, and Stephen Osadetz, Harvard University, United States of America

This poster presents the library metadata aspects of a web-based text mining application that sifts corpora of unstructured text to find passages dealing with a concept of interest. Vendor-supplied search platforms tend to rely on simple keyword searches, which leave the laborious work of interpreting, refining, and iterating on search results to scholarly users (De Bolla, 2013). In addition to overcoming these limitations, the tool demonstrates the utility of reconciling named entities with external structured data: demographic (Hwang, 2015), temporal, and geographic data from the linked open data cloud refine its results and enrich its output for use in research, visualizations, and secondary analytic tools. This necessitates entity resolution workflows that combine automated matching tools with practices for manual reconciliation and maintenance. We explore a variety of open-source tools, including OpenRefine (Van Hooland, 2014; Hwang, 2017), Python, and Mix’n’Match (Knoblock, 2017). The work also contributes to the development, by metadata librarians and researchers, of “functional requirements for how [library] systems use and maintain these identifiers and associated data” (Folsom, 2017), and engages with “the complexities inherent in managing both locally-created and externally-assigned identifiers” in the context of library infrastructure (Tarver, 2017). Our goal is to integrate a tool catering to advanced researchers into library discovery platforms by “[exploring] partnerships with external entities to create game changing discovery” (Wones, 2017) and by leveraging those users’ domain expertise to “interrogate corpora of resources directly ... to discover new patterns that exist across the literature, perform their own ranking of relevance against particular parameters, and find new pathways for discovery more efficiently than could be enabled through existing information portals” (MIT Libraries, 2016).
The process is as follows:

1. Combine vendor metadata for large corpora with bibliographic metadata from Harvard Library collections

2. Reconcile authors (both persons and organizations) in those metadata resources with external URIs, including those of ISNI (International Standard Name Identifier), Wikidata, and GeoNames entities, generating batches of new entities in external resources at scale as needed (Mika, 2017)

3. Integrate data from external URIs into a text mining tool for sifting large corpora to drive filters and enrich data extracted from that tool

4. Work with library technology staff and metadata librarians to facilitate retrieval of rare materials in Harvard Library collections, as well as their electronic reproductions, based on the results of the text mining tool and the integration of URIs in library metadata

5. Export resulting data to produce visualizations and secondary analytic tools
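
Steps 1 through 3 can be illustrated with a minimal Python sketch. It is not the project's implementation: the record keys, field names, example.org URIs, sample names, and score thresholds are all hypothetical, and a small in-memory dictionary stands in for external sources such as Wikidata or ISNI. It only shows the general shape of a workflow in which confident matches are accepted automatically and uncertain ones are queued for manual reconciliation.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical stand-ins for external entities (e.g. Wikidata or ISNI records).
CANDIDATES = {
    "http://example.org/entity/1": {"label": "Jane Smith", "birth_year": 1701},
    "http://example.org/entity/2": {"label": "John Doe", "birth_year": 1650},
}


def normalize(name):
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(sorted(cleaned.split()))


def merge_records(vendor, library):
    """Step 1: combine vendor and library metadata on a shared record key."""
    merged = {}
    for key in vendor.keys() | library.keys():
        # Assumption: library values win when both sources supply a field.
        merged[key] = {**vendor.get(key, {}), **library.get(key, {})}
    return merged


def reconcile(name, candidates, accept=0.9, review=0.7):
    """Step 2: match a name to an external entity or flag it for review."""
    target = normalize(name)
    best_uri, best_score = None, 0.0
    for uri, entity in candidates.items():
        score = SequenceMatcher(None, target, normalize(entity["label"])).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    if best_score >= accept:
        return {"status": "matched", "uri": best_uri}
    if best_score >= review:
        # Plausible but uncertain: leave the decision to a metadata librarian.
        return {"status": "review", "uri": best_uri}
    return {"status": "unmatched", "uri": None}


def enrich(records, candidates):
    """Step 3: attach external data to records whose author was matched."""
    for record in records.values():
        result = reconcile(record.get("author", ""), candidates)
        record["reconciliation"] = result["status"]
        if result["status"] == "matched":
            record["author_uri"] = result["uri"]
            record["author_data"] = candidates[result["uri"]]
    return records


# Hypothetical sample metadata from the two sources.
vendor = {
    "rec1": {"title": "An Essay", "author": "Smith, Jane"},
    "rec2": {"title": "A Treatise", "author": "Smith, J."},
}
library = {
    "rec1": {"call_number": "X 1.2"},
    "rec3": {"title": "Proceedings", "author": "Society of Example Printers"},
}

records = enrich(merge_records(vendor, library), CANDIDATES)
for key in sorted(records):
    print(key, records[key]["reconciliation"], records[key].get("author_uri"))
```

Sorting the name tokens lets inverted catalog forms ("Smith, Jane") line up with direct-order labels ("Jane Smith"), at the cost of conflating names that differ only in word order. In practice the candidate lookup would be delegated to a reconciliation tool such as OpenRefine rather than to a local dictionary, and the thresholds would need tuning against manually verified matches.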

Through this process, we hope to enable the serendipitous discovery (Bourg, 2017) of relevant but unknown works in library collections: traditional reading of the “great unread” (Cohen, 1999) facilitated by distant reading (Moretti, 2013). Our poster includes an explanation of the linked data principles underlying the metadata aspects of the text mining tool; our entity reconciliation workflow; implications for library metadata and name authority practices in support of digital research projects; an example of combined and enriched metadata for a work of eighteenth-century literature; and an example of an iterative concept search and its output, presented both as a static flowchart on the poster and as an interactive prototype on a laptop.

Appendix A

  1. Bourg, Chris. (2017). Serendipity as prick (accessed 18 November 2017).
  2. Cohen, Margaret. (1999). The Sentimental Education of the Novel. Princeton, N.J.: Princeton University Press.
  3. De Bolla, Peter. (2013). The architecture of concepts: The historical formation of human rights (First ed.). New York: Fordham University Press.
  4. Folsom, Steven. (2017). New Models Require New Action Plans: Implementing Linked Data within the PCC. PCC (Project for Cooperative Cataloging) Strategic Planning Meeting Keynote, 1 November 2017. (accessed 27 November 2017).
  5. Hwang, Karen. (2015). Enriching the Linked Jazz Name List with Gender Information (accessed 1 November 2017).
  6. Hwang, Karen. (2017). Using OpenRefine to Reconcile Name Entities (accessed 10 October 2017).
  7. Knoblock, C.A., et al. (2017). Lessons Learned in Building Linked Data for the American Art Collaborative. In: d'Amato, C., et al. (eds) The Semantic Web – ISWC 2017: 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part II (Lecture Notes in Computer Science, 10588). Cham: Springer International Publishing.
  8. Mika, Katie. (2016). The Role of Librarians in Wikidata and Wikicite. (accessed 1 November 2017).
  9. MIT Libraries, Ad Hoc Task Force on the Future of Libraries. (2016). Institute-Wide Task Force on the Future of Libraries—Preliminary Report (accessed 25 November 2017).
  10. Moretti, Franco. (2013). Distant Reading. London: Verso.
  11. Tarver, Hannah, & Phillips, Mark. (2017). Identifier Usage and Maintenance in the UNT Libraries’ Digital Collections (accessed 27 November 2017).
  12. Van Hooland, Seth, & Verborgh, Ruben. (2014). Linked data for libraries: How to clean, link and publish your metadata. Chicago, IL: Neal-Schuman.
  13. Wones, Suzanne. (2017). Harvard Library Digital Strategy, Version 1.0. (accessed 10 November 2017).