Challenges in Enabling Mixed Media Scholarly Research with Multi-media Data in a Sustainable Infrastructure

Roeland Ordelman (rordelman@beeldengeluid.nl), University of Twente, Netherlands Institute for Sound and Vision and Carlos Martínez Ortíz (c.martinez@esciencecenter.nl), Netherlands Escience Center and Liliana Melgar Estrada (melgar@uva.nl), University of Amsterdam, Netherlands Institute for Sound and Vision and Marijn Koolen (marijn.koolen@huygens.knaw.nl), Huygens Institute for Netherlands History and Jaap Blom (jblom@beeldengeluid.nl), Netherlands Institute for Sound and Vision and Willem Melder (wmelder@beeldengeluid.nl), Netherlands Institute for Sound and Vision and Jasmijn Van Gorp (j.vangorp@uu.nl), Utrecht University and Victor De Boer (v.de.boer@vu.nl), Vrije Universiteit Amsterdam and Themistoklis Karavellas (tkaravellas@beeldengeluid.nl), Netherlands Institute for Sound and Vision and Lora Aroyo (lora.aroyo@vu.nl), Vrije Universiteit Amsterdam and Thomas Poell (t.poell@uva.nl), University of Amsterdam and Norah Karrouche (karrouche@eshcc.eur.nl), Erasmus University Rotterdam and Eva Baaren (ebaaren@beeldengeluid.nl), Netherlands Institute for Sound and Vision and Johannes Wassenaar (jwassenaar@beeldengeluid.nl), Netherlands Institute for Sound and Vision and Julia Noordegraaf (J.J.Noordegraaf@uva.nl), University of Amsterdam and Oana Inel (oana.inel@vu.nl), Vrije Universiteit Amsterdam

Providing scholarly access to large collections that are distributed across various content providers, and to create user-friendly applications to work with that data in a diversity of scholarly projects is the underlying goal of the development of the Common Lab Research Infrastructure for the Arts and Humanities (CLARIAH [1]).

Big-scale infrastructure projects in the humanities and social sciences such as the Digital Research Infrastructure for the Arts and Humanities (DARIAH) (Edmond et al., 2017), or the Common Language Resources and Technology Infrastructure (CLARIN) (Hinrichs and Krauwer, 2014) aim to provide solutions for both preservation and access to collections and data necessary for scholarly research (Zundert, 2012). Some infrastructure projects build decentralized “atomic” software services, e.g., as in the LLS infrastructure project (Buchler et al., 2016), while others prefer to build more centralized virtual research environments, as in the European Holocaust Research Infrastructure (EHRI) (Lauer, 2014). Also, even within a single infrastructure project, these two models can coexist. This is the case of the CLARIAH infrastructure, where different approaches have been taken to date for serving different user groups, i.e., several specialized tools for linguists (Odijk, Broeder & Barbiers, 2015), or a research environment (the Media Suite) that serves the scholarly needs for working with audiovisual data collections and related mixed-media contextual sources that are maintained at cultural heritage and knowledge institutions. This paper discusses the rationale and challenges behind the development of the Media Suite.

1. The CLARIAH Media Suite

Whereas in some domains of scholarly research the focus is on the creation of private data collections, in other domains scholarly research focuses on already established data collections maintained at heritage and knowledge institutions. Access to and use of these latter collections is often restricted, especially when they concern audiovisual media, due to intellectual property rights (IPR) or privacy issues (e.g., with respect to recorded interviews). Therefore, many scholars end up using collections that are openly available. Or, they spend considerable amounts of time in doing on-site visits to archives where data are available for consultation. Data collections at these institutes are “locked,” scattered, or at least hard to use for scholarly research.

To open up these collections for research and let scholars take advantage of the sheer quantity and richness of these data sets, we developed the Media Suite (Figure 1) [2], a research environment or high-level tool that works as a data aggregator where the metadata and the media content can be explored, browsed, analyzed, stored in personal collections, annotated manually, enriched automatically, and visualized, and where the user annotations can be exported.

Figure 1. The CLARIAH Media Suite’s homepage ( http://mediasuite.clariah.nl/ , version 2)

The ultimate goal of the Media Suite is to: (i) enable distant reading (Schulz, 2011), that is, identifying patterns or new research questions in all aggregated collections; (ii) facilitate close reading: the detailed examination of individual items (e.g., videos) in a collection or parts of these items (e.g., video segments) during search and scholarly interpretation, and (iii) make sure that the “scholarly primitives” (Unsworth, 2000, also described as an infrastructure framework in Blanke and Hedges, 2013) are well supported.

Even though these are accepted scholarly approaches that should be taken into account by infrastructure projects in the humanities nowadays, the question is: How to facilitate “close reading” when the media objects cannot be accessed because of copyright issues? How to enable “distant reading” when the content cannot be fully automatically processed or when their metadata is diverse and incomplete? How to cater to the needs of scholars with specific research questions and methods in the context of an infrastructure that has to be generic enough to be feasible?

2. Challenges and solutions

The approach of the CLARIAH Media Suite to tackle these challenges is: (i) to organise and implement a federated authentication mechanism to overcome access barriers (Figure 2, number 5), and (ii) to provide mechanisms that enable researchers to work with tools and aggregated data within the infrastructure. We refer to this approach as to “bringing the tools to the data”, as opposed to “bringing the data to the tools”.

Figure 2. The building blocks of the CLARIAH Media Suite

Figure 2 shows the main elements that constitute the Media Suite research environment. We explain below to which challenge is each element an answer to:

2.1. Data Sources -- Data Governance

Institutional collection maintainers have internal data governance processes to ensure that data assets are formally managed. One important aspect covered by governance processes is licensing: who has permission to access the data. However, data governance with respect to external processes --loosely defined as being part of an 'infrastructure'-- is typically not accounted for. This means that key data governance areas such as availability (e.g., metadata can be harvested), usability (e.g., source data can be viewed), integrity (e.g., protocols are in place to handle duplication and enrichment) and security (e.g., provenance information is maintained), need to be (re)organized or (re)considered, formalized and supported by the Media Suite and the emerging infrastructure in which it is embedded.

2.2. APIs -- Sustainable development

A digital infrastructure should use existing protocols, conventions, and standards. Besides obtaining data by harvesting using the OAI-PMH protocol, or using APIs, the functionalities have been organised in a modular approach, which includes (Matínez et al., 2017):

●Components: which use the API’s to perform specific tasks.

●Tools: which incorporate a number of components in a tool.

Moreover, all components and tools developed in the project are open source. In addition, the Media Suite offers public APIs, which provide mechanisms for software programmers to create functionalities using API building blocks or components. We offer a Collection API, a Search API, and an Annotation API, which provides functionality for adding data annotating existing data, using the W3C Web Annotation data model (Sanderson et al., 2017).

2.3. Components/Tools -- User friendly interaction design

Developing new tools “from scratch” would be a very inefficient (and costly!) endeavour. The digital infrastructure should provide tools that are suitable both for common scholarly tasks, and for specific tasks required by each discipline. However, the digital humanities community incorporates a wide diversity of scholars with different research questions, methods, and levels of expertise in working with information processing techniques and technologies. As every infrastructure, we also have to tackle “the generalization paradox” (Zundert, 2012). We address this challenge by (i) focussing on the similarities in research methods from different disciplines (e.g., De Jong, Ordelman, Scagliola, 2011; Melgar et al., 2017), (ii) analyzing tools that support qualitative methods (Melgar & Koolen, 2018), and (ii) working with scholars as co-developers in the process [3]. The resulting functionalities are built in a modular (lego) approach that supports both flexible software development of components and user friendly interaction with assembled tools. A current challenge is to provide entity-based browsing (Verhoeven and Burrows, 2017) of both linked open data collections (RDF) and tabular data via an exploratory browser (see De Boer et al., 2017).

2.4. Enrichment services and Work Space -- Working with audio-visual content and private data

In addition to IPR and privacy restrictions, access to the audiovisual content in the Media Suite is also limited due to its nature; consisting of pixels (video) and samples (audio) and some manually generated metadata or subtitles (text). Typically, scholars want to search audiovisual data using (key)words that may be ‘hidden’ (encoded) in the pixels or the samples. This is called the semantic gap (Smeulders, 2000) that needs to be “bridged” by decoding the information in the pixels and the samples to semantic representations, e.g., a verbatim transcription of the speech or labels of visual concepts in the video (a car, a face, the Eiffel Tower), that can be matched with the keywords from the scholars. These semantic representations can be generated manually or, especially when data collections are large, automatically using automatic speech recognition (ASR) or computer vision technology.

The generation of semantic representations is addressed in different ways. On the one hand, we are currently developing an ASR service that resides within the CLARIAH infrastructure that can handle requests from the infrastructure itself (e.g., to process a data set that exists within the infrastructure), but also request from individual scholars that want to process their private collections. On the other hand, supporting manual annotation is key for interpretation in scholarly contexts (Melgar et al., 2017). The Media Suite aims to support the generation of both ways of semantic representations in complementary ways via information workflows centered around a “Work Space” (see Figure 3) which stores private session data and enables collaboration .

Figure 3. The CLARIAH Media Suite’s Workspace

3. Conclusions and Future work

The paper describes the challenges found in building a sustainable, dynamic, multi-institutional infrastructure that can properly serve media scholars and digital humanists in general. We choose the approach of building a research environment that adheres to infrastructural requirements while at the same time being flexible and user-friendly. In order to ensure its used and further development after the project's lifetime, we need to carefully align the requirements of scholars with the context of the ecosystem the Media Suite needs to live in: an ICT infrastructure hosted and maintained by multiple institutions that in turn, adheres to a diverse set of institutional requirements with respect to, for instance, data access permissions and software development and maintenance. The Media Suite is currently functional and used by scholars doing actual research projects and will be developed further, e.g., by incorporating additional data sources (e.g., social media data), increasing metadata granularity (e.g., adding computer vision or emotion recognition), adding advanced annotation tools, and supporting missing data visualization (data critique) for heterogeneous datasets.

[1] https://clariah.nl/

[2] The first of a four release versions was introduced in April 4, 2017.

[3] Indeed, an adopted strategy at the CLARIAH project level, has been to offer grants to scholars to conduct small scale research pilot projects using the CLARIAH infrastructure. In the media studies focus that we describe in this paper, almost ten scholars participate as co-developers. We follow an Agile methodology for implementation, which despite criticisms has proved to be useful for this type of projects (van Zundert, 2012)


Appendix A

Bibliography
  1. Blanke, T., & Hedges, M. (2013). Scholarly primitives: Building institutional infrastructure for humanities e-Science. Future Generation Computer Systems, 29(2), 654–661.
  2. Buchler, M., Franzini, G., Franzini, E., & Eckart, T. (2016). Mining and analysing one billion requests to linguistic services (pp. 3230–3239). Presented at the Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016.
  3. De Boer, V., Melgar Estrada, L., Inel, O., Martínez Ortiz, C., Aroyo, L., & Oomen, J. (2017). Enriching Media Collections for Event-based Exploration. Presented at the 11th International Conference on Metadata and Semantics Research, Tallinn, Estonia.
  4. De Jong, F., Ordelman, R., & Scagliola, S. (2011). Audio-visual Collections and the User Needs of Scholars in the Humanities: a Case for Co-Development. Presented at the 2nd Conference on Supporting Digital Humanities (SDH 2011), Copenhagen, Denmark.
  5. Edmond, J., Fischer, F., Mertens, M., & Romary, L. (2017). The DARIAH ERIC: Redefining Research Infrastructure for the Arts and Humanities in the Digital Age. ERCIM News, (111). Retrieved from https://hal.inria.fr/hal-01588665/document
  6. Krauwer, S., & Hinrichs, E. (2014, May). The CLARIN Research Infrastructure: Resources and Tools for e-Humanities Scholars. LREC-2014, pp. 1525 - 1531.
  7. Lauer, G. (2014). Challenges for the Humanities: Digital Infrastructures. In A. Duşa, D. Nelle, G. Stock, & G. G. Wagner (Eds.), Facing the Future : European Research Infrastructures for the Humanities and Social Sciences. Berlin: Scivero Verlag.
  8. Martínez Ortiz, C., Ordelman, R., Koolen, M., Noordegraaf, J., Melgar Estrada, L., Aroyo, L., … Poell, T. (2017). From Tools to “Recipes”: Building a Media Suite within the Dutch Digital Humanities Infrastructure CLARIAH. Presented at the Digital Humanities Benelux, Utrecht.
  9. Melgar Estrada, L., & Koolen, M. (2017). Audiovisual media annotation using Qualitative Data Analysis Software: a comparative analysis. The Qualitative Report.
  10. Melgar Estrada, L., Koolen, M., Huurdeman, H., & Blom, J. (2017). A process model of time-based media annotation in a scholarly context. In CHIIR 2017: ACM SIGIR Conference on Human Information Interaction and Retrieval. Oslo.
  11. Odijk, J., Broeder, D., & Barbiers, S. (2015). CLARIAH Linguistics Plan (v0.95). Retrieved from https://clariah.nl/werkpakketten/focusgebieden/taalkunde
  12. Sanderson, R., Ciccarese, P., & Young, B. (Eds.). (2017). Web Annotation Data Model: W3C Recommendation 23 February 2017. W3C.
  13. Smeulders, A. W. M., Worring,  M. , Santini, S. ,Gupta A. , and Jain, R. (2000), Content-based image retrieval at the end of the early years. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349-1380, Dec 2000.
  14. Unsworth, J. (2000). Scholarly primitives: What methods do humanities researchers have in common, and how might our tools reflect this? Presented at the Symposium on “Humanities Computing: formal methods, experimental practice,” London: King’s College.
  15. Van Zundert, J. (2012). If You Build It, Will We Come? Large Scale Digital Infrastructures as a Dead End for Digital Humanities. Historical Social Research / Historische Sozialforschung, 37(3 (141)), 165–186.
  16. Verhoeven, D., & Burrows, T. (2015). Aggregating Cultural Heritage Data for Research Use: the Humanities Networked Infrastructure (HuNI). In Metadata and Semantics Research (pp. 417–423). Springer, Cham.