Computer Vision in DH

Lauren Tilton (ltilton@richmond.edu), University of Richmond, United States of America and Taylor Arnold (tarnold2@richmond.edu), University of Richmond, United States of America and Thomas Smits (t.smits@let.ru.nl), Radboud University, The Netherlands and Melvin Wevers (melvinwevers@gmail.com), Digital Humanities Group, KNAW Humanities Cluster, Amsterdam, the Netherlands and Mark Williams (Mark.J.Williams@dartmouth.edu), Dartmouth College, United States of America and Lorenzo Torresani (LT@dartmouth.edu), Dartmouth College, United States of America and Maksim Bolonkin (mbolonkin@cs.dartmouth.edu), Dartmouth College, United States of America and John Bell (john.p.bell@dartmouth.edu), Dartmouth College, United States of America and Dimitrios Latsis (dlatsis@ryerson.ca), Ryerson University, Canada

1. Overview

Visual culture is often overlooked as an object of study within the digital humanities (DH). Yet visual culture is central to fields such as art history, cultural studies, history, and media studies. Cultural forms such as drawing, film, video, painting, and photography continue to shape culture, politics, and society. As the field begins to turn its attention to other forms of media, new methods and tools are necessary to study visual culture at scale. Recent advances in computer vision are proving to be a promising direction. The panel "Computer Vision in the Digital Humanities" will present three approaches to using computer vision in DH.

"Distant Viewing: Analyzing Moving Images at Scale" argues that computer vision is a powerful tool for the distant viewing of time-based media. The authors will outline their method and then describe the Distant Viewing Toolkit, a set of machine learning and computer vision algorithms for analyzing features such as color, shot and scene breaks, and objects such as faces. They will then turn to a case study of American television from the Network Era (1952-1984) to show how their method reveals the representational politics of gender during the era.

“Seeing History: Analyzing Large-Scale Historical Visual Datasets Using Deep Neural Networks” will then focus on how convolutional neural networks (CNNs) can be used for historical research. The authors present two case studies based on two major Dutch national newspapers. In the first study they used CNNs to identify over 400,000 advertisements from 1945 to 1995, and in the second over 110,000 photographs and drawings from 1860 to 1920. They will then explain the two tools they developed to support visual and textual search in the new corpora.

Finally, "Unlocking Film Libraries for Discovery and Search" describes how the Media Ecology Project and the Visual Learning Group are creating software that allows people to search untagged films and videos in the same way that they search the text of a document. The tool takes search queries expressed in textual form and automatically translates them into image recognition models that can identify the desired segments in the film. The image recognition results can be cached for quick searching. Initial prototype results, funded by the Knight Foundation, focused on educational films at the Dartmouth Library and the Internet Archive that are common to many libraries and archives.

All of the papers will address the need for open-access historical humanities datasets to advance computer vision as a DH technique.

2. Distant Viewing: Analyzing Moving Images at Scale

Digital humanities' (DH) focus on text and related methodologies such as distant reading and macroanalysis has produced exciting interventions (Jockers 2013; Moretti 2013).  However, there is an increasing call to take seriously visual culture and moving images as objects of study in digital humanities (Posner 2013; Acland and Hoyt 2016; Manovich 2016; ADHO AVinDH Special Interest Group). In this paper, we will discuss how we are using computer vision and machine learning to distant view moving images.

The paper will begin by outlining our method of distant viewing and then turn to our Distant Viewing Toolkit. Using and developing machine learning algorithms, the toolkit analyzes the following features: (1) the dominant colors and lighting over each shot; (2) time codes for shot and scene breaks; (3) bounding boxes for faces and other common objects; (4) consistent identifiers and descriptors for scenes, faces, and objects over time; (5) time codes and descriptions of diegetic and non-diegetic sound; and (6) a transcript of the spoken dialogue (see Figures 1 and 2 for examples of these analyses). These features serve as building blocks for the analysis of moving images in the same way that words are the foundation for text analysis. From these extracted elements, higher-level features such as camera movement, framing, blocking, and narrative style can then be derived and analyzed. These techniques allow scholars to see content and style within and across moving images such as films, news broadcasts, and television episodes, revealing how moving images shape cultural norms.
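
As an illustration of feature (2), the following is a minimal sketch of one common way to locate shot breaks: comparing the brightness histograms of consecutive frames. It is not taken from the Distant Viewing Toolkit, whose algorithms are not specified here. The function names, the threshold of 0.5, and the tiny synthetic "frames" (flat lists of 0-255 grayscale values) are all assumptions made for the example.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Return a normalized histogram of 0-255 grayscale pixel values.

    Assumes the frame contains at least one pixel.
    """
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [count / total for count in counts]


def detect_shot_breaks(
    frames: Sequence[Sequence[int]], threshold: float = 0.5, bins: int = 8
) -> List[int]:
    """Return the indices of frames that appear to start a new shot."""
    breaks: List[int] = []
    previous = None
    for index, frame in enumerate(frames):
        current = gray_histogram(frame, bins)
        if previous is not None:
            # Half the L1 distance between normalized histograms lies in [0, 1].
            distance = 0.5 * sum(abs(a - b) for a, b in zip(previous, current))
            if distance > threshold:
                breaks.append(index)
        previous = current
    return breaks


# Hypothetical four-pixel frames: two dark, two bright, then dark again.
dark = [10, 20, 30, 25]
bright = [200, 220, 240, 250]
frames = [dark, dark, bright, bright, dark]
print(detect_shot_breaks(frames))  # [2, 4]
```

Real footage would call for decoded video frames, color histograms, and a threshold tuned to the material; gradual transitions such as fades typically need a different test than this hard-cut check.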

To illustrate this approach, we have applied our Distant Viewing Toolkit to a collection of series from the Network Era (1952-1984) of American television. The Network Era is often considered formulaic and uninteresting from a formal perspective, despite its strong influence on U.S. culture (Spigel 1992). Using computational methods, our analysis challenges this characterization by showing how the formal elements of the sitcoms serve to reflect, establish, and challenge cultural norms. In particular, we will focus on the representational politics of gender during the Network Era. For examples of how we are distant viewing TV, please see distanttv.org.

Figure 1. Shots detected by the Distant Viewing Toolkit on an episode of I Dream of Jeannie.

Figure 2. Most similar faces to the reference character, upper left-most frame, from detected faces in a season of episodes from the series I Dream of Jeannie.

3. Seeing History: Analyzing Large-Scale Historical Visual Datasets Using Deep Neural Networks

Scholars are increasingly applying computational methods to analyze the visual aspects of large-scale digitized visual datasets (Ordelman et al., 2014). Inspiring examples include the work of Seguin et al. (2017) on visual pattern discovery in large databases of paintings and Moretti and Impett’s (2017) large-scale analysis of body postures in Aby Warburg’s Atlas Mnemosyne. In our paper, we will present two datasets of historical images and accompanying texts harvested from digitized Dutch newspapers and reflect on ways to improve existing neural networks for historical research. Through two case studies, we will discuss how neural networks can make large historical visual datasets usable for historical research, and we will end by arguing for the need for a benchmarked dataset of historical visual material.

The datasets were produced during two researcher-in-residence projects at the National Library of the Netherlands. The first set consists of more than 400,000 advertisements published in two major national newspapers between 1945 and 1995. Using the penultimate layer of a Convolutional Neural Network (CNN), a vector of 2,048 visual features was extracted for each image; these vectors can be used to group similar images together (Seguin et al., 2017). The second dataset includes about 110,000 classified images from newspapers published between 1860 and 1920. The images were classified using a pipeline of three classifiers. The first detects images with faces (Geitgey, 2017), the second sorts images into eight categories (buildings, cartoons, chess problems, crowds, logos, schematics, sheet music, and weather reports), and the last labels images as either photographs or drawings (Donahue et al., 2013).
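
To make the grouping step concrete, here is a minimal sketch of how feature vectors taken from a CNN's penultimate layer can be compared. It assumes cosine similarity as the measure, which the text does not specify, and it uses three-dimensional toy vectors in place of the 2,048 values described above. The image identifiers and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query_id: str, features: Dict[str, Sequence[float]], k: int = 2
) -> List[str]:
    """Return the ids of the k images whose vectors are closest to the query image."""
    query = features[query_id]
    scored = [
        (cosine_similarity(query, vector), image_id)
        for image_id, vector in features.items()
        if image_id != query_id
    ]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored[:k]]


# Toy stand-ins for CNN feature vectors; real vectors would have 2,048 entries.
features = {
    "ad_001": [0.9, 0.1, 0.0],
    "ad_002": [0.8, 0.2, 0.1],
    "ad_003": [0.0, 0.1, 0.9],
}
print(most_similar("ad_001", features, k=1))  # ['ad_002']
```

A linear scan like this is fine for a demonstration; for a collection of several hundred thousand images, an approximate nearest-neighbor index would usually replace it.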

We developed two tools to query these datasets. The first tool offers exploratory search in the advertisement dataset, which enables users to find images sharing a degree of visual similarity and can be used to detect visual trends in large visual datasets. The second enables users to find images in the second set by searching for specific visual subjects (or combinations of them) together with keywords, for example images of ‘buildings’ with ‘faces’ and the keyword ‘protest’ in the accompanying text.
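
The combined query in the second tool can be sketched as a filter over records that carry both classifier labels and accompanying text. This is an assumed data model for illustration only: the `ImageRecord` class, its fields, and the sample records are hypothetical and do not describe the tool's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ImageRecord:
    image_id: str
    labels: Set[str]  # visual subjects assigned by the classifiers
    text: str         # text accompanying the image in the newspaper


def search(
    records: List[ImageRecord], required_labels: Set[str], keyword: str
) -> List[str]:
    """Return ids of images that carry every required label and mention the keyword."""
    needle = keyword.lower()
    return [
        record.image_id
        for record in records
        if required_labels <= record.labels and needle in record.text.lower()
    ]


# Hypothetical records, invented for this example.
records = [
    ImageRecord("img_1", {"buildings", "faces"}, "Protest in front of the town hall"),
    ImageRecord("img_2", {"buildings"}, "New town hall opened"),
    ImageRecord("img_3", {"buildings", "faces"}, "Royal visit to the harbour"),
]
print(search(records, {"buildings", "faces"}, "protest"))  # ['img_1']
```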

Finally, our paper discusses several challenges and possibilities of computer vision techniques for historical research. Most CNNs are trained on contemporary materials such as ImageNet. As a result, these networks perform well in recognizing the categories of the ImageNet challenge. However, the fact that they were trained on contemporary data can cause problems when working with historical images. For example, detecting bicycles works relatively well because the design of the bicycle has remained more or less the same over the last century, while trains are much more difficult because they have changed significantly over the years. Models trained on ImageNet also have difficulty detecting objects in illustrations, which are often used in newspapers; these are regularly assigned to the uninformative category ‘cartoon.’ In short, we will discuss how to improve these models and argue for the development and benchmarking of datasets of historical visual material.

4. Unlocking Film Libraries for Discovery and Search

Where the library of the 20th century focused on texts, the 21st century library will be a rich mix of media, fully accessible to library patrons in digital form. Yet the tools that allow people to easily search film and video in the same way that they can search through the full text of a document are still beyond the reach of most libraries. How can we make the rich troves of film/video housed in thousands of libraries searchable and discoverable for the next generation?

Dartmouth College’s Media Ecology Project, led by Prof. Mark Williams and architect John Bell, and the Visual Learning Group, led by Prof. Lorenzo Torresani, are applying computer vision and machine learning tools to a rich collection of films held by Dartmouth Library and the Internet Archive. Using existing algorithms, we describe what is happening and translate the resulting tags into open linked data using our Semantic Annotation Tool (SAT). SAT provides an easy-to-use and accessible interface for playing back time-based annotations (built upon W3C web annotation standards) in a web browser, allowing simple collection development that can be integrated with discovery and search in an exhibition. What was once a roll of film, indexed only by its card catalog description, will become searchable scene-by-scene, adding immense value for library patrons, scholars and the visually impaired.
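
For readers unfamiliar with the W3C Web Annotation Data Model, the sketch below builds one time-based annotation as JSON-LD. It follows the public W3C model, with a `FragmentSelector` that uses Media Fragments syntax for the time range, and is not a description of SAT's internal schema, which is not given here. The function name, the example URL, and the tag are assumptions.

```python
import json
from typing import Any, Dict


def make_time_annotation(
    video_url: str, start_seconds: float, end_seconds: float, tag: str
) -> Dict[str, Any]:
    """Build a W3C Web Annotation that tags one time range of a video."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": tag, "purpose": "tagging"},
        "target": {
            "source": video_url,
            "selector": {
                "type": "FragmentSelector",
                # Media Fragments URI syntax: t=<start>,<end> in seconds.
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"t={start_seconds},{end_seconds}",
            },
        },
    }


# Hypothetical film URL and tag, for illustration only.
annotation = make_time_annotation("https://example.org/films/film_a.mp4", 12, 18.5, "tractor")
print(json.dumps(annotation, indent=2))
```

Because each annotation names its own source and time range, machine-generated tags and human-written notes can share the same structure.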

Dartmouth College’s Visual Learning Group is already a leader in computer vision and machine learning, developing new tools for object and action recognition. This project has brought together cross-curricular groups at Dartmouth to collaborate on applying modern artificial intelligence and machine learning to historic film collections held by libraries and archives.

Our tool takes search queries expressed in textual form and automatically translates them into image recognition models that can identify the desired segments in the film. The entire search takes only a fraction of a second on a regular computer.  We have a working prototype of the search functionality and are creating a demonstration site that will be featured in the conference presentation. Our initial prototype results, funded by The Knight Foundation, focused on educational films at The Dartmouth Library and The Internet Archive that are common to many libraries and archives.  Our software leverages image recognition algorithms to enable content-based search in video and film collections housed in libraries. By utilizing The Semantic Annotation Tool, the project also works to bring together human- and machine-generated metadata into a single, searchable format.
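
The caching of recognition results mentioned in the overview can be pictured as an inverted index from tags to time-coded segments. The sketch below covers only that lookup step and assumes the tags have already been produced by some recognition model; it does not show how a textual query is translated into an image recognition model. The class names, film identifiers, and tags are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, NamedTuple


class Segment(NamedTuple):
    film_id: str
    start_seconds: float
    end_seconds: float


class TagIndex:
    """Cache of recognition results: maps each tag to the segments where it was detected."""

    def __init__(self) -> None:
        self._segments_by_tag: DefaultDict[str, List[Segment]] = defaultdict(list)

    def add(self, tag: str, segment: Segment) -> None:
        self._segments_by_tag[tag.lower()].append(segment)

    def search(self, query: str) -> List[Segment]:
        """Return the segments tagged with every word in the query, in sorted order."""
        words = query.lower().split()
        if not words:
            return []
        matches = set(self._segments_by_tag.get(words[0], []))
        for word in words[1:]:
            matches &= set(self._segments_by_tag.get(word, []))
        return sorted(matches)


# Hypothetical cached detections.
index = TagIndex()
index.add("tractor", Segment("film_a", 12.0, 18.5))
index.add("field", Segment("film_a", 12.0, 18.5))
index.add("tractor", Segment("film_b", 3.0, 7.0))
print(index.search("tractor field"))
```

Once detections are stored this way, answering a query is a dictionary lookup and a set intersection, so the cost of running the recognition models is paid once per film rather than once per search.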

By improving the cutting-edge algorithms used to create time-coded subject tags (e.g. http://vlg.cs.dartmouth.edu/c3d/), we aim to lay the foundation for a fully searchable visual encyclopedia and to share our methods and open source code with film libraries and archives everywhere. Our goal is to unlock the rich troves of film held by libraries and make them findable and more usable, scene by scene and frame by frame, so that future generations can discover new layers of meaning and impact.


Appendix A

Bibliography
  1. Acland, C. R. and Hoyt, E., eds. (2016). The Arclight Guidebook to Media History and the Digital Humanities. REFRAME Books.
  2. Bertasius, G., Shi, J. and Torresani, L. (2015) “High-for-Low and Low-for-High: Efficient Boundary Detection from Deep Object Features and its Applications to High-Level Vision,” in IEEE International Conference On Computer Vision, ICCV.
  3. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E. and Darrell, T. (2013), “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ArXiv:1310.1531 [Cs], available at: http://arxiv.org/abs/1310.1531 (accessed 23 November 2017).
  4. Geitgey, A. (2017), Face_recognition: The World’s Simplest Facial Recognition Api for Python and the Command Line, Python, available at: https://github.com/ageitgey/face_recognition (accessed 23 November 2017).
  5. Hediger, V. and Vonderau, P. (2009) Films that Work: Industrial Film and the Productivity of Media (Film Culture in Transition) Amsterdam: Amsterdam University Press.
  6. Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press.
  7. Manovich, L. (2016). “The Science of Culture? Social Computing, Digital Humanities, and Cultural Analytics”. The Datafied Society. Social Research in the Age of Big Data, edited by Mirko Tobias Schaefer and Karin van Es. Amsterdam University Press.
  8. Moretti, F. (2013) Distant Reading. Verso Books.
  9. Moretti, F. and Impett, L. (2017), “Totentanz”, New Left Review, No. 107, pp. 68–97.
  10. Ordelman, R., Kleppe, M., Kemman, M. and De Jong, F. (2014), “Sound and (moving) images in focus – How to integrate audiovisual material in Digital Humanities research”, presented at the Digital Humanities 2014, Lausanne, available at: (accessed 15 November 2017).
  11. Posner, M. (2013). "Digital Humanities and Film and Media Studies: Staging an Encounter." Workshop of the Society for Cinema and Media Studies Annual Conference, Chicago. Vol. 8.
  12. Seguin, B., di Leonardo, I. and Kaplan, F. (2017), “Tracking Transmission of Details in Paintings”, presented at the Digital Humanities 2017, Montreal.
  13. Spigel, L. (1992). Make Room for TV: Television and the Family Ideal in Postwar America. University of Chicago Press.
  14. Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M. (2015). “Learning Spatiotemporal Features with 3D Convolutional Networks,” in IEEE International Conference On Computer Vision, ICCV.
  15. Williams, M. (2016). “Networking Moving Image History: Archives, Scholars, and the Media Ecology Project” in The Arclight Guidebook to Media History and the Digital Humanities, Charles R. Acland and Eric Hoyt, eds.