Modeling the Genealogy of Imagetexts: Studying Images and Texts in Conjunction using Computational Methods

Melvin Wevers (, DH Group, KNAW Humanities Cluster and Thomas Smits (, Institute for Historical, Literary and Cultural Studies, Radboud University and Leonardo Impett (, Digital Humanities Institute, EPFL

In his influential article “There are no Visual Media”, theorist of visual culture W.J.T Mitchell argues that “all media are mixed media” (Mitchell, 2005). In earlier work, Mitchell already noted that composite works—media formats that consist of both image and text—cannot be adequately studied by comparing the meaning of these two forms of representation separately (Mitchell, 1994, p. 89). The subject matter of these “imagetexts”, is, rather, the “whole ensemble of relations between media” (Mitchell, 1994, p. 89). In other words, the meaning of one of the components of an imagetext, be it either the image or the text, can only be understood in relation to the other. This paper combines methods from text mining, computer vision, and information theory to increase our understanding of this relationship throughout several historical datasets.

Several scholars have observed that Digital Humanities research mainly focuses on (large-scale) textual analysis (Champion, 2017; Meeks, 2013). Erik Champion, for instance, notes that the influential definition of Digital Humanities by the University of Oxford is entirely “text based and desk based” (Champion, 2017, p. 25). While he rightly claims that research in the Digital Humanities is centered on text, in recent years an increasing number of researchers have started studying visual material, in which has been called “visual big data” (Ordelman et al., 2014; Smith, 2013). Scholars increasingly rely on computational methods to analyze these large digitized visual datasets in innovative ways. Important examples are the work of Seguin (Seguin et al., 2017) on visual pattern discovery in large databases of paintings, Impett and Moretti’s (Impett and Moretti, 2017) large-scale analysis of body postures in Aby Warburg’s Atlas Mnemosyne, and Wevers’ (Wevers and Lonij, 2017) and Smits’s (Smits, 2017; Smits and Faber, 2018) analysis of visual trends in advertisements and images in newspapers. These projects were all presented at DH2017, some during the well-attended pre-conference workshop of the Special Interest Group AudioVisual Material in Digital Humanities (AVinDH).

The recent upsurge of large-scale analysis of visual material shifts the focus in Digital Humanities research away from texts. However, this has also led researchers to approach text and images as disjointed entities. Computational techniques can analyze similarity and change in both textual and visual discourse. Our project applies techniques from both textual and visual computational analysis to a dataset of advertisements for cars extracted from the widely-read Dutch newspaper De Volkskrant between 1945 and 1995, which we extracted from the large collection of digitized newspapers maintained by the National Library of the Netherlands. By juxtaposing change points in text and visual material, we show that the meaning of imagetexts can be studied by looking at the relation between the two forms of representation. Put differently, how does change and continuity in the visual correspond to changes in the textual and vice versa?

Using Kleinberg's burst algorithm, we detected bursty words in the textual content of advertisements (Kleinberg, 2002). These bursts indicate possible change points in advertising discourse that call for closer examination of the advertisements and can be cross-examined with possible changes in the visual content. Also, topic modeling (LDA) was used to detect clusters of advertisements based on textual context. These clusters were compared to cluster based on visual aspects.

Trends, similarities, and points of inflection in the image sets will be traced using a subspace learned by training a Generative Adversarial Network (GAN; see Goodfellow et al. 2016), which has been shown to generate semantically-meaningful vector subspaces. GANs work best with regular sets of images - our visual analysis process is thus twofold. First, we use a pretrained Mobilenet CNN (Howard et al. 2017) to detect objects (cars, trucks, people, etc), and then train individual GANs to explore the visual-semantic space of each object through time.

Whereas a traditional CNN can only encode from image to vector, a GAN can also decode from any vector to generate artificial images; trends or clusters hypothesized in a vectorial subspace can therefore be subjected to a ‘close reading’ of the corresponding artificial images. This generative hermeneutic avoids the ‘black box’ nature of traditional neural network image analysis.

The ability to detect how changes and continuity between text and images correlate increases our understanding of the function of imagetexts in modern culture. It also helps us understand whether the relationship between the two forms of representation became more entangled over time, or whether this entanglement is specific to particular products or specific periods.

Appendix A

  1. Champion, E.M. (2017), “Digital humanities is text heavy, visualization light, and simulation poor”, Digital Scholarship in the Humanities, Vol. 32 No. 1 sup, pp. i25–i32.
  2. Howard, A.., et al. (2017) "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861.
  3. Impett, L. and Moretti, F. (2017), “Totentanz”, New Left Review, No. 107, pp. 68–97.
  4. Kleinberg, J. (2002), “Bursty and Hierarchical Structure in Streams”, P roceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Edmonton, Canada (2002), pp. 91–101.
  5. Meeks, E. (2013), “Is Digital Humanities Too Text-Heavy?”, Digital Humanities Specialist.
  6. Mitchell, W., 1994. Picture Theory. Essays on Verbal and Visual Representation, University of Chicago Press, Chicago.
  7. Mitchell, W.J.T. (2005), “There Are No Visual Media”, Journal of Visual Culture, Vol. 4 No. 2, pp.257–266.
  8. Ordelman, R., Kleppe, M., Kemman, M. and De Jong, F. (2014), “Sound and (moving images) in focus – How to integrate audiovisual material in Digital Humanities research”, ADHO 2014.
  9. Seguin, B., di Leonardo, I. and Kaplan, F. (2017), “Tracking Transmission of Details in Paintings”, ADHO 2017.
  10. Smith, J.R. (2013), “Riding the Multimedia Big Data Wave”, Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, pp. 1–2.
  11. Smits, T. (2017), “Illustrations to Photographs: using computer vision to analyse news pictures in Dutch newspapers, 1860-1940”, ADHO 2017.
  12. Smits, T., Faber, W.J. (2018), “ CHRONIC (Classified Historical Newspaper Images)”, KB Lab, 21 March,
  13. Wevers, M. and Lonij, J. (2017), “Siamese”, KB Lab, 15 October,