Beyond Image Search: Computer Vision in Western Art History
The Digital Humanities is largely a textual field. In part, this is a reflection of the supremacy of the written word in wider academic production; in part, of the history of the Digital Humanities as an interdiscipline in the study of textual sources, from poems to administrative records. As a community, we have become proficient in the creation of images – thinking and even programming through diagrams, visualising the results. The study of images, on the other hand, has had no such computational revival.
Many of these impasses are at least as much a product of technical difficulty as of intellectual habit or institutional inertia. Nelson Goodman (1968) noted that allographic arts follow kinds of notational systems – from poetry in the Japanese alphabets to modernist dance in Labanotation. The dynamics of such cultural phenomena may be far from the statistics of these notations, but these symbolic abbreviations give a first way in for the use of computational techniques in cultural history. The visual arts, on the whole, have no such notational projection. Nonetheless, recent advances in Computer Vision techniques (explored by last year’s AVinDH SIG workshop) allow us to confront the anti-notational head on, fuelled by the genuinely ‘big’ data from mass cultural digitisation programs – PHAROS alone will soon publish 31 million images of Western art and architecture.
The role of symbols in the computational study of images raises questions with profound epistemological consequences. Should the question of the symbolic be avoided (through image search engines), or re-confronted (through the invention of new notational systems)? Our panel aims to bridge a range of views on such matters, exhibiting radical work which is at once critically provocative and technically cutting-edge. Such dissonances echo historiographical debates over the role of the symbolic and the iconic.
Our panel attempts to examine such questions precisely in the contexts in which they are least accepted: in Medieval and Renaissance art history – a highly institutionalised and specialised discipline, compared to cinema and visual studies or Bildwissenschaft. Despite the high degree of technological involvement in digitisation, archiving and exhibition, art history has had almost no computational interventions in criticism – the controversial and pioneering work in the computational analysis of Renaissance images by Robert Tavernor (1995) and Martin Kemp (Criminisi, Kemp and Zisserman 2005) is the exception that proves the rule.
Much ‘Digital Art History’ concerns questions of linked archival metadata, digital publishing, or image digitisation; important questions, but which already have active DH communities. In an institutional sense, then, this panel implies a shift of the computational, from the world of GLAMs to those of university research departments.
The use of computational techniques in art history further opens up important questions in the blurring of relationships between art history, artistic criticism, artistic research and artistic production. Here again, the panel seeks to offer the whole horizon of intellectual opinions, none of which are to be merely dismissed as traditionalist: from the creation of ready-made software to be used as a tool by individual art historians, to algorithmic criticism presented as artistic practice in its own right.
Digital Gesturing in Early Renaissance Italy
The gestures of early Renaissance Italian art are largely read through three lenses: iconography (Garnier 1982 or Barasch 1987), classical rhetoric (van Eck 2007), or universalist theories of expression (Freedberg 2007). This work focuses on a fourth view, complementary rather than contradictory: that of social life, of theatre and sermons, of dance and jesting, of insults and educational manuals.
The computer vision techniques I have developed over the past two years make it possible to automatically identify gesture from images of paintings and reliefs: from precise hand-shapes (e.g. the ‘corni’), to body poses (‘genuflection’) and more subtle body language (‘slouching’). We can visualise patterns, clusters and trends within ‘gesture-space’. Such computations, through their blindness to non-gestural properties (gender, age, musculature, iconography), often propose novel, radical links that are both morphologically precise and visually estranged – leading to an epistemological understanding of this work from Brecht’s Verfremdungseffekt and theory of Gestus.
The period of enquiry, 1300-1480, is chosen to bridge the gap between the two longue durée historical anthropologies of gesture: that of Jean-Claude Schmitt (1990) ending in c.1300, and of Peter Burke (1991) starting from c.1500. However, through such gesture-computational techniques, I give a primary role to the image, both as object of study and source of historical evidence. Such an approach is particularly appropriate to the gestural culture of the early Renaissance; as Dilwyn Knox (1990) has noted, this period is notable for its relative lack of universalist theories on gesture.
My digital approach uniquely allows for the inclusion of a great number of minor works. To study a gesture amongst tens of thousands of images is not just to operate on a different scale to what is possible ‘by hand’, but also to change the object of study: to include a critical mass of works for humbler patrons. My approach, however, is not a kind of ‘distant reading’ in a statistical sense; as I have argued in detail elsewhere, broad statistical statements on such collections are scientifically unjustified. Rather, I use my tools to identify groups of images containing specific gestures or poses: visualising them, curating a small linked database of gesture-images, and examining (including through primary texts) conventional aspects of context, purpose and connotation in detail.
In particular, I focus on gestural implications of the Trecento plagues, drawing on scholarship in its implications for general and medical thinking of the body. Specifically, I consider parallels between the gestures found in Trionfi della Morte and the extreme gestures described by Barasch (1976) in depictions of sorrow or apocalypse. The study of the gestures of the plague, especially in its considerations of post-Galenic / pre-Cartesian understandings of mental and physical health through humourism, will significantly inform more general questions of ‘emotion’ vs. ‘character’ in XIV-XV century gesturality, an opposition which I seek to problematise.
Finally, on a methodological point, I hope to demonstrate the possibility of computer-aided art history which isn’t based on rigid universalist textual taxonomies (which I see as repeating early 1990s mistakes in ‘Expert Systems’ AI), neo-formalist computational iconography, or automaton connoisseurship.
Exploring Large Art Historical Photo Collections
In recent years, museums and institutions have spearheaded global open-access efforts leading to the digitization of many artworks (paintings, engravings, sketches, old photographs etc.). Leading that trend is the PHAROS Consortium, which includes the largest photographic archives in the world and aims to place roughly 31 million images online in the coming years. Through the IIIF standard (International Image Interoperability Framework), these images are now available online in an easily accessible manner, with each institution or museum opening its content through standard APIs. This brings an unprecedented opportunity for interaction and global search across collections.
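As a minimal sketch of what such standard APIs look like in practice, the following Python builds request URLs following the IIIF Image API URI pattern (identifier, region, size, rotation, quality, format). The server address and image identifier are hypothetical placeholders, not actual PHAROS or Cini endpoints.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an image request URL following the IIIF Image API URI pattern.

    `base` and `identifier` are supplied by the caller; the values used
    below are hypothetical and do not point at a real collection.
    """
    # Reserved characters in the identifier (such as "/") must be percent-encoded.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base, identifier):
    """URL of the info.json document describing an image's size and capabilities."""
    return f"{base.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical server and identifier, for illustration only.
base = "https://example.org/iiif"
# "!512,512" asks for the image scaled to fit within a 512 x 512 box.
print(iiif_image_url(base, "archive/photo-0001", size="!512,512"))
print(iiif_info_url(base, "archive/photo-0001"))
```

Because every participating institution exposes the same URL pattern, a single client of this kind could in principle request thumbnails from several collections without collection-specific code.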
Using the case of the digitized photo collection of the Cini (330’000 elements), we focus on two tasks where Computer Vision is particularly useful: duplicate detection and pattern tracking. The first helps to reconcile different collections and detect conflicting metadata, while the second is more akin to a form of visual search. For both tasks, however, we found it crucial to have a good interface with which to explore the collection and acquire training examples.
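To make the idea of duplicate detection concrete, here is a deliberately simple sketch using an average hash and Hamming distance. This is an assumption chosen for illustration and not the method used on the Cini collection; the tiny 4x4 grayscale grids stand in for downsampled photographs, and all names are hypothetical.

```python
from itertools import combinations


def average_hash(pixels):
    """Hash a small grayscale grid: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def find_duplicates(images, max_distance=2):
    """Return pairs of image names whose hashes differ in at most max_distance bits."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    return [(a, b) for a, b in combinations(sorted(hashes), 2)
            if hamming(hashes[a], hashes[b]) <= max_distance]


# Toy data: photo_b is a uniformly brighter scan of photo_a; photo_c is unrelated.
photo_a = [[10, 20, 200, 210], [15, 25, 205, 215],
           [12, 22, 202, 212], [11, 21, 201, 211]]
photo_b = [[value + 5 for value in row] for row in photo_a]
photo_c = [[200, 210, 10, 20], [205, 215, 15, 25],
           [202, 212, 12, 22], [201, 211, 11, 21]]

print(find_duplicates({"photo_a": photo_a, "photo_b": photo_b, "photo_c": photo_c}))
```

Pairs flagged in this way could then be checked against each other's metadata, which is where conflicting attributions or dates would surface.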
An important concern when designing our search interface was to give the researcher as much freedom as possible. In addition to a traditional textual query based on metadata, we offer two types of visual query, which can themselves be combined with metadata filters. Two types of visualization also allow the researcher to explore search results in different ways and to save connections between images.
A visual search system is tightly coupled to an underlying notion of visual similarity. However, iconographic similarity is not the same as the similarity between two paintings made by the same artist but representing different subjects. The central question of what sort of results the researcher is looking for should be part of every project involving visual search. Here, we propose to tackle it by having users edit a graph of visual connections between images, which allows the search system to train itself on what to retrieve from these very large databases. For example, such a system allowed us to retrieve drawings or engravings from a picture of the original painting, or the reuse of a pattern by followers of an artist, all without looking at a single metadata entry.
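One way to picture how a user-edited graph could feed back into the search is sketched below. It assumes, purely for illustration, that each image is represented by a feature vector, that retrieval ranks by cosine similarity, and that confirmed connections are turned into (anchor, positive, negative) triplets of the kind used in metric learning. None of these choices is confirmed by the text above; the feature values and image names are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_id, features, top_k=3):
    """Rank all other images by similarity to the query image."""
    query = features[query_id]
    scored = [(cosine(query, vec), name)
              for name, vec in features.items() if name != query_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def triplets_from_graph(edges, features):
    """Turn user-confirmed connections into (anchor, positive, negative) triplets.

    Every image not linked to the anchor is treated as a negative, which is
    a simplifying assumption made for this sketch.
    """
    linked = {}
    for a, b in edges:
        linked.setdefault(a, set()).add(b)
        linked.setdefault(b, set()).add(a)
    triplets = []
    for anchor in sorted(linked):
        for positive in sorted(linked[anchor]):
            for negative in sorted(features):
                if negative != anchor and negative not in linked[anchor]:
                    triplets.append((anchor, positive, negative))
    return triplets


# Invented two-dimensional features; real descriptors would be far larger.
features = {"painting": [1.0, 0.0], "engraving": [0.9, 0.1],
            "drawing": [0.8, 0.3], "unrelated": [0.0, 1.0]}
print(search("painting", features, top_k=2))
print(triplets_from_graph([("painting", "engraving")], features))
```

The triplets express the researcher's own notion of similarity: a model trained on them would be pushed to place connected images closer together than unconnected ones, whatever the visual criterion behind the connection.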
Iconography, Pose Recognition and a Grammar of Gestures
Description and iconography are the first and principal steps of an art historical analysis. Art history has long created taxonomies and encyclopedias of iconography, so many images of Western art can be easily identified and categorized. In the case of very common representations, like the prominent scenes of the gospels, this amounts to iconographic categorization – these categories are present as metadata in art historical image repositories, but remain too broad to represent the variations of motifs and historical change. Diachronically related works (for example medieval art and the ‘Nazarener’) are not necessarily linked.
To visualize these differences and connections, we follow a strictly form-oriented, anthropocentric approach, focused especially on the positions, gestures and interactions of figures. State-of-the-art algorithms support this approach by reliably detecting the position of the main figures in an image. From about 60 illustrated gospel scenes of the life of Jesus, we chose two very prominent narratives as case studies: the annunciation and the baptism. Whereas the baptism is normally presented in one central moment, Michael Baxandall has shown that at least five different phases can be distinguished in the annunciation. We can therefore show both the variations in composition and gesture of one central moment and the movements within a plot. Details such as appearance, the use of different objects (e.g. shell or bowl for baptism), and the composition of the images can be analysed easily via this method. We carry out this comparison with about a thousand images for each prominent scene, using neural-network-based full-body recognition algorithms and nonlinear dimensionality reduction.
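The comparison of figures by pose can be sketched as follows, assuming that a pose detector has already returned a list of (x, y) keypoints per figure. The normalisation (centring and rescaling) and the distance measure are illustrative choices rather than the project's actual pipeline, and the keypoints are invented.

```python
import math


def normalise_pose(keypoints):
    """Centre a pose on its centroid and rescale it to unit root-mean-square size."""
    n = len(keypoints)
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    centred = [(x - cx, y - cy) for x, y in keypoints]
    scale = math.sqrt(sum(x * x + y * y for x, y in centred) / n)
    if scale == 0:
        return centred
    return [(x / scale, y / scale) for x, y in centred]


def pose_distance(a, b):
    """Root-mean-square distance between two normalised poses with matching keypoints."""
    na, nb = normalise_pose(a), normalise_pose(b)
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2
                for (xa, ya), (xb, yb) in zip(na, nb))
    return math.sqrt(total / len(na))


# Invented four-point "poses": the second is the first, moved and enlarged.
standing = [(0, 0), (0, 2), (-1, 1), (1, 1)]
standing_elsewhere = [(10 + 3 * x, 5 + 3 * y) for x, y in standing]
raised_arm = [(0, 0), (0, 2), (-1, 1), (1, 2.5)]

print(pose_distance(standing, standing_elsewhere))
print(pose_distance(standing, raised_arm))
```

A matrix of such pairwise distances over a thousand images is the kind of input that a nonlinear dimensionality reduction method can project into two dimensions, so that figures with similar poses appear close together regardless of where and at what size they occur in the image.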
The goal of the project is to analyse a huge corpus of Christian art, detecting gospel scenes from the life of Jesus, and to compare each single scene in its own tradition, within its plot and with every other scene. This method helps us to visualize what ‘iconography’ really consists of, and how it changes over time. It helps to reconstruct the inner relationship between motifs and narrative structures, and finally opens up a close reading of the gospels, apocrypha and other theological writings (and concepts) which are the basis of the representations. Instead of being an image search, the aim of the approach is to visualize the similarities in a plot, so that structures and clusters can be interpreted from a distance and closely connected images can be analysed in detail. Finally, the approach helps to understand the use of nonverbal language in art and to reconstruct changes in concepts of the human body and interaction.
Understanding Art: Distant Viewing Meets Close Reading
Computer vision and object retrieval are well suited for understudied, large, and, until now, untagged image datasets containing repetitive or similar motifs. Visual elements can be directly retrieved by means of computer vision, enabling specific search functions and reducing the need for textual annotation of the digitized data, which can be a costly part of database projects. Especially in the case of large numbers of equal or similar subjects, computer-based algorithms can probe images and provide surveys and information via statistical visualizations that shed new light on the material. In addition, by assembling series of images based on similarities, computer-based analysis can facilitate attribution to an artist, a date or dependent art works.
Furthermore, the content and composition within the images can be analyzed by detecting objects and their respective locations. The development of computer-based algorithmic image analysis will force us to reconsider the role of human connoisseurship; but first and foremost computer vision can assist by processing the enormous visual data of cultural heritage. On the one hand, the variety of cultural heritage across societies, media and materials requires flexible visual search. On the other hand, cultural heritage is based on many canonic pictorial subjects, common symbols, scriptures, and icons as well as concrete objects and standardized ornaments. There is therefore also a strong need for a search algorithm which can retrieve specific objects, subjects and characters.
The paper explores the opportunities to support the research of scholars from various fields. Art history as well as archeology, visual studies and other image-oriented research fields can benefit from visual retrieval not only to find details and objects but also to analyze their similarity, strategies of variation and reproduction, and visual as well as semantic connections. Computer vision, in turn, benefits from the challenging problems posed by the humanities, which stimulate technical advances and lead to improvements in the understanding of visual representations.
Nevertheless, there are limits to the computer vision of art. The semantic gap and elaborate concepts of iconography and iconology, as well as the various shades of attribution in the field of connoisseurship, are challenging problems. Visual similarity itself has incommensurable and contradictory definitions: through motif, composition, style, technique. The paper points out the boundaries of computer vision concerning art and reflects on possible solutions.
Barasch, M., (1976). Gestures of despair in medieval and early Renaissance art. New York: New York University Press.
Barasch, M., (1987). Giotto and the Language of Gesture. Cambridge: Cambridge University Press.
Burke, P., (1991). The language of gesture in early modern Italy. In Bremmer, J and Roodenburg, H. (eds), A cultural history of gesture, pp.71-83. Cornell University Press.
Criminisi, A., Kemp, M. and Zisserman, A., (2005). Bringing pictorial space to life: computer techniques for the analysis of paintings. In Digital art history: A subject in transition, pp. 77-100, Intellect.
Freedberg, D.A., (2007). Empathy, motion and emotion. In: Herding, C and Krause-Wahl A. (eds.), Wie sich Gefühle Ausdruck verschaffen: Emotionen in Nahsicht, pp.17-51. Driesen.
Garnier, F., (1982). Le langage de l’image au Moyen Âge. II. Grammaire des gestes. Le Leopard d’or.
Goodman, N., (1968). Languages of art: An approach to a theory of symbols. Hackett publishing.
Knox, D., (1990). Ideas on Gesture and Universal Languages, c. 1550-c. 1650. In: Henry, J. and Hutton, S. (eds.), New Perspectives on Renaissance Thought. Essays in the History of Science, Education and Philosophy in Memory of Charles B. Schmitt, pp. 101-136. Duckworth: London.
Schmitt, J.-C., (1990). La raison des gestes dans l’occident médiéval. Editions Gallimard: Paris.
Tavernor, R., (1995). Architectural history and computing. arq: Architectural Research Quarterly, 1(1), pp.56-61.
Van Eck, C., (2007). Classical rhetoric and the arts in early modern Europe. Cambridge and New York: Cambridge University Press.