Machine Reading Part II: Advanced Topics in Word Vectors

Eun Seo Jo (eunseo@stanford.edu), Stanford University, History Department y Javier de la Rosa Pérez (versae@stanford.edu), Stanford University, Center for Interdisciplinary Digital Research y Scott Bailey (scottbailey@stanford.edu), Stanford University, Center for Interdisciplinary Digital Research y Fernando Sancho (fsancho@us.es), Dept. of Computational Sciences and Artificial Intelligence at the University of Seville

Description

This half day workshop is an introduction to word vectors and text vectorization broadly. We will focus on building intuition of how word vectors work, incorporating visualization methods, using pre-trained vectors, and exploring applications of word embeddings. We will teach you both the high-level concepts and the practical usages of these widely used analytical tools for text analysis in digital humanities (DH). It is a hands-on workshop with practical activities for the participants starting with a review of word vectors by way of visualization, an overview of downloadable word vectors, and examining the potential pitfalls of using word vectors in humanistic analysis and the methods for mitigating these issues. Given the general applicability of machine learning models in real life, addressing issues concerning biased models, datasets, and algorithms, is of vital importance for correct interpretation of their applications.

We will provide a Python Jupyter Notebook and an accompanying text corpus that we will work through as a group. By the end of the workshop, the participants will have working knowledge of how and where to download or train word embeddings and the caveats of using them.

Relevance to the DH Community

Since the apparition of analytical approaches to distant reading and macro-analysis, popularized by Moretti and Jockers, and the possibility of access to huge amounts of textual data and long-term studies such as Culturomics, new tools were needed to tackle the increasing complexity of large corpora. Borrowing from advances in machine learning and computational linguistics, digital humanists have experimented with various methods of text quantification for interpreting macro contours of culture and language. In particular, word vectors have gained recognition for their versatility in DH studies. Scholars have used word vectors in a variety of tasks such as measuring similarity in word meaning (Caliskan et al., 2017), authorship attribution (Kocher et al., 2017), or dialogism in novels (Muzny et al., 2017).

This workshop is both a theoretical and practical introduction to humanist applications of these methods. Those interested in large scale text-analysis of any corpora will learn the basics of transforming textual data into numerical form.

Instructors

Eun Seo Jo researches the language of American foreign relations in historical contexts and applications of NLP and ML in history. She is a PhD candidate in history at Stanford University where she is also a member of the Literary Lab and a Digital Humanities Fellow. She has presented at various DH conferences and is a DH methodology consultant at Stanford.

Scott Bailey is a Research Developer in the Center for Interdisciplinary Digital Research in the Stanford University Libraries. He collaborates and consults on research projects across the humanities and social sciences, and teaches workshops on tools and methods in digital scholarship, such as natural language processing. His research ranges from vulnerability in the context of theological anthropology to computational approaches to systematic and historical theological works, such as Karl Barth’s Church Dogmatics .

Javier de la Rosa is a Research Engineer at the Center for Interdisciplinary Digital Research, a unit at the Stanford University Libraries focused on digital scholarship. He is an active member of the DH scholarly community at Stanford and regularly participates in conferences, professional organizations, and teaches workshops and tutorials to faculty and graduate students. He holds a Post-doctorate research fellowship and a PhD in Hispanic Studies at Western University, Ontario, where he also served as Tech Lead for the CulturePlex Lab. He completed both his MSc. in Artificial Intelligence and BSc. in Computer Engineering at University of Seville, Spain. His work and interests span from cultural network analysis and computer vision, to text mining and authorship attribution in the Spanish Golden Age of literature.

Fernando Sancho is an Associate Professor at the Dept. of Computational Sciences and Artificial Intelligence at the University of Seville, and holds a PhD by the same university. He has worked in topics ranging from complex systems, and data analysis to cultural objects studies. He has regularly collaborated with the CulturePlex Lab at the University of Western Ontario, and the Complex Systems Modeling Group at University of Central Ecuador.

Target Audience and Prereqs

Post-docs, faculty, and advanced graduate students with Python prerequisites. Although the main concepts will be overviewed, knowledge of basic word embeddings and word2vec specifically would be desirable. In order to participate fully in all activities, participants must have working knowledge of basic programming concepts, the Python language, data structures, and the Numpy library.

  • Technical Support: Microphones and Projector
  • Proposed Length: Half-day (4 hours; 4 sessions)
  • Medium: Notebook (Jupyter)
  • Libraries: Numpy, Pandas, Textacy, SpaCy, Gensim, scikit-learn, matplotlib

Workshop Outline

The workshop is split into four 50 min sessions with 10 minutes breaks in-between. We teach several methods in each unit with increasing difficulty. The schedule is broken down below:

  1. Understanding Word Vectors with Visualization

This unit will give a brief introduction of word vectors and word embeddings. Concepts needed to understand the internal mechanics of how they work will also be explained, with the help of plots and visualizations that are commonly used when working with them.

  • 0:00 - 0:20 From word counts to ML-derived Word Vectors (SVD, PMI, etc.)
  • 0:20 - 0:35 Clustering, Vector Math, Vector Space Theory (Euclidean Distance, etc.)
  • 0:35 - 0:50 [Activity 1] Visualizations (Clustering, PCA, t-SNE) [We provide vectors]
  1. Word Vectors via Word2Vec

This unit will focus on Word2Vec as an example of neural net-based approaches of vector encodings, starting with a conceptual overview of the algorithm itself and end with an activity to train participants’ own vectors.

  • 0:00 - 0:15 Conceptual explanation of Word2Vec
  • 0:15 - 0:30 Word2Vec Visualization and Vectorial Features and Math
  • 0:30 - 0:50 [Activity 2] Word2Vec Construction [using Gensim] and Visualization(from part 1) [We provide corpus]
  1. Extended Vector Algorithms and Pre-trained Models

This unit will explore the various flavors of word embeddings specifically tailored to sentences, word meaning, paragraph, or entire documents. We will give an overview of pre-trained embeddings including where they can be found and how to use them.

  • 0:00 - 0:20 Overview of other 2Vecs & other vector engineering: Paragraph2Vec, Sense2Vec, Doc2Vec, etc.
  • 0:20 - 0:35 Pre-trained word embeddings (where to find them, which are good, configurations, trained corpus, etc.)
  • 0:35 - 0:50 [Activity 3] Choose, download, and use a pre-trained model
  1. Role of Bias in Word Embeddings

In this unit, we will explore an application and caveat of using word embeddings -- cultural bias. Presenting methods and results from recent articles, we will show how word embeddings can carry historical bias of the corpora trained on and lead an activity that shows these human-biases on vectors and how they can be mitigated.

  • 0:00 - 0:10 Algorithmic bias vs human bias
  • 0:10 - 0:40 [Activity 4] Identifying bias in corpora (occupations, gender, ...) [GloVe] (Caliskan et al., 2017)
  • 0:40 - 0:50 Towards unbiased embeddings; Examine “debiased” embeddings
  • 0:50 - 0:60 Conclusion remarks and debate

Appendix A

Bibliography
  1. Caliskan, A., Bryson, J.J., Narayanan, A., 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186. https://doi.org/10.1126/science.aal4230
  2. Kocher, M., Savoy, J., 2017. Distributed language representation for authorship attribution. Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqx046
  3. Nanni, F., Dietz, L., Ponzetto, S.P., 2017. Toward a computational history of universities: Evaluating text mining methods for interdisciplinarity detection from PhD dissertation abstracts. Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqx062