The Time-Us project. Creating gold data to understand the gender gap in the French textile trades (17th–20th century)
The role of women in industrial development is now widely recognized in sociological and economic studies of developing countries and of the first industrial revolution in Europe. Yet data on their remuneration, schedules and domestic work, and on those of men working in the same sectors, remain deficient for many regions, especially for France. The Time-Us project aims to reconstruct the remuneration and time budgets of women and men working in the textile trades in four French industrial regions (Lille, Paris, Lyon, Marseille) in a long-term perspective, by bringing together a multidisciplinary team of historians, natural language processing (NLP) experts and sociologists. It will create comparable series on the remuneration and time allocation of employed men and women (i) through classical sources and company and trade association archives, and (ii) by piecing together a series of qualitative sources identifying words and actions associated with work in both domestic and non-domestic activities. The project will provide keys to understanding the gender gap by analyzing changes in work and time use during the first industrialization process.
The Time-Us team works on a heterogeneous corpus of French handwritten and printed sources spanning the seventeenth to the twentieth century. These documents are mainly preserved in French local archives in the four industrial regions mentioned above (for instance, the Archives municipales de Lyon, the Bibliothèque municipale de Lyon, or the Bibliothèque nationale de France in Paris). The corpus brings together numerous historical sources, including court decisions, petitions, police reports and files, and sociological surveys on the living conditions of the working class (especially Les monographies de famille de l’École de Le Play, or Le Play’s family budgets (Hincker, 2001)). Many of these documents are manuscripts, written by various hands over long periods of time (more than a hundred years for the “Registre de contraventions aux règlements des métiers”, which begins in 1670 and ends in 1781; Lyon, Archives municipales).
This unpublished set of documents constitutes an important corpus of historical sources that is well-suited for applying computational analysis. In this paper, we will present the approach adopted by the Time-Us team to analyze this corpus. We will also discuss the prospects opened up by this project for historical research in terms of digital research workflow.
Our goal is to apply NLP methods to heterogeneous historical documents in order to identify and analyze the relevant semantic or syntactic patterns that describe work, remuneration and time budgets. The application of such methods, mainly parsing, will facilitate the analysis of the corpus by creating series of comparable quantitative and qualitative data:
- Quantitative data on remuneration, household budgets and time spent on domestic (or unpaid) and non-domestic (or paid) work by women and men.
- Qualitative data on paid and unpaid tasks performed by women at home and at work, namely the type of task, its description, its duration and its results. Computational methods will also be used to extract statements describing the women performing these tasks (occupation, social status, age, marital status, family composition) and the relationships between the actors involved in these tasks, especially between men and women (family relationships such as husband and wife, brothers and sisters, mothers and sons, or working relationships such as employers and employees).
The analyzed sources take many different forms. We therefore chose to work closely with economic and labour historians in the data modelling process. Because the corpus gathers diverse historical sources, defining a light and flexible annotation schema that brings together history and language processing experts is a major step towards creating “gold data” to train parsing models. These gold data take the form of annotated texts encoded in TEI (Text Encoding Initiative). TEI can be seen as a bridge representation between historians and NLP experts: historians annotate a first set of documents in TEI in order to create training data that NLP experts can easily process and analyze. The choice of TEI also allows for the creation of sustainable data that can be reused in the long term by other projects and researchers. We also aim to create a flexible TEI data model that can represent different types of data and that will enable NLP experts to extract comparable information, such as quantitative data (amounts of money, periods of time…). This model could thus be reused by other research projects, especially, but not only, in economic and labour history.
A first step is the transcription of the manuscripts into a simple TEI representation covering the text and a set of metadata. This task is anything but trivial, given the diversity of the sources mentioned above, but it is outside the scope of this paper. The representation is then enriched with annotation layers. The first annotation layer is the recognition of tasks and occupations, linked to their associated amounts of money and to the actors of the transaction. The extraction of named entities such as person and place names is also necessary in order to properly analyse how gender and location influence remuneration.
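To make the annotation layers more concrete, the following minimal sketch shows how annotated TEI could be turned into comparable records. It is an illustration under stated assumptions, not the project's actual schema: the sample sentence is invented, and the choice of the TEI elements `persName`, `placeName`, `rs` and `measure`, as well as the attribute values (`type="occupation"`, `type="currency"`, `unit="sols"`), are hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace; the prefix "tei" is only a local alias used in the queries.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample sentence, NOT taken from the Time-Us corpus.
# Element and attribute choices are assumptions made for illustration.
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  <persName>Marie Dupont</persName>, <rs type="occupation">tisseuse</rs> de
  <placeName>Lyon</placeName>, gagne
  <measure type="currency" quantity="15" unit="sols">quinze sols</measure> pour
  <measure type="duration" quantity="1" unit="day">un jour</measure>.
</p>
"""


def text_of(elem):
    """Return the text content of an element with whitespace normalized."""
    return " ".join("".join(elem.itertext()).split())


def extract_annotations(xml_string):
    """Collect persons, places, occupations and measures from a TEI fragment."""
    root = ET.fromstring(xml_string.strip())
    measures = []
    for m in root.findall(".//tei:measure", NS):
        quantity = m.get("quantity")
        measures.append({
            "type": m.get("type"),
            "quantity": float(quantity) if quantity is not None else None,
            "unit": m.get("unit"),
            "text": text_of(m),
        })
    return {
        "persons": [text_of(e) for e in root.findall(".//tei:persName", NS)],
        "places": [text_of(e) for e in root.findall(".//tei:placeName", NS)],
        "occupations": [
            text_of(e)
            for e in root.findall(".//tei:rs[@type='occupation']", NS)
        ],
        "measures": measures,
    }


if __name__ == "__main__":
    print(extract_annotations(SAMPLE))
```

Keeping the normalized value in `quantity` and `unit` while preserving the original wording as element text is one way to obtain comparable figures without losing the source transcription.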
The annotation process will start as a collaborative effort, in order to obtain a first dataset that can be used to train or configure NLP tools, and also to help the NLP experts and historians design a precise annotation guide together. At a later stage, we will progressively deploy more automatic NLP tools to create these annotations. In this regard, we plan to identify the elements of vocabulary (tasks, products) and the phrases of interest (e.g. “someone was paid (this amount) for (this product) for a (given amount of time)”), using knowledge acquisition techniques based on the distributional hypothesis and on syntactic analysis of the corpus. Knowledge of the domain will allow us to define syntactic extraction patterns to be applied to the corpus in order to detect and annotate specific instances of tasks, products, money and people, and the relationships between these pieces of information. Some human validation will still be needed to filter the vocabulary, refine the patterns, and propose missing elements (vocabulary and patterns). Language processing will be conducted with the French processing chain developed by the INRIA Almanach team, and in particular with the FRMG parser (Morardo and de La Clergerie, 2014). Parsing produces dependencies between words, allowing us to identify who does what, when and how for a given event. The processing chain has already been used several times for knowledge acquisition in specific domains (legal, medical). In our case, specific issues may arise from the quality of the transcriptions and from the peculiarities of the language used, which contains archaic constructions, whereas our parser was designed for contemporary French.
Example of a parse for one sentence of the corpus
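The pattern-based extraction described above can be illustrated with a small sketch. It does not use the FRMG parser or reproduce its output format: the `Dep` structure, the relation labels (`subject`, `amount`, `product`, `duration`), the verb list and the sample parse are all hypothetical simplifications, written by hand for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dep:
    """One dependency edge (hypothetical, simplified representation)."""
    head_id: int      # position of the governing word in the sentence
    head_lemma: str   # lemma of the governing word
    rel: str          # dependency label (assumed label set)
    dependent: str    # text of the dependent phrase


# Assumed list of verbs that signal a remuneration event.
PAY_VERBS = {"payer", "recevoir", "gagner"}

# Mapping from assumed dependency labels to the slots of the pattern
# "someone was paid (amount) for (product) for (duration)".
ROLE_LABELS = {
    "subject": "payee",
    "amount": "amount",
    "product": "product",
    "duration": "duration",
}


def extract_payments(deps):
    """Group edges by governing verb and keep events with payee and amount."""
    events = {}
    for d in deps:
        if d.head_lemma in PAY_VERBS and d.rel in ROLE_LABELS:
            event = events.setdefault(d.head_id, {"verb": d.head_lemma})
            event[ROLE_LABELS[d.rel]] = d.dependent
    return [
        event for _, event in sorted(events.items())
        if "payee" in event and "amount" in event
    ]


# Hand-built parse of an invented sentence, not corpus data.
SAMPLE_PARSE = [
    Dep(1, "ouvriere", "det", "la"),
    Dep(3, "payer", "subject", "ouvriere"),
    Dep(3, "payer", "amount", "vingt sols"),
    Dep(3, "payer", "product", "piece de soie"),
    Dep(3, "payer", "duration", "trois jours"),
    Dep(9, "gagner", "subject", "tisseur"),  # no amount: filtered out
]

if __name__ == "__main__":
    for event in extract_payments(SAMPLE_PARSE):
        print(event)
```

Events lacking a payee or an amount are discarded here, which mirrors the need for human validation mentioned above: incomplete matches are candidates for review rather than reliable data.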
Because the annotation task is mainly collaborative, the need for a shared framework has emerged. Several digital projects have already taken into account the specific needs of historians in terms of image visualization, transcription and collaboration. For instance, the Transkribus interface enables Humanities scholars to transcribe handwritten and printed historical sources, and offers a very powerful Handwritten Text Recognition engine. The Transcribe Bentham project takes into account the collaborative dimension of transcribing historical documents. The Old Bailey transcription project uses a combination of hand encoding and automatic recognition and extraction systems. Nevertheless, these projects do not address all the requirements of Humanities scholars working on primary sources, and the need for comprehensive Digital Humanities-based publishing systems is emerging. We have chosen to set up a specific digital workflow enabling historians and NLP experts to work together. We will present the solution that has been put in place, in particular a customized wiki with:
- the Transcribe Bentham transcription desk, adapted to our needs,
- and a TEI toolbar, specifically customized for tagging named entities and measures.
Customization of the TEI toolbar
- Clergerie, É. D. L., Sagot, B., Stern, R., Denis, P., Recourcé, G. and Mignot, V. (2009). Extracting and Visualizing Quotations from News Wires. vol. 6562. Springer, pp. 522–32, doi:10.1007/978-3-642-20095-3_48. https://hal.inria.fr/inria-00607463/document (accessed 24 April 2018).
- Hincker, L. (2001). Les monographies de famille de l’École de Le Play. Les Études sociales, n 131-132, 1er et 2e semestres 2000. Revue d’histoire du XIXe siècle. Société d’histoire de la révolution de 1848 et des révolutions du XIXe siècle(23): 274–76.
- Morardo, M. and Clergerie, É. V. de L. (2014). Towards an environment for the production and the validation of lexical semantic resources. https://hal.inria.fr/hal-01005464/document (accessed 24 April 2018).
- Seaward, L. and Kallio, M. (2017). Transkribus: Handwritten Text Recognition technology for historical documents. Montréal https://dh2017.adho.org/abstracts/649/649.pdf (accessed 24 April 2018).
- Thomasset, F. and Clergerie, É. D. L. (2005). Comment obtenir plus des Méta-Grammaires. Proceedings of TALN’05. Dourdan, France.
- University College London UCL Transcribe Bentham http://blogs.ucl.ac.uk/transcribe-bentham/ (accessed 24 April 2018).
- Old Bailey Online - The Proceedings of the Old Bailey, 1674-1913 - Central Criminal Court https://www.oldbaileyonline.org/ (accessed 26 April 2018).