“Moon:” A Spatial Analysis of the Gumar Corpus of Gulf Arabic Internet Fiction
The Gumar Corpus (https://camel.abudhabi.nyu.edu/gumar/) consists of 110 million words from 1,200+ Internet forum novels written in a conversational style about romantic topics. Although the corpus was originally harvested and annotated for dialectal Arabic (DA) natural language processing, the material is also of cultural and sociological significance for the study of popular culture in the Gulf Arab region. The corpus’ name comes from the Gulf Arabic word for moon {gumεr}, a popular Arabic term of endearment. Although the genre is all but unknown outside the Arab world, the Arabic blogosphere and social media are full of discussions about these “net novels,” the authors of which are purportedly young women. In addition to Modern Standard Arabic (MSA), the corpus distinguishes five dialect varieties that map to roughly 12 national sub-varieties of dialectal Arabic; usually only one tag is assigned to each Internet novel. Our poster is a first attempt to tap into the cultural richness of the corpus using methods adapted to the Arabic language, in particular from the angle of spatial analysis of corpora.
The Internet novels sometimes identify their country of origin in a short prologue, but there are additional clues as to their provenance, including the fact that they are all written in DA, which is not necessarily the native dialect of the author. Much progress has been made in information extraction and NLP for Arabic in the last decade, but for dialectal forms much work remains to be done to catch up with the tools available for Western languages. Even though we would not expect significant variance in toponyms in DA, initial attempts at extracting place names directly from the Arabic corpus posed a methodological challenge, particularly for disambiguation. Practical workarounds, often translingual and routed through English, are sometimes adopted in such cases with Arabic; BetaCode, for example, uses Latin script to handle the vocalization, or partial vocalization, of texts (Romanov, 2015).
With the goal of extracting place name entities from Gumar, our pilot study carried out morphological analysis and disambiguation on the texts using MADAMIRA (Pasha et al., 2014), a tool that currently functions for both MSA and Egyptian Arabic. The configuration for Egyptian has been shown to outperform the MSA setting when compared to a manually annotated sample of 4K words from the Gumar corpus (Khalifa et al., 2016). Since the MADAMIRA morphological annotation provides both the lemma of a word and the English translation of the lemma, we build an English approximation of each novel and run it through the Stanford Named Entity Recognizer to detect locations (Finkel et al., 2005).
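The gloss-based step described above can be sketched as follows. This is a minimal illustration and not the pilot study's code: the `Analysis` record, its field names, the `;` separator between gloss alternatives, and the transliterated sample tokens are all hypothetical stand-ins for the lemma and English gloss that a morphological analyzer supplies. The actual MADAMIRA output format and the call to Stanford NER are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Analysis:
    """Hypothetical per-token record: surface form, lemma, English gloss."""
    token: str
    lemma: str
    gloss: str  # assumed to hold alternatives separated by ';'


def english_approximation(analyses: List[Analysis]) -> List[str]:
    """Return one English gloss per source token, preserving alignment."""
    words = []
    for a in analyses:
        # Keep the first gloss alternative; fall back to the lemma if empty.
        first = a.gloss.split(";")[0].strip()
        words.append(first if first else a.lemma)
    return words


# Invented sample, transliterated for readability.
sample = [
    Analysis("safarna", "safar", "travel;journey"),
    Analysis("ila", "ila", "to"),
    Analysis("landan", "landan", "London"),
]
approx = english_approximation(sample)
print(" ".join(approx))
```

Keeping one gloss per source token means that any location later detected in the English approximation can be traced back to the Arabic token that produced it.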
Using Stanford NER, 19,000+ occurrences of some 400+ distinct locations were extracted from the aggregate of the novels. Having English versions of the place names made geocoding a relatively straightforward process. Geovisualization shows that the highest frequencies of place names are found in the Arabian Gulf, Iraq, bilad as-Sham and Al-Andalus (southern Spain), as well as in England, France and Germany. Given that about sixty percent of the novels are identified as being in the dialect of Saudi Arabia, the high frequency of mentions of the Kingdom seems predictable. On the other hand, the places mentioned are not specific locales, in contrast to the city-level geographies of Palestine and Iraq. More detailed analysis of such specificity of place will need to be carried out through subsequent close reading of the corpus.
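The aggregation and geocoding steps above can be illustrated with a short sketch. It assumes that location mentions have already been extracted per novel. The tiny `GAZETTEER` (with approximate coordinates), the function names, and the sample data are hypothetical, and the geocoding service actually used in the study is not specified in the text.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical mini-gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "Riyadh": (24.71, 46.68),
    "London": (51.51, -0.13),
    "Baghdad": (33.31, 44.37),
}


def count_locations(novels: Iterable[List[str]]) -> Counter:
    """Aggregate location mentions across all novels."""
    counts: Counter = Counter()
    for locations in novels:
        counts.update(locations)
    return counts


def geocode(counts: Counter):
    """Attach coordinates to counted places; collect names with no match."""
    points, unresolved = [], []
    for name, n in counts.most_common():
        if name in GAZETTEER:
            lat, lon = GAZETTEER[name]
            points.append((name, lat, lon, n))
        else:
            unresolved.append(name)
    return points, unresolved


# Invented sample: one list of detected locations per novel.
novels = [["Riyadh", "London", "Riyadh"], ["Baghdad", "Riyadh", "Paris"]]
counts = count_locations(novels)
points, unresolved = geocode(counts)
```

Names left in `unresolved` are candidates for manual review, either as gaps in the gazetteer or as possible recognition errors.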
While the corpus is a rare opportunity both to work with contemporary popular-culture Arabic in the textual digital humanities and to experiment with named entity recognition methods for non-Western contexts, caution must be exercised in our interpretations, since methods that work well for Western languages are much more tentative in the (regional) Arabic case. For example, cross-checking against the Arabic texts in the corpus revealed errors in which the DA colloquialisms {kif} (“what?”) and {bliz} (“please”) generated the high-frequency false locations “Kiev” and “Belize.” As our research evolves, we intend to benchmark other Arabic-only tools for entity recognition to test their stability and performance on the set of materials in question (Gahbiche-Braham et al., 2013; Shaalan, 2014). Time permitting, we would like to begin correlating topic and geography, an approach recently labelled “geospatial semantics” (Gavin and Gidal, 2017), but for the transnational, multiregional context of Arabic. Our hope is to use the Gumar corpus to undertake more in-depth analysis of a Gulf Arabic geopoetics of romance.
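One way to automate the cross-check against the Arabic source described above is sketched below. It assumes that each detected location can be paired with the source token that produced it. The `KNOWN_FALSE_FRIENDS` table is seeded only with the two errors reported here; the function name and the extra sample token are hypothetical.

```python
from typing import Dict, List, Tuple

# Transliterated colloquialism -> the spurious location it generated.
KNOWN_FALSE_FRIENDS: Dict[str, str] = {"kif": "Kiev", "bliz": "Belize"}


def filter_locations(
    hits: List[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Split (source_token, detected_location) pairs into kept and rejected.

    A hit is rejected only when its source token is a known colloquialism
    that maps to exactly the location that was detected.
    """
    kept, rejected = [], []
    for source, location in hits:
        if KNOWN_FALSE_FRIENDS.get(source) == location:
            rejected.append(location)
        else:
            kept.append(location)
    return kept, rejected


# Invented sample of aligned hits.
hits = [("kif", "Kiev"), ("landan", "London"), ("bliz", "Belize")]
kept, rejected = filter_locations(hits)
```

Matching on the source token rather than on the English place name alone means that a genuine mention of Kiev or Belize, arising from a different Arabic token, would still be kept.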
Appendix A
- Finkel, J. R., Grenager, T. and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
- Gahbiche-Braham, S., Bonneau-Maynard, H. and Yvon, F. (2013). Traitement automatique des entités nommées en arabe: détection et traduction. Traitement Automatique des Langues, 54(2): 101-32.
- Gavin, M. and Gidal, E. (2017). Scotland’s Poetics of Space: An Experiment in Geospatial Semantics. Cultural Analytics. http://culturalanalytics.org/2017/11/scotlands-poetics-of-space-an-experiment-in-geospatial-semantics/ (accessed 30 April 2018).
- Khalifa, S. et al. (2016). A Large Scale Corpus of Gulf Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Portorož, Slovenia, pp. 4282-89.
- Pasha, A. et al. (2014). MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavík, Iceland, pp. 1094-1101.
- Romanov, M. (2015). BetaCode for Arabic. Al-Raqmiyyat. https://maximromanov.github.io/2015/02-07.html (accessed 30 April 2018).
- Shaalan, K. (2014). A Survey of Arabic Named Entity Recognition and Classification. Computational Linguistics, 40(2): 469-510.