Text Mining Methods to Solve Organic Chemistry Problems, or Topic Modeling Applied to Chemical Molecules
The Renaissance Humanism was probably the last moment in the history of ideas when the development of exact sciences was shaped according to the intellectual paradigms of the humanities (the Liberal Arts, to be precise). After the advent of the Scientific Revolution in the 17th century – with its empiricism, experimental reasoning, mathematical apparatus, and so forth – the exact sciences became the point of reference for all the other disciplines, in terms of scientific inference and its methodology. The imbalance between the humanities and the sciences has been growing ever since. Nowadays, statistical analysis is routinely applied in social sciences, cognitive linguistics tries to take advantage of the fMRI technology, text analysis studies are overwhelmed by numerous machine-learning techniques, ranging from hierarchical cluster analysis to Support Vector Machines classification and Deep Learning. The exact sciences have affected the humanities to a considerable extent, but at the same time they continue to be rather resistant to any methodological inspirations coming from the “soft” scholarship. This study is an example of such a reversed influence, since we propose to apply text mining methods to study chemical molecules. Arguably, the phrase “If an atom is a letter, then a molecule is a word”, even if popular in chemistry, sounds rather naïve for anyone who has some expertise in linguistics. Nonetheless, despite a shallow similarity between language structures and organic chemistry at first glance, the methodology developed in text mining proves very promising as a way to discover internal molecule structures.
2. The problem
One of the biggest issues in contemporary organic chemistry is an enormous number of different molecules and their fragments that play role in chemical reactions. To cut a long story short: any reaction involves certain changes in molecules’ structures, which usually means that certain bonds are disjoined, and particular atoms change their positions within each molecule. On theoretical grounds, these changes can be predicted and/or controlled. In practice, however, predicting optimal bond cuts requires high-level expert knowledge, due to the extreme complexity of the problem, or an enormous computer power to run brute-force combinatoric algorithms. This is, however, still far beyond our capabilities, because completing a task that involves testing billions of billions of combinations would require decades if not centuries. For that reason, the big question at stake is how to optimize the entire process of identifying relevant molecule substructures (Ruddigkeit et al., 2012).
Splitting complex chemical molecules into “meaningful” substructures is the first problem to be overcome. In this context, “meaningful” means groups of atoms that are local centers of reactions. The nature of bonds between atoms is very well understood since the first half of the 20th century. However, it is still unclear why certain clusters of atoms tend to keep together while reprehend some other groups. Being one of the most crucial issues in organic chemistry, this question has been approached using different methods, which are aimed at finding repetitive fragments of molecules. It can be assumed that methods derived from text mining can be adopted to (partially) solve the task.
3. Chemical “words”
Let us assume that a molecule is a sentence (with some obvious caveats in mind, non-linearity of molecules being the most important one). If so, then a list of known molecules can be considered a corpus. Quite striking is the fact that a commonly used convention of describing chemical structures (referred to as SMILES) uses sequences of characters, what makes any comparisons to corpora even more natural. E.g., caffeine is coded as follows: CN1C=NC2=C1C(=O)N(C(=O)N2C)C.
To make the language–chemistry parallel complete, one has to define “words” as well, keeping in mind that there are no explicit substructure boundaries in molecules. To this end, we adopt the idea of Cadeddu et al. (2014), who compared a few thousands of molecules pairwise, in order to extract their maximum common substructures, with the belief that they represent chemical “words”; this step was followed by a term frequency–inverse document frequency (tf/idf) heuristic.
Using the above idea of extracting “words”, we picked randomly 50,000 reactions from the Reaxys database ( www.reaxys.com), and computed the pairwise comparison, resulting in a corpus of >800,000 word types and 2.5 * 10 9 tokens. Interestingly enough, the chemical “words” share the characteristics of a typical natural language, e.g. they follow the Zipf’s law, but they also exhibit the behavior of function and content words in their relation to frequency (see Fig. 1). Moreover, the chemical “words” can be subject to time-proven text mining methods such as keywords analysis, as has been demonstrated in our previous study (Woźniak et al., 2018).
4. Topic modeling
In order to identify any relations between chemical “words”, we analyzed our corpus using topic modeling (Blei et al., 2003), a technique that attracted a good share of attention in Digital Humanities, but has never been popular beyond text-centric applications. Topic modeling belongs to a group of distributional semantics methods, which are based on a general assumption that the meaning of a word is defined by its lexical context (Firth, 1962). In its extended form, the distributional hypothesis says that the degree of semantic similarity between words can be modeled as a function of the degree of overlap among their linguistic contexts (Miller and Charles, 1991; Baroni and Lenci, 2010). Topic modeling, usually computed via the LDA algorithm (Blei et al., 2003) assumes the “bag-of-words” type of context, which means that the sequence of words in a sentence is irrelevant. This feature allows for computing chemical “words”, which, essentially, do not follow any linear sequence.
We trained a few models ranging from 50 to 200 topics, using the LDA technique. Therefore, we were able to substantially reduce the enormous number of >800,000 “word” types into a small number of word constellations (topics) that contain meaningful information about co-occurring chemical fragments. One of the topics is shown in Fig. 2. Among the 24 most distinctive “words” one can recognize some amines, fragments of aromatic rings, fragments containg carboxyl functional groups, and so on. Inspected by trained practitioners in organic chemistry, the topics revealed several collocations that seemed meaningful, and could not have been identified in the original (raw) collection of molecules. Despite the intuitive interpretation via close-reading, however, such an outcome inevitably leads to a more serious question, namely if one can define meaning in organic chemistry, in the context of distributional semantics.
Interesting as they are, the chemical topics cannot solve any real-life problem per se, even if they seem to be meaningful from the naked eye’s perspective (note that the same holds for topic modeling based on texts). Specifically, one cannot discover any general structure of, say, natural products by manual inspection of their prominent topics, nor can one predict if a given substance is likely to be toxic. There is a plethora of similar classification (or prediction) tasks where topics might prove useful, provided that the analysis goes beyond the close-reading perspective. If the topics’ proportions are indeed significantly different across the corpus – i.e. if they really keep some information about semantic differentiation between the molecules – they should be applicable as a set of input features for machine-learning classification.
To test this hypothesis, we designed a controlled experiment on a (somewhat artificial) problem of classifying molecules as potential drugs. Again, we used the same Reaxys database to extract relevant training material: 1,800 known drugs and a similar number of known non-drugs. Our two-class supervised setup involved a simple neural network (implemented via Keras with Tensorflow backend), the input layer being the most probable topics for each chemical molecule. The final results varied depending on a topic model used for prediction, nevertheless they turned out to be fairly optimistic. The best accuracy was: 0.7851 (the model for 200 topics), the worst: 0.7135 (the model for 50 topics). Even if preliminary, these results suggest that some semantic information can be indeed extracted from chemical corpora using text mining algorithms.
This research is part of project UMO-2014/12/W/ST5/00592, supported by Poland’s National Science Centre.
- Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.
- Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 993–1022.
- Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. and Grzybowski, B. (2014). Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie, 126(31): 8246–50.
- Firth, J. R. (1962). A synopsis of linguistic theory 1930-55. In Firth, J. R., Studies in Linguistic Analysis. Oxford: Blackwell.
- Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1): 1–28.
- Ruddigkeit, L., Deursen, R. van, Blum, L. C. and Reymond, J.-L. (2012). Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11): 2864–75.
- Woźniak, M., Wołos, A., Modrzyk, U., Górski, R. L., Winkowski, J., Bajczyk, M., Szymkuć, S., Grzybowski, B. and Eder, M. (2018). Linguistic measures of chemical diversity and the ‘keywords’ of molecular collections. Scientific Reports, 8: forthcoming.