Integrating Latent Dirichlet Allocation and Poisson Graphical Model: A Deep Dive into the Writings of Chen Duxiu, Co-Founder of the Chinese Communist Party.

Anne Shen Chao, Rice University, United States of America; Qiwei Li, University of Texas Southwestern Medical Center; Zhandong Liu, Baylor College of Medicine

Chen Duxiu (1879-1942) co-founded the Chinese Communist Party (CCP) in 1920 and served as its secretary general from 1921 to 1927. He was a prolific author and a cultural rebel whose writings transformed the intellectual and social landscape of 20th-century China. Yet from 1904 to about 1919, Chen advocated Western democracy and Social Darwinism as solutions to save China. His turn to communism was abrupt, and many historians have credited it to the influence of his colleague and fellow co-founder of the CCP, Li Dazhao (1888-1927). Both Li and Chen had studied in Japan, where, through their interaction with Japanese socialists and fellow students, they became acquainted with the literature on socialism and anarchism. Some say that Li was the theoretician who understood Bolshevism and Marxism in depth, while Chen did not become well-versed in Marxism until he founded the CCP (Yoshihiro, 2013).

In this paper, we applied topic modeling (Blei, 2012) to a select number of Chen’s and Li’s published articles, in an attempt to detect the differences, if any, in their interpretations of socialism, Marxism, communism and Bolshevism. We integrated two well-developed statistical methodologies, Latent Dirichlet Allocation (LDA) and the Poisson Graphical Model (PGM), to probe in finer detail the broad themes in Chen’s 892 essays, letters, and occasional poems, which comprise a total of 1,347,699 Chinese characters. Based on the word counts per topic, we then used the PGM to study the associations among the topics. Because the PGM estimates the dependence between two topics conditional on all the others, it reduces misleading inferences caused by confounding variables and yields a more concise network of topics.
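For readers unfamiliar with the PGM, its usual textbook form can be written through node-conditional distributions. This is a sketch in assumed notation, not taken from the source: X_s is the count of words from topic s in a document, and theta_s, theta_st are the model parameters.

```latex
P\left(X_s = x_s \mid X_t = x_t,\ t \neq s\right) \;\propto\;
\exp\!\Big( \theta_s x_s + \sum_{t \neq s} \theta_{st}\, x_s x_t - \log(x_s!) \Big)
```

Under this form, each topic count is conditionally Poisson with log-mean \theta_s + \sum_{t \neq s} \theta_{st} x_t, and topics s and t are joined by an edge exactly when \theta_{st} \neq 0.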

Specifically, we chose 263 articles written by Chen Duxiu and 53 written by Li Dazhao that contain words related to Marxism, socialism, Bolshevism, and communism (Ren, 2008; Li, 1984). (Both selections covered the length of each man’s publishing career; Chen passed away at age 63, while Li was executed at age 39.) A document-term matrix (bag-of-words data) was generated from the preprocessed text. Next, we carefully selected a set of seed words for each of the K topics of interest. We then applied LDA to the bag-of-words data to find the remaining words associated with each topic, so that each estimated topic could be interpreted from its top-ranking terms. From the document-term matrix we then generated a new document-topic matrix by summing, for each document, the counts of the top words belonging to the same topic. Finally, we applied the Poisson Graphical Model to the document-topic matrix to infer conditional independence between each pair of topics. The resulting graph is a network visualization in which each node represents a topic and each edge indicates a conditional dependency: two topics linked by an edge remain correlated even after adjusting for all the other topics in the corpus.
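The last two steps of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the seeded LDA step is omitted and each topic's top words are taken as given; the graph is estimated by neighborhood selection with L1-penalized Poisson regressions, which is one common way to fit a Poisson graphical model (the source does not say which estimator was used). The function names, the penalty `lam`, and the toy English tokens standing in for segmented Chinese words are all hypothetical.

```python
import numpy as np

def doc_topic_counts(dtm, vocab, topic_top_words):
    """Per document, sum the counts of each topic's top-ranked words."""
    col = {w: j for j, w in enumerate(vocab)}
    return np.column_stack(
        [dtm[:, [col[w] for w in words if w in col]].sum(axis=1)
         for words in topic_top_words]
    )

def l1_poisson_regression(X, y, lam, n_iter=300):
    """L1-penalized Poisson regression (log link), fitted by proximal
    gradient descent with backtracking. Returns the slope coefficients."""
    X = X - X.mean(axis=0)                 # centering only shifts the intercept
    n, p = X.shape
    b0, beta = np.log(max(y.mean(), 1e-8)), np.zeros(p)

    def nll(b0, beta):
        eta = b0 + X @ beta
        return np.mean(np.exp(eta) - y * eta)

    step = 1.0
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_iter):
            resid = np.exp(b0 + X @ beta) - y
            g0, g = resid.mean(), X.T @ resid / n
            f = nll(b0, beta)
            for _halving in range(60):     # backtracking line search
                nb0 = b0 - step * g0
                z = beta - step * g
                # soft-thresholding is the proximal step for the L1 penalty
                nbeta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
                d0, d = nb0 - b0, nbeta - beta
                bound = f + g0 * d0 + g @ d + (d0 ** 2 + d @ d) / (2 * step)
                if nll(nb0, nbeta) <= bound + 1e-12:
                    break
                step *= 0.5
            b0, beta = nb0, nbeta
    return beta

def poisson_graph(counts, lam):
    """Neighborhood selection: regress each topic on all the others and link
    two topics if either regression gives a nonzero coefficient."""
    counts = np.asarray(counts, dtype=float)
    k = counts.shape[1]
    coef = np.zeros((k, k))
    for s in range(k):
        others = [t for t in range(k) if t != s]
        coef[s, others] = l1_poisson_regression(counts[:, others], counts[:, s], lam)
    nonzero = np.abs(coef) > 1e-8
    return nonzero | nonzero.T             # symmetric boolean adjacency matrix

# Toy usage: 3 documents, 4 vocabulary words, 2 topics (all made up).
vocab = ["revolution", "class", "world", "history"]
dtm = np.array([[3, 2, 0, 1], [1, 1, 4, 2], [0, 0, 2, 3]])
topics = [["revolution", "class"], ["world", "history"]]
counts = doc_topic_counts(dtm, vocab, topics)   # document-topic matrix
adjacency = poisson_graph(counts, lam=0.5)      # True where two topics are linked
```

A larger `lam` gives a sparser graph; in practice the penalty would be chosen by a data-driven criterion rather than fixed by hand as it is here.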

The results yield several initial observations. Chen used a smaller vocabulary, repeating the same words over and over again to emphasize a point, while Li adopted a more discursive style with fewer repetitions of the same word. Chen used many more verbs (such as “agitate,” “struggle,” “unite,” “lead,” “develop,” “carry out”), thereby exhorting his readers to action, while Li tended to use descriptive words. Chen focused on the present by analyzing different political groups: “Guomindang,” “warlords,” “proletariat,” “bourgeoisie,” “military,” “students,” “masses” and “imperialists.” Li painted a larger scenario by using words such as “world,” “humanity,” “philosophy,” “phenomenon,” “relationship,” “history” and “religion.” The general conclusion at this early stage of analysis is that Chen urged his readers to put into action his plans to bring China under communism, while Li tended to explain to his readers the nature of Bolshevism and Marxism.

More interestingly, these calculations yielded “orbits” of vocabulary for each man’s important ideas. For instance, the word “revolution” appeared three times in the eight topics that we studied for Chen. In the first sub-topic, “revolution” appeared with words such as “class,” “bourgeoisie,” “proletariat,” “develop,” “strength,” and “movement.” In the second sub-topic, “revolution” appeared alongside “peasants,” “bourgeoisie,” “proletariat,” “lead,” “China,” “masses,” “movement,” and “action.” In the third sub-topic, “revolution” appeared with “bourgeoisie,” “proletariat,” “struggle,” “China,” “Guomindang,” and “movement.” When Li discussed “revolution,” which appeared twice in the four topics that we studied for him, he often used words such as “people,” “Russia,” “movement,” “government,” “masses,” “future,” and “China.” Although the general trend of each man’s writing is clear from a casual browsing of these articles, this method of calculation demonstrates in a quantitative manner the qualitative interdependence of topics, and diagrams in an easy-to-read manner the network configuration of each man’s vocabulary.

Appendix A

  1. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4): 77-84.
  2. Li, D. (1984). Li Dazhao wenji [A literary collection of Li Dazhao]. Beijing: Renmin chubanshe.
  3. Ren, J., ed. (2008). Chen Duxiu zhuzuo xuanbian [A selected collection of Chen Duxiu’s writing]. 6 vols. Shanghai: Shanghai renmin chubanshe.
  4. Yoshihiro, I. (2013). Formation of the Chinese Communist Party. Trans. J. Fogel. New York: Columbia University Press.