Graphs, texts and Statistics

Submitted by Insmi on Fri, 12/16/2016 - 17:41

Since the early work of Moreno in the 1930s, graph analysis has become an intensive area of research that is no longer limited to sociology. Indeed, a wide range of data in our digital world now comes in the form of graphs, and impressive progress has been made in modelling and analysing such data. In a recent work, Charles Bouveyron and Pierre Latouche propose a new statistical method, called STBM (Stochastic Topic Block Model), which clusters the nodes of a network with textual edges while simultaneously recovering the main topics of discussion. One can, for example, analyse text exchanges between people in a social network, or email exchanges within a firm.

From a mathematical point of view, STBM generalizes the Stochastic Block Model (SBM), dedicated to node clustering, and Latent Dirichlet Allocation (LDA), dedicated to text analysis. STBM was used to analyse the emails of the firm Enron, which went through a highly publicized bankruptcy in the early 2000s. It found that the network consisted of 10 groups of people and 5 topics of discussion.
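
To make the combination of the two models more concrete, here is a minimal Python sketch of a generative process in the spirit of STBM: nodes are assigned to groups and connected with group-dependent probabilities (the SBM part), and the words of each message are drawn from topics whose proportions depend on the groups of the sender and the receiver (the LDA part). The function name, its parameters and the toy vocabulary are illustrative assumptions and do not come from the authors' code. The sketch only simulates data; it does not perform the inference carried out by STBM.

```python
import random


def sample_textual_network(n_nodes, group_probs, connect_probs, topic_mix,
                           topic_words, words_per_message=6, seed=0):
    """Sample a toy directed network whose edges carry short texts.

    All names are hypothetical, chosen for illustration only.
    group_probs[q]      : probability that a node belongs to group q
    connect_probs[q][r] : probability of an edge from group q to group r
    topic_mix[q][r]     : topic proportions for messages from group q to r
    topic_words[k]      : vocabulary of topic k (words drawn uniformly,
                          a simplifying assumption)
    """
    rng = random.Random(seed)
    n_groups = len(group_probs)
    n_topics = len(topic_words)

    # SBM part: each node gets a latent group.
    groups = rng.choices(range(n_groups), weights=group_probs, k=n_nodes)

    edges = {}
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue
            q, r = groups[i], groups[j]
            # SBM part: edge presence depends only on the two groups.
            if rng.random() < connect_probs[q][r]:
                # LDA part: one latent topic per word, then one word per topic.
                topics = rng.choices(range(n_topics), weights=topic_mix[q][r],
                                     k=words_per_message)
                edges[(i, j)] = [rng.choice(topic_words[k]) for k in topics]
    return groups, edges


groups, edges = sample_textual_network(
    n_nodes=8,
    group_probs=[0.5, 0.5],
    connect_probs=[[0.8, 0.1], [0.1, 0.8]],
    # Within-group messages use topic 0, between-group messages use topic 1.
    topic_mix=[[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],
    topic_words=[["gas", "price", "trade"], ["meeting", "lunch", "friday"]],
)
for (i, j), words in sorted(edges.items()):
    print(i, "->", j, "groups", (groups[i], groups[j]), ":", " ".join(words))
```

In this toy setting the topic of a message is fully determined by the groups of its endpoints, which is the kind of joint structure in nodes and texts that STBM is designed to recover from observed data.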

Figure 1 gives a visualization of the groups (by the color of the nodes) together with the topics (by the color of the links).

Figure 1: STBM analysis of the Enron email network.

Alongside the expected topics related to the firm's activity, STBM highlighted topics 2 and 3, which turn out to be two aspects of the Enron scandal, namely the relationship between Enron, the White House and the Taliban, and Enron's involvement in the bankruptcy of the Edison company.

Figure 2: The most frequently used words in each of the 5 topics of discussion.

Have a go at the Interactive Results for the Enron Email Network online.

Link to the detailed French version of the article


Reference:
C. Bouveyron, P. Latouche and R. Zreik, The Stochastic Topic Block Model for the Clustering of Networks with Textual Edges, Statistics and Computing, in press, 2017.

Contacts:
Charles Bouveyron | Mathématiques Appliquées à Paris 5 (MAP5) | UMR 8145 | CNRS & Université Paris Descartes.
Pierre Latouche | Laboratoire Statistique, Analyse, Modélisation Multidisciplinaire (SAMM) | EA 4543 | Université Paris 1 (Panthéon-Sorbonne).