Scalable detection of statistically significant communities and hierarchies,using message passing for modularity期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Scalable detection of statistically significant communities and hierarchies,using message passing for modularity

Authors:	Pan Zhang Cristopher Moore

Affiliation:	Santa Fe Institute, Santa Fe, NM, 87501

Abstract:	Modularity is a popular measure of community structure. However, maximizing the modularity can lead to many competing partitions, with almost the same modularity, that are poorly correlated with each other. It can also produce illusory ‘‘communities’’ in random graphs where none exist. We address this problem by using the modularity as a Hamiltonian at finite temperature and using an efficient belief propagation algorithm to obtain the consensus of many partitions with high modularity, rather than looking for a single partition that maximizes it. We show analytically and numerically that the proposed algorithm works all of the way down to the detectability transition in networks generated by the stochastic block model. It also performs well on real-world networks, revealing large communities in some networks where previous work has claimed no communities exist. Finally we show that by applying our algorithm recursively, subdividing communities until no statistically significant subcommunities can be found, we can detect hierarchical structure in real-world networks more efficiently than previous methods.Community detection, or node clustering, is a key problem in network science, computer science, sociology, and biology. It aims to partition the nodes in a network into groups such that there are many edges connecting nodes within the same group and comparatively few edges connecting nodes in different groups.Many methods have been proposed for this problem. These include spectral clustering, where we classify nodes according to the eigenvectors of a linear operator such as the adjacency matrix, the random walk matrix, the graph Laplacian, or other linear operators (1 –3); statistical inference, where we fit the network with a generative model such as the stochastic block model (4 –7); and a wide variety of other methods, e.g., refs. 8 –10. See ref. 11 for a review.We focus here on a popular measure of the quality of a partition, the modularity (e.g., refs. 8 and 12 –14). We think of a partition {t} into q groups as a function t:V → {1, …, q}, where t_i is the group to which node i belongs. The modularity of a partition {t} of a network with n nodes and m edges is defined as $Q ({t}) = \frac{1}{m} (\sum_{〈 i j 〉 \in ℰ} δ_{t_{i} t_{j}} - \sum_{〈 i j 〉} \frac{d_{i} d_{j}}{2 m} δ_{t_{i} t_{j}}) .$ [1]Here ? is the set of edges, d_i is the degree of node i, and δ is the Kronecker delta function. The modularity is proportional to the number of edges connecting nodes in the same community minus the expected number of such edges if the graph were random conditioned on its degree distribution, that is, the expectation in a null model where i and j are connected with probability proportional to d_id_j.However, maximizing over all possible partitions often gives a large modularity even in random graphs with no community structure (15 –18). Thus, maximizing the modularity can lead to overfitting, where the “optimal” partition simply reflects random noise. Even in real-world networks, the modularity often exhibits a large amount of degeneracy, with multiple local optima that are poorly correlated with each other and are not robust to small perturbations (19).Thus, we need to add some notion of statistical significance to our algorithms. One approach is hypothesis testing, comparing various measures of community structure to the distribution we would see in a null model such as Erdős–Rényi (ER) graphs (20 –22). However, even when communities really exist, the modularity of the true partition is often no higher than that of random graphs. In Fig. 1, we show partitions of two networks with the same size and degree distribution: an ER graph (Left) and a graph generated by the stochastic block model (Right), in the detectable regime where it is easy to find a partition correlated with the true one (5, 6). The true partition of the network in Fig. 1, Right has a smaller modularity than the partition found for the random graph in Fig. 1, Left. We can find a partition with higher modularity (and lower accuracy) in Fig. 1, Right, using, e.g., simulated annealing, but then the modularities we obtain for the two networks are similar. Thus, the usual approach of null distributions and P values for hypothesis testing does not appear to work.Open in a separate window Fig. 1.The adjacency matrices of two networks, partitioned to show possible community structure. Each blue point is an edge. (Left) The network is an ER graph, with no real community structure; however, a search by simulated annealing finds a partition with modularity 0.391. (Right) The network has true communities and is generated by the stochastic block model, but the true partition has modularity of just 0.333. Thus, illusory communities in random graphs can have higher modularity than true communities in structured graphs. Both networks have size $n = 5,000$ and a Poisson degree distribution with mean c = 3; the network at Right has c_out/c_in = 0.2, in the easily detectable regime of the stochastic block model.We propose to solve this problem with the tools of statistical physics. As in ref. 16, we treat the modularity as the Hamiltonian of a spin system. We define the energy of a partition {t} as E({t}) = ?mQ({t}) and introduce a Gibbs distribution as a function of inverse temperature β, P({t}) ∝ e^?βE({t}). Rather than maximizing the modularity by searching for the ground state of this system, we focus on its Gibbs distribution at a finite temperature, looking for many high-modularity partitions rather than a single one. In analogy with previous work on the stochastic block model (5, 6), we define a partition ${\hat{t}}$ by computing the marginals of the Gibbs distribution and assigning each node to its most likely community. Specifically, if $ψ_{t}^{i}$ is the marginal probability that i belongs to group t, then ${\hat{t}}_{i} = {argmax}_{t} ψ_{t}^{i}$ , breaking ties randomly if more than one t achieves the maximum. We call ${\hat{t}}$ the retrieval partition and call its modularity $Q ({\hat{t}})$ the retrieval modularity. We claim that ${\hat{t}}$ is a far better measure of significant community structure than the maximum-modularity partition. In the language of statistics, the maximum marginal prediction is better than the maximum a posteriori prediction (e.g., ref. 23). More informally, the consensus of many good solutions is better than the ‘‘best’’ single one (24, 25).We give an efficient belief propagation (BP) algorithm to approximate these marginals, which is derived from the cavity method of statistical physics. This algorithm is highly scalable; each iteration takes linear time on sparse networks if the number of groups is fixed, and it converges rapidly in most cases. It is optimal in the sense that for synthetic graphs generated by the stochastic block model, it works all of the way down to the detectability transition. It provides a principled way to choose the number of communities, unlike other algorithms that tend to overfit. Finally, by applying this algorithm recursively, subdividing communities until no statistically significant subcommunities exist, we can uncover hierarchical structure.We validate our approach with experiments on real and synthetic networks. In particular, we find significant large communities in some large networks where previous work claimed there were none. We also compare our algorithm with several others, finding that it obtains more accurate results, both in terms of determining the number of communities and in terms of matching their ground-truth structure.

Keywords:	networks community detection message-passing algorithms statistical significance phase transitions

设为首页 | 免责声明 | 关于勤云 | 加入收藏