# Graph Clustering

### Project Summary

We explore the problems of large-scale graph clustering and overlapping community detection. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. Finding good clusters in a graph is an important task in network analysis. In this project, we both tackle the traditional graph clustering problem where the clusters are pairwise disjoint, and the overlapping community detection problem where the clusters are allowed to be overlapped with each other.### Project Description

**1. Scalable and Memory-Efficient Clustering of Massive Graphs**

A popular class of graph clustering algorithms for large-scale networks, such as PMetis, KMetis and Graclus, is based on a multilevel framework. Generally, these multilevel algorithms work reasonably well on networks with a few million vertices. However, when the network size increases to the scale of 10 million vertices or greater, the performance of these algorithms rapidly degrades. An inherent property of social networks, the power law degree distribution, makes these algorithms infeasible to apply to large-scale social networks. We propose a scalable and memory-efficient clustering algorithm, GEM, which efficiently extracts a good skeleton graph from the original graph, and propagates the clustering result of the extracted graph to the rest of the network. Experimental results show that GEM produces clusters of quality comparable to or better than existing state-of-the-art graph clustering algorithms, while it is much faster and consumes much less memory. Furthermore, the parallel implementation of GEM, called PGEM, not only produces higher quality of clusters but also achieves much better scalability than most current parallel graph clustering algorithms.

**2. Overlapping Community Detection Using Seed Set Expansion**

We also tackle the problem of overlapping community detection where clusters (also referred to as communities) are allowed to be overlapped with each other. In many real social and information networks, the clusters naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. One of the most successful techniques for finding overlapping communities is based on local optimization and expansion of a community metric around a seed set of vertices. We propose an efficient overlapping community detection algorithm using a seed set expansion approach. In particular, we develop new seeding strategies for a personalized PageRank scheme that optimizes the conductance community score. The key idea of our algorithm is to find good seeds, and then expand these seed sets using the personalized PageRank clustering procedure. Experimental results show that this seed set expansion approach outperforms other state-of-the-art overlapping community detection methods. We also show that our new seeding strategies are better than previous strategies, and are thus effective in finding good overlapping clusters in a graph.

### Main Publications

- Overlapping Community Detection Using Seed Set Expansion (pdf, slides)

J. Whang, D. Gleich, I. Dhillon.

In*ACM Conference on Information and Knowledge Management (CIKM)*, pp. 2099-2108, October 2013. - Scalable and Memory-Efficient Clustering of Large-Scale Social Networks (pdf, slides)

J. Whang, X. Sui, I. Dhillon.

In*IEEE International Conference on Data Mining (ICDM)*, pp. 705-714, December 2012. - Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing (pdf, slides)

J. Whang, P. Rai, I. Dhillon.

In*IEEE International Conference on Data Mining (ICDM)*, pp. 817 - 826, December 2013. - Parallel Clustered Low-rank Approximation of Graphs and Its Application to Link Prediction (pdf)

X. Sui, T. Lee, J. Whang, B. Savas, S. Jain, K. Pingali, I. Dhillon.*International Workshop on Languages and Compilers for Parallel Computing (LCPC)*, pp. 76-95, October 2012. - Scalable Clustering of Signed Networks using Balance Normalized Cut (pdf)

K. Chiang, J. Whang, I. Dhillon.

In*ACM Conference on Information and Knowledge Management (CIKM)*, pp. 615-624, October 2012.