# Divide & Conquer Methods for Big Data Analytics

### Project Summary

Solving large-scale machine learning problems has become increasingly important. Many state-of-the-art approaches for such problems rely on numerical optimization; however, because of scalability issues, classical optimization methods usually cannot be applied directly to large-scale problems. In this project, we apply a divide-and-conquer scheme to handle big data. In the divide step, the large-scale problem is decomposed into several smaller subproblems, each defined only on a subset of the data, which can be solved efficiently. The solutions to the subproblems are then combined to give a solution to the original problem. Developing a divide-and-conquer algorithm is usually nontrivial: a good algorithm should partition the data or variables so that (1) the subproblems can be solved efficiently, and (2) the solutions to the subproblems can easily be combined into a solution of the original problem. We demonstrate the efficiency of the divide-and-conquer scheme on several important machine learning topics: classification, kernel approximation, and link prediction.

### Project Description

1. A Divide-and-Conquer Solver for Kernel Support Vector Machines

Illustration of kernel SVM

The kernel support vector machine (SVM) is one of the most widely used classification methods; however, the amount of computation required becomes the bottleneck when facing millions of samples. We propose and analyze a novel divide-and-conquer solver for kernel SVMs (DC-SVM).

• Divide Step: In the divide step, we partition the kernel SVM problem into smaller subproblems by clustering the data, so that each subproblem can be solved independently and efficiently. Guided by our theoretical analysis, we run kernel k-means on a subsample of the data to obtain the partition.
• Conquer Step: In the conquer step, the local solutions from the subproblems are used to initialize a global coordinate descent solver, which then converges quickly, as suggested by our analysis.
• Early Termination: We can stop at the l-th level and predict with the local models, yielding faster training and prediction; the resulting accuracy is usually close to that of the global SVM solution.
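The steps above can be sketched in a few dozen lines of NumPy. This is a minimal single-level illustration under simplifying assumptions, not the released DC-SVM code: it uses plain k-means in place of kernel k-means on subsamples, a bare-bones dual coordinate descent solver without a bias term, and illustrative function names.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    return np.exp(-gamma * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

def svm_dual_cd(K, y, C=1.0, alpha=None, iters=50):
    """Dual coordinate descent for kernel SVM (bias term omitted for brevity)."""
    alpha = np.zeros(len(y)) if alpha is None else alpha.copy()
    for _ in range(iters):
        for i in range(len(y)):
            grad = 1.0 - y[i] * ((alpha * y) @ K[i])         # dual gradient
            alpha[i] = np.clip(alpha[i] + grad / K[i, i], 0.0, C)
    return alpha

def kmeans(X, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization
    (the actual method runs kernel k-means on a subsample)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def dc_svm(X, y, k=2, C=1.0, gamma=1.0):
    labels, centers = kmeans(X, k)
    alpha = np.zeros(len(y))
    for j in range(k):                        # divide: independent sub-SVMs
        idx = np.where(labels == j)[0]
        Kj = rbf_kernel(X[idx], X[idx], gamma)
        alpha[idx] = svm_dual_cd(Kj, y[idx], C)
    K = rbf_kernel(X, X, gamma)               # conquer: warm-started global solve
    return svm_dual_cd(K, y, C, alpha=alpha, iters=20), labels, centers

def predict_early(Xtest, X, y, alpha, labels, centers, gamma=1.0):
    """Early prediction: each test point uses only its nearest cluster's model."""
    out = np.empty(len(Xtest))
    for t, x in enumerate(Xtest):
        j = ((centers - x) ** 2).sum(-1).argmin()
        idx = np.where(labels == j)[0]
        Kx = rbf_kernel(X[idx], x[None, :], gamma)[:, 0]
        out[t] = np.sign((alpha[idx] * y[idx]) @ Kx)
    return out
```

In the full solver the divide step is applied recursively to build the multilevel hierarchy, and early prediction at level l uses the local models instead of waiting for the global conquer step to finish.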

Illustration of DC-SVM with early termination. Local models are combined to give the prediction.

Fast Training. By extending this idea, we develop a multilevel divide-and-conquer SVM algorithm that outperforms state-of-the-art methods in terms of training speed, testing accuracy, and memory usage. As an example, on the covtype dataset with half a million samples, DC-SVM is 7 times faster than LIBSVM in obtaining the exact SVM solution (to within $10^{-6}$ relative error), which achieves 96.15% prediction accuracy. Moreover, with our proposed early prediction strategy, DC-SVM achieves about 96% accuracy in only 12 minutes, which is more than 100 times faster than LIBSVM.

Timing results demonstrate that Divide & Conquer SVM (DC-SVM) is several orders of magnitude faster than other kernel SVM solvers.

Fast Prediction. We further improve the prediction speed of kernel machines by combining DC-SVM with the “pseudo landmark points” technique, which reduces the prediction error without increasing the prediction cost. As a result, on the covtype dataset with half a million samples, DC-SVM requires only 10 minutes of training time and just 22 inner products to predict each sample (100 times faster than state-of-the-art kernel SVM solvers in both training and prediction time) while achieving near-optimal prediction accuracy.

Illustration of the DC-SVM algorithm with the new “pseudo landmark point” approach.

Comparison of kernel SVM algorithms in terms of both training and prediction time. DC-SVM is faster than the other algorithms and achieves better prediction accuracy.

2. Memory Efficient Kernel Approximation

The scalability of kernel machines is a major challenge when facing millions of samples, due to the storage and computation costs of large kernel matrices, which are usually dense. Recently, many papers have suggested tackling this problem with a low-rank approximation of the kernel matrix. In this work, we first observe that the structure of shift-invariant kernels changes from low-rank to block-diagonal (without any low-rank structure) as the scale parameter varies. Based on this observation, we propose a new kernel approximation algorithm, Memory Efficient Kernel Approximation (MEKA), which exploits both the low-rank and the clustering structure of the kernel matrix.
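The block low-rank idea can be sketched as follows in NumPy (function names are illustrative): each diagonal block K_jj of the clustered kernel matrix gets its own low-rank basis G_j, and small link matrices L_ij couple the bases across off-diagonal blocks. Note this is a simplified sketch: the actual MEKA fits the link matrices from a few sampled entries, whereas here we project the densely formed off-diagonal blocks onto the cluster bases.

```python
import numpy as np

def rbf(X, Y, gamma):
    """Gaussian (RBF) kernel matrix."""
    return np.exp(-gamma * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

def meka_sketch(X, labels, gamma, rank):
    """Block low-rank kernel approximation: per-cluster bases + link matrices."""
    k = labels.max() + 1
    idx = [np.where(labels == j)[0] for j in range(k)]
    G = []
    for j in range(k):                     # low-rank basis of each diagonal block
        Kjj = rbf(X[idx[j]], X[idx[j]], gamma)
        _, V = np.linalg.eigh(Kjj)         # eigh: eigenvalues in ascending order
        G.append(V[:, -rank:])             # keep the top-`rank` eigenvectors
    # link matrices couple the cluster bases across off-diagonal blocks
    L = [[G[i].T @ rbf(X[idx[i]], X[idx[j]], gamma) @ G[j]
          for j in range(k)] for i in range(k)]
    return idx, G, L

def reconstruct(n, idx, G, L):
    """Assemble the dense approximation K_ij ~= G_i L_ij G_j^T (for checking only)."""
    K = np.zeros((n, n))
    for i in range(len(idx)):
        for j in range(len(idx)):
            K[np.ix_(idx[i], idx[j])] = G[i] @ L[i][j] @ G[j].T
    return K
```

Only the cluster bases and the small link matrices need to be stored, which is far cheaper than the dense n-by-n kernel matrix.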

Illustration of our proposed Memory Efficient Kernel Approximation.

We show that the resulting algorithm outperforms state-of-the-art low-rank kernel approximation methods in terms of speed, approximation error, and memory usage. As an example, on the MNIST dataset with two million samples, our method takes 550 seconds on a single machine, using less than 500 MB of memory, to achieve 0.2313 test RMSE for kernel ridge regression, while the standard Nyström approximation takes more than 2700 seconds and uses more than 2 GB of memory on the same problem to achieve 0.2318 test RMSE.

Comparison of the approximation error of different kernel approximation approaches. The results show that MEKA achieves significantly smaller kernel approximation error by exploiting both low-rank and block structure.

3. Multi-Scale Spectral Decomposition

Computing the $k$ dominant eigenvalues and eigenvectors of massive graphs is a key operation in numerous machine learning applications; however, popular solvers suffer from slow convergence, especially when $k$ is reasonably large. In this project, we propose and analyze a novel multi-scale spectral decomposition method (MSEIGS), which first clusters the graph into smaller clusters whose spectral decomposition can be computed efficiently and independently.

Overview of MSEIGS. The graph is first clustered into smaller clusters whose spectral decompositions can be computed efficiently and independently; the clusters' eigenvectors then serve as good initializations for computing the spectral decomposition of the original graph.

We show, both theoretically and empirically, that the union of all clusters' subspaces has significant overlap with the dominant subspace of the original graph, provided the graph is clustered appropriately. Thus, the eigenvectors of the clusters serve as good initializations for a block Lanczos algorithm that computes the spectral decomposition of the original graph. We further use hierarchical clustering to speed up the computation and adopt a fast early termination strategy to obtain high-quality approximations.
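A simplified NumPy sketch of this pipeline follows, with orthogonal (subspace) iteration plus Rayleigh–Ritz extraction standing in for the block Lanczos solver, and illustrative function names; the cluster labels are assumed given rather than produced by the hierarchical clustering step.

```python
import numpy as np

def topk_eig(A, k):
    """Top-k eigenpairs of a symmetric matrix, ordered by eigenvalue magnitude."""
    w, V = np.linalg.eigh(A)
    order = np.argsort(-np.abs(w))[:k]
    return w[order], V[:, order]

def mseigs_sketch(A, labels, k, iters=10):
    n = A.shape[0]
    cols = []
    for j in range(labels.max() + 1):        # divide: per-cluster eigenvectors
        idx = np.where(labels == j)[0]
        _, Vj = topk_eig(A[np.ix_(idx, idx)], min(k, len(idx)))
        E = np.zeros((n, Vj.shape[1]))
        E[idx] = Vj                          # embed into the full graph
        cols.append(E)
    Q, _ = np.linalg.qr(np.hstack(cols))     # initial subspace from the clusters
    for _ in range(iters):                   # conquer: orthogonal iteration
        Q, _ = np.linalg.qr(A @ Q)
    T = Q.T @ A @ Q                          # Rayleigh-Ritz extraction
    w, S = np.linalg.eigh(T)
    order = np.argsort(-np.abs(w))[:k]
    return w[order], Q @ S[:, order]
```

Because the cluster subspace already overlaps the dominant subspace, only a few refinement iterations are needed, which is the source of the speedup over starting from a random subspace.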

Approximation results for the k dominant eigenvectors, plotting time vs. the average cosine of the principal angles. For a given time budget, MSEIGS consistently yields better results than the other methods.

Shared-memory multi-core results, plotting the number of cores vs. the time to reach a similar approximation quality. MSEIGS achieves almost linear speedup and outperforms the other methods.

Our method outperforms widely used solvers in terms of convergence speed and approximation quality. Furthermore, it is naturally parallelizable and exhibits significant speedups in shared-memory parallel settings. For example, on a graph with more than 82 million nodes and 3.6 billion edges, MSEIGS takes less than 3 hours on a single core to obtain a good approximation of the top-50 eigenvectors, while randomized SVD takes more than 6 hours to reach a similar quality. Using 16 cores, we reduce this time to less than 40 minutes.

An important problem in social network analysis is proximity estimation, which infers the “closeness” of different users. Proximity measures quantify the interaction between users based on structural properties of the graph, such as the number of common friends. A key application of proximity measures in social networks is link prediction, a core problem in network analysis. Effective proximity measures, such as path-based methods (e.g., Katz or rooted PageRank), are well known for their high computational complexity and memory usage. One approach is to perform dimensionality reduction on the original graph and then compute the proximity from its low-rank approximation.
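As a small illustration of this low-rank approach, the Katz measure, the geometric series sum over t >= 1 of beta^t A^t, has a closed form under a rank-k eigendecomposition of a symmetric adjacency matrix A, namely V diag(beta * w / (1 - beta * w)) V^T, provided beta times the largest eigenvalue magnitude is below 1 so the series converges. A sketch (the function name is illustrative):

```python
import numpy as np

def katz_lowrank(A, beta, k):
    """Rank-k Katz proximity: sum_{t>=1} beta^t A^t approximated via the
    top-k eigenpairs of the symmetric adjacency matrix A.
    Assumes beta * max|eigenvalue| < 1 so the geometric series converges."""
    w, V = np.linalg.eigh(A)
    order = np.argsort(-np.abs(w))[:k]       # keep dominant eigenpairs
    w, V = w[order], V[:, order]
    # closed form of the geometric series applied to each eigenvalue
    return (V * (beta * w / (1.0 - beta * w))) @ V.T
```

Link prediction then ranks the currently non-adjacent node pairs by their proximity scores; only the k eigenpairs need to be stored, rather than the dense proximity matrix.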