An Efficient Sampling Scheme For Comparison of Large Graphs

author: Karsten Michael Borgwardt, Max Planck Institute for Biological Cybernetics, Max Planck Institute
published: Sept. 5, 2007,   recorded: August 2007,   views: 3804
Categories

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.

Description

As new graph structured data is being generated, graph comparison has become an important and challenging problem in application areas such as molecular biology, telecommunications, chemoinformatics, and social networks. Graph kernels have recently been proposed as a theoretically sound approach to this problem, and have been shown to achieve high accuracies on benchmark datasets. Different graph kernels compare different types of subgraphs in the input graphs. So far, the choice of subgraphs to compare is rather ad-hoc and is often motivated by runtime considerations. There is no clear indication that certain types of subgraphs are better than the others. On the other hand, comparing all possible subgraphs has been shown to be NP-hard, thus making it practically infeasible. These difficulties seriously limit the practical applicability of graph kernels. In this article, we attempt to rectify the situation, and make graph kernels applicable for data mining on large graphs and large datasets. Our starting point is the matrix reconstruction theorem, which states that any matrix of size 5 or above can be reconstructed given all its principal minors. By applying this to the adjacency matrix of a graph, we recursively define a graph kernel and show that it can be efficiently computed by using the distribution of all size 4 subgraphs of a graph. This distribution, we argue, is similar to a sufficient statistic of the graph, especially when the graph is large. Exhaustive enumeration of these subgraphs is prohibitively expensive, scaling as O(n4). But, by bounding the deviation of the empirical estimates of the distribution from the true distribution, it suffices to sample a fixed number of subgraphs. Incidentally, our bounds are stronger than those found in the bio-informatics literature for similar techniques. In our experimental evaluation, our graph kernel outperforms state-of-the-art graph kernels both in times of time and classification accuracy.