event thumbnail image
NIPS '07 Workshop on Efficient Machine Learning
Pascal

Large Scale Learning with String Kernels

author: Sören Sonnenburg, Intelligent Data Analysis Group, Fraunhofer FIRST

Description

In applications of bioinformatics and text processing, such as splice site recognition and spam detection, large amounts of training sequences are available and needed to achieve sufficiently high prediction performance on classification or regression tasks. Although kernel-based methods such as SVMs often achieve state-of-the-art results, training and evaluation times may be prohibitively large. When single kernel computation time is already linear (w.r.t. the input sequences) it seems difficult to achieve further speed ups. In this work we describe an efficient technique for computing linear combinations of string kernels using sparse data structures such as explicit maps, sorted arrays and suffix tries, trees or arrays [5]. As computing linear combinations of kernels make up the dominant part of SVM training and evaluation, speeding up their computation is essential. Considering the recently proposed and successfully used linear time string kernels, like the Spectrum kernel [2] and the Weighted Spectrum kernel [3] we show that one can accelerate SVM training by factors of 7 and 60 times, respectively, while requiring considerably less memory. Our method allows us to train string kernel SVMs on sets as large as 10 million sequences [4]. Moreover, using these techniques the evaluation on new sequences is often several thousand times faster, allowing us to apply the classifiers on genome-sized data sets with seven billion test examples [6]. The presented algorithms are implemented in our Machine Learning toolbox SHOGUN for which the source code is publicly available at http://www.shogun-toolbox.org.
References: [1] D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge University Press, 1997. [2] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, Kaua’i, Hawaii, 2002. [3] G. R¨atsch and S. Sonnenburg. Accurate Splice Site Prediction for Caenorhabditis Elegans, pages 277–298. MIT Press series on Computational Molecular Biology. MIT Press, 2004. [4] S. Sonnenburg, P. Philips, G. Schweikert, and G. R¨atsch. Accurate splice site prediction using support vector machines. BMC Bioinformatics, 8, 2007. [5] S. Sonnenburg, G. R¨atsch, and K. Rieck. Large-scale learning with string kernels. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, Neural Information Processing Series, pages 73–104. MIT Press, Cambridge, MA, 2007. [6] S. Sonnenburg, A. Zien, and G. R¨atsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.

You might be experiencing some problems with Your Video player.
Slides
0:00 Large Scale Learning with String Kernels
0:22 Large Scale Problems
1:26 Formally
1:48 String Kernels
2:37 Kernel Machine
4:01 Linadd Speedup Idea
5:17 Technical Remark
6:52 Datastructures - Summary of Computational Costs
8:23 Support Vector Machine
9:54 Derivation I
11:07 Derivation II
11:34 Algorithm
11:43 Parallelization
12:14 Datasets - 1
13:04 Web Spam Results
13:38 Datasets - 2
13:46 - Questions
13:58 Datasets - 2
14:15 Linadd for WD Kernel
14:28 Spectrum Kernel on Splice Data
15:02 Weighted Degree Kernel on Splice Data
15:24 More Data Helps
15:37 Discussion
16:25 Machine Learning Open Source Software

Lecture rating

People found this lecture:
Worth seeing
because it is:
 Valuable and informative
Well presented
Easily understandable
Acceptably recorded
You need to login to cast your vote.

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment: