Large Scale Learning with String Kernels

author: Sören Sonnenburg, Intelligent Data Analysis Group, Fraunhofer Institute for Intelligent Analysis and Information Systems
published: Dec. 29, 2007,   recorded: December 2007,   views: 8836


Related Open Educational Resources

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.


In applications of bioinformatics and text processing, such as splice site recognition and spam detection, large amounts of training sequences are available and needed to achieve sufficiently high prediction performance on classification or regression tasks. Although kernel-based methods such as SVMs often achieve state-of-the-art results, training and evaluation times may be prohibitively large. When single kernel computation time is already linear (w.r.t. the input sequences) it seems difficult to achieve further speed ups. In this work we describe an efficient technique for computing linear combinations of string kernels using sparse data structures such as explicit maps, sorted arrays and suffix tries, trees or arrays [5]. As computing linear combinations of kernels make up the dominant part of SVM training and evaluation, speeding up their computation is essential. Considering the recently proposed and successfully used linear time string kernels, like the Spectrum kernel [2] and the Weighted Spectrum kernel [3] we show that one can accelerate SVM training by factors of 7 and 60 times, respectively, while requiring considerably less memory. Our method allows us to train string kernel SVMs on sets as large as 10 million sequences [4]. Moreover, using these techniques the evaluation on new sequences is often several thousand times faster, allowing us to apply the classifiers on genome-sized data sets with seven billion test examples [6]. The presented algorithms are implemented in our Machine Learning toolbox SHOGUN for which the source code is publicly available at
References: [1] D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge University Press, 1997. [2] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, Kaua’i, Hawaii, 2002. [3] G. R¨atsch and S. Sonnenburg. Accurate Splice Site Prediction for Caenorhabditis Elegans, pages 277–298. MIT Press series on Computational Molecular Biology. MIT Press, 2004. [4] S. Sonnenburg, P. Philips, G. Schweikert, and G. R¨atsch. Accurate splice site prediction using support vector machines. BMC Bioinformatics, 8, 2007. [5] S. Sonnenburg, G. R¨atsch, and K. Rieck. Large-scale learning with string kernels. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, Neural Information Processing Series, pages 73–104. MIT Press, Cambridge, MA, 2007. [6] S. Sonnenburg, A. Zien, and G. R¨atsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.

See Also:

Download slides icon Download slides: eml07_sonnenburg_lsl_01.pdf (759.3 KB)

Help icon Streaming Video Help

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment:

make sure you have javascript enabled or clear this field: