Coresets for Nonparametric Estimation - the Case of DP-Means

author: Olivier Bachem, Department of Computer Science, ETH Zurich
published: Sept. 27, 2015,   recorded: July 2015,   views: 1790

Related Open Educational Resources

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.


Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets - a data summarization technique originating from computational geometry - for this task. Coresets are weighted subsets of the data such that models trained on these coresets are provably competitive with models trained on the full dataset. Coresets sublinear in the dataset size allow for fast approximate inference with provable guarantees. Existing constructions, however, are limited to parametric problems. Using novel techniques in coreset construction we show the existence of coresets for DP-Means - a prototypical nonparametric clustering problem - and provide a practical construction algorithm. We empirically demonstrate that our algorithm allows us to efficiently trade off computation time and approximation error and thus scale DP-Means to large datasets. For instance, with coresets we can obtain a computational speedup of 45x at an approximation error of only 2.4% compared to solving on the full data set. In contrast, for the same subsample size, the “naive” approach of uniformly subsampling the data incurs an approximation error of 22.5%.

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment:

make sure you have javascript enabled or clear this field: