Workshop on Stability and Resampling Methods for Clustering, Tübingen 2007

About

Model assessment is one of the most crucial aspects of statistical data analysis. In data clustering in particular, it is difficult to devise reasonable tools for this purpose; the most prominent example is the problem of choosing the number k of clusters to construct. Stability-based methods and resampling methods have become a popular choice for model selection in clustering, as documented by the wealth of literature on this topic. The basic rationale of these approaches is that valid models should be reproducible under perturbation or resampling of the data. If a model turns out to be highly unstable, the inferred solution does not seem to be a generally valid model, or at least seems to have missed some important aspects of the data.
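
To make the resampling rationale concrete, here is a minimal sketch of one possible stability protocol. It is an illustration under assumptions, not a method prescribed by the workshop. The assumptions are: k-means as the clustering algorithm, random subsamples as the perturbation, extension of each clustering to the full data set by nearest-center assignment, and one minus the Rand index as the disagreement measure. All function names (`kmeans`, `assign`, `rand_index`, `instability`) and parameters (`n_pairs`, `frac`, `seed`) are hypothetical.

```python
import numpy as np


def assign(X, centers):
    """Label each point with the index of its nearest center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm with random initial centers (assumed choice)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree
    (same cluster in both, or different clusters in both).
    Invariant to how the clusters are numbered."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return np.mean(same_a[iu] == same_b[iu])


def instability(X, k, n_pairs=20, frac=0.8, seed=0):
    """Average disagreement between clusterings fitted on two random
    subsamples and then extended to all points in X."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        c1 = kmeans(X[rng.choice(n, size=m, replace=False)], k, rng)
        c2 = kmeans(X[rng.choice(n, size=m, replace=False)], k, rng)
        scores.append(1.0 - rand_index(assign(X, c1), assign(X, c2)))
    return float(np.mean(scores))
```

Calling `instability(X, k)` for several values of k and comparing the results is one way to use such a score for model selection. A low value only says that the solution is reproducible under this particular perturbation scheme; whether that implies a valid model is a separate question.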

Many scientists report that stability and resampling methods work well for clustering model selection. Moreover, for supervised learning there is a wealth of literature proving that stable classification algorithms generalize well. On the other hand, it has recently been claimed that stability methods for clustering can be misleading and do not necessarily work the way people believe they do. There is still an ongoing debate on how these results should be interpreted. Many researchers working on clustering stability nevertheless agree that a theoretical understanding of stability methods in clustering is lacking. In particular, it is unclear in which situations stability works and which mechanism makes it a successful tool in those situations.

This lack of understanding is the motivation for holding a workshop on stability and resampling methods for clustering. We plan to hold a rather small workshop for specialists working on stability questions in clustering, or on stability-related questions in other areas of computer science or mathematics. We want to have a small number of invited talks and to dedicate a considerable amount of time to discussion. We hope that combining the expertise of people working on different aspects of stability and resampling will lead to a deeper understanding of this tool and its role in clustering.

To guide the discussion, we would like to point out the following list of questions about the theory of stability methods for clustering:

* For which purposes can we use clustering stability? For example, (how) can stability be used for model selection?
* What is the mechanism which makes stability a valid tool in those situations?
* Can we characterize the situations when stability tools can be successful? Can we predict situations in which stability tools will not help at all or are misleading?
* What are the inherent limitations of stability approaches? What assumptions do we have to make?

Find out more at the workshop website.