The Use of Randomization and Statistical Significance in Data Mining

author: Kai Puolamäki, Finnish Institute of Occupational Health
published: Jan. 16, 2013,   recorded: December 2012,   views: 3149


The concept and theory of statistical significance testing are well established in the traditional setup, but not in the problem settings that arise in data mining. In this talk I discuss the formulation, advantages, and limitations of statistical significance testing approaches in data mining. A data mining problem in which the objective is to find patterns, such as frequent sets or clusterings, can be formulated as a statistical significance testing problem if one can (i) define a null hypothesis, (ii) formulate one or more reasonable test statistics, and (iii) either map patterns to constraints on the null hypothesis or assign each pattern a test statistic of its own, the latter case resulting in a multiple hypothesis testing setup. Data mining problems rarely allow an analytic formulation; hence randomization methods, which are needed to sample from the null hypothesis, are a key ingredient. The significance testing approach can be viewed as a way to solve the regularization problem in data analysis: one can use relatively simple and efficient data mining algorithms and rely on significance testing to take the noise into account in a principled manner. Significance testing is especially suitable for exploratory data analysis because it is possible, on the one hand, to use constraints on the null hypothesis to incorporate what we already know about the data and, on the other hand, to find the patterns that tell us the most about what we do not yet know. We present formulations of the exploration problem along with some theoretical results.
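
To make steps (i) and (ii) concrete, here is a minimal sketch of a randomization test for a single itemset in binary transaction data. Everything in it is an assumption made for illustration and not the speaker's method: the null hypothesis is that items occur independently given their frequencies (sampled by shuffling each column on its own), the test statistic is the itemset's support, and the function names `support`, `permute_columns`, and `empirical_p_value` are hypothetical. The empirical p-value is computed as (1 + number of null samples with a statistic at least as large as the observed one) / (1 + number of samples).

```python
import random


def support(rows, itemset):
    """Test statistic: number of rows that contain every item in itemset."""
    return sum(1 for row in rows if all(row[j] for j in itemset))


def permute_columns(rows, rng):
    """Draw one sample from the assumed null hypothesis.

    Each column is shuffled independently, which keeps every item's
    frequency fixed but destroys any co-occurrence between items.
    """
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    for col in columns:
        rng.shuffle(col)
    return [list(r) for r in zip(*columns)]


def empirical_p_value(rows, itemset, n_samples=999, seed=0):
    """One-sided empirical p-value of the itemset's support."""
    rng = random.Random(seed)
    observed = support(rows, itemset)
    at_least = sum(
        1
        for _ in range(n_samples)
        if support(permute_columns(rows, rng), itemset) >= observed
    )
    # The +1 terms count the observed data as one of the samples,
    # so the p-value can never be exactly zero.
    return (1 + at_least) / (1 + n_samples)


if __name__ == "__main__":
    # Toy data (made up): items 0 and 1 always co-occur, item 2 is unrelated.
    rows = (
        [[1, 1, 0]] * 10 + [[1, 1, 1]] * 10
        + [[0, 0, 1]] * 10 + [[0, 0, 0]] * 10
    )
    print("p-value for {0, 1}:", empirical_p_value(rows, (0, 1)))
    print("p-value for {0, 2}:", empirical_p_value(rows, (0, 2)))
```

If every candidate itemset gets its own p-value in this way, the result is the multiple hypothesis testing setup mentioned above, and the p-values would need a correction (for example Bonferroni) before being compared with a significance level.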
