Predictive Modelling in the Wild: Success Factors in Data Mining Competitions and Real-World Applications

author: Claudia Perlich, IBM Watson Research Center
author: Saharon Rosset, Tel Aviv University

Description

In this tutorial, we give our perspective on the keys to success in applying predictive modeling, both in competitions such as the KDD Cup and in real-life business intelligence projects. We argue that these two modes of applying predictive modeling share many similarities but also differ in some important ways. We discuss the main success factors in predictive modeling: domain understanding, statistical acumen, and appropriate algorithmic approaches. We describe our relevant experiences in the context of three recent predictive modeling competitions in which our team has had success (KDD Cup 2007, KDD Cup 2008, and the INFORMS DM Challenge 2008) and two case studies of projects we have led at IBM Research. We also survey some of the recurring challenges and complexities in practical predictive modeling applications. One key issue is information leakage; we discuss its definition, influence, detection, and avoidance. We consider leakage to be the silent killer of many predictive modeling projects. We demonstrate its impact on the competitions and discuss the challenges of addressing it in real-life projects. Other challenges include framing real-life objectives as predictive modeling problems and usefully applying relational learning concepts when modeling complex, relational "real-life" datasets.
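As a rough illustration of the leakage issue described above, the following minimal sketch shows two ideas: splitting data by time so that the model is evaluated on records later than anything it was trained on, and flagging features that determine the label suspiciously well. It is not taken from the tutorial itself. The field names (`observed`, `usage`, `followup_code`, `label`), the toy records, and both helper functions are hypothetical and were chosen only for this example.

```python
from collections import defaultdict
from datetime import date


def temporal_split(rows, cutoff):
    """Train on records observed before the cutoff, test on the rest."""
    train = [r for r in rows if r["observed"] < cutoff]
    test = [r for r in rows if r["observed"] >= cutoff]
    return train, test


def suspicious_features(rows, label="label", skip=("observed",)):
    """Flag features whose every distinct value maps to exactly one label.

    This is only a heuristic: unique identifiers and other
    high-cardinality columns would be flagged as well.
    """
    flagged = []
    features = [k for k in rows[0] if k != label and k not in skip]
    for feature in features:
        labels_by_value = defaultdict(set)
        for r in rows:
            labels_by_value[r[feature]].add(r[label])
        perfectly_separating = all(len(s) == 1 for s in labels_by_value.values())
        if perfectly_separating and len(labels_by_value) > 1:
            flagged.append(feature)
    return flagged


# Toy data (invented): "followup_code" is only filled in after the outcome
# is known, so it leaks the label.
rows = [
    {"observed": date(2005, 3, 1), "usage": 3, "followup_code": 0, "label": 0},
    {"observed": date(2005, 7, 9), "usage": 3, "followup_code": 1, "label": 1},
    {"observed": date(2005, 11, 2), "usage": 5, "followup_code": 0, "label": 0},
    {"observed": date(2006, 1, 15), "usage": 5, "followup_code": 1, "label": 1},
    {"observed": date(2006, 4, 20), "usage": 3, "followup_code": 0, "label": 0},
]

train, test = temporal_split(rows, date(2006, 1, 1))
flagged = suspicious_features(rows)
print(len(train), len(test), flagged)
```

A flagged feature is not proof of leakage. It is a prompt to ask, using domain knowledge, whether that value would really be available at prediction time.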

Slides
0:00 Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
0:34 Predictive modeling
4:04 Predictive Modeling Competitions
5:46 The Goals of this Tutorial
7:08 Credentials in Data Mining Competitions
8:50 Experience with Real Life Projects
10:05 Outline
11:02 Introduction: What do you think is important?
11:10 Differences between competitions and projects
14:57 Real life project evolution and our focus
18:04 Two types of competitions
21:16 Factors of Success in Competitions and Real Life
23:41 Recurring challenges
26:49 1 Leakage in Predictive Modeling
29:58 Real life leakage example
32:37 General leakage solution: “predict the future”
34:34 2 Real-life performance measures
37:06 Optimizing real-life measures
38:43 3 Relational and Multi-Level Data
39:33 Approaches for dealing with relational data
41:39 Modeler’s best friend: Exploratory data analysis
43:07 The beauty and value of exploratory data analysis
44:28 Elements of EDA for predictive modeling
46:10 Case study #1: Netflix/KDD-Cup 2007
46:29 October 2006 Announcement of the NETFLIX Competition
48:23 NETFLIX Data - Internet Movie Data Base
50:08 NETFLIX data generation process
51:26 KDD-CUP 2007 based on the NETFLIX
52:34 Test sets from 2006 for Task 1 and Task 2
55:01 Task 1: Did User A review Movie B in 2006?
55:26 Task 2: How many reviews in 2006?
57:21 Some data observations
59:42 Test sets from 2006 for Task 1 and Task 2 (ctd.)
60:25 Some data observations (ctd.)
61:25 Some statistical perspectives
64:49 Some statistical perspectives (ctd.)
65:00 Test sets from 2006 for Task 1 and Task 2 (ctd.)
65:27 Some statistical perspectives (ctd.)
66:46 Modeling Approach Schema
71:28 Some observations on modeling approach
72:32 Modeling Approach Schema
72:44 Some details on our models and submission
73:48 Modeling Approach Schema
74:02 Some details on our models and submission
74:26 Competition evaluation
76:00 Competition evaluation (ctd.)
76:42 Some details on our models and submission
76:48 Competition evaluation (ctd.)
77:11 Effect of scaling on the two evaluation approaches (1)
78:41 Competition evaluation (ctd.)
78:48 Effect of scaling on the two evaluation approaches (1)
78:54 Effect of scaling on the two evaluation approaches (2)
79:17 KDD CUP 2007: Summary

Videos

Part 1 (1:21:12)
Part 2 (1:31:02)
