Predictive Modelling in the Wild: Success Factors in Data Mining Competitions and Real-World Applications
Description
In this tutorial, we give our perspective on the keys to success in applying predictive modeling to competitions such as KDD Cup and to real-life business intelligence projects. We argue that these two modes of applying predictive modeling share many similarities but also differ in some important ways. We discuss the main success factors in predictive modeling: domain understanding, statistical acumen, and appropriate algorithmic approaches. We describe our relevant experience with three recent predictive modeling competitions in which our team has had success (KDD Cup 2007, KDD Cup 2008, and the INFORMS DM challenge 2008) and with two case studies of projects we have led at IBM Research. We also survey some of the recurring challenges and complexities in practical predictive modeling applications. One key issue is information leakage, and we discuss its definition, influence, detection, and avoidance. We consider leakage to be the silent killer of many predictive modeling projects; we demonstrate its impact on the competitions and discuss the challenges of addressing it in real-life projects. Other challenges include framing real-life business objectives as predictive modeling tasks, and usefully applying relational learning concepts when modeling complex, relational "real-life" datasets.
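To make the leakage issue concrete, here is a minimal sketch of one common safeguard: splitting data by time so that a model is trained only on records observed before the period it is asked to predict. This is an illustration in the spirit of the "predict the future" idea named in the slide list, not the presenters' actual procedure. The record layout, the field name `observed_on`, and the helper names `temporal_split` and `leaks_future` are all hypothetical.

```python
from datetime import date

def temporal_split(records, cutoff):
    """Split records by time: train strictly before the cutoff, test on or after it.

    Assumes each record is a dict with a hypothetical `observed_on` date field.
    """
    train = [r for r in records if r["observed_on"] < cutoff]
    test = [r for r in records if r["observed_on"] >= cutoff]
    return train, test

def leaks_future(train, test):
    """Return True if any training record is dated on or after the earliest test record."""
    if not train or not test:
        return False
    latest_train = max(r["observed_on"] for r in train)
    earliest_test = min(r["observed_on"] for r in test)
    return latest_train >= earliest_test

# Made-up toy records, for illustration only.
records = [
    {"user": "A", "observed_on": date(2005, 3, 1), "label": 1},
    {"user": "B", "observed_on": date(2005, 11, 20), "label": 0},
    {"user": "A", "observed_on": date(2006, 2, 14), "label": 1},
    {"user": "C", "observed_on": date(2006, 7, 4), "label": 0},
]

train, test = temporal_split(records, cutoff=date(2006, 1, 1))
clean_split_leaks = leaks_future(train, test)

# A random or hand-mixed split can place later records in training; the check flags it.
mixed_train = [records[0], records[3]]
mixed_test = [records[1], records[2]]
mixed_split_leaks = leaks_future(mixed_train, mixed_test)
```

A date-based check like this catches only one kind of leakage (training on records from the prediction period). Leakage through features that encode the outcome indirectly needs domain understanding and exploratory analysis to detect.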
Slides
| Time | Slide title |
| 0:00 | Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects |
| 0:34 | Predictive modeling |
| 4:04 | Predictive Modeling Competitions |
| 5:46 | The Goals of this Tutorial |
| 7:08 | Credentials in Data Mining Competitions |
| 8:50 | Experience with Real Life Projects |
| 10:05 | Outline |
| 11:02 | Introduction: What do you think is important? |
| 11:10 | Differences between competitions and projects |
| 14:57 | Real life project evolution and our focus |
| 18:04 | Two types of competitions |
| 21:16 | Factors of Success in Competitions and Real Life |
| 23:41 | Recurring challenges |
| 26:49 | 1 Leakage in Predictive Modeling |
| 29:58 | Real life leakage example |
| 32:37 | General leakage solution: “predict the future” |
| 34:34 | 2 Real-life performance measures |
| 37:06 | Optimizing real-life measures |
| 38:43 | 3 Relational and Multi-Level Data |
| 39:33 | Approaches for dealing with relational data |
| 41:39 | Modeler’s best friend: Exploratory data analysis |
| 43:07 | The beauty and value of exploratory data analysis |
| 44:28 | Elements of EDA for predictive modeling |
| 46:10 | Case study #1: Netflix/KDD-Cup 2007 |
| 46:29 | October 2006 Announcement of the NETFLIX Competition |
| 48:23 | NETFLIX Data - Internet Movie Data Base |
| 50:08 | NETFLIX data generation process |
| 51:26 | KDD-CUP 2007 based on the NETFLIX |
| 52:34 | Test sets from 2006 for Task 1 and Task 2 |
| 55:01 | Task 1: Did User A review Movie B in 2006? |
| 55:26 | Task 2: How many reviews in 2006? |
| 57:21 | Some data observations |
| 59:42 | Test sets from 2006 for Task 1 and Task 2 (ctd.) |
| 60:25 | Some data observations (ctd.) |
| 61:25 | Some statistical perspectives |
| 64:49 | Some statistical perspectives (ctd.) |
| 65:00 | Test sets from 2006 for Task 1 and Task 2 (ctd.) |
| 65:27 | Some statistical perspectives (ctd.) |
| 66:46 | Modeling Approach Schema |
| 71:28 | Some observations on modeling approach |
| 72:32 | Modeling Approach Schema |
| 72:44 | Some details on our models and submission |
| 73:48 | Modeling Approach Schema |
| 74:02 | Some details on our models and submission |
| 74:26 | Competition evaluation |
| 76:00 | Competition evaluation (ctd.) |
| 76:42 | Some details on our models and submission |
| 76:48 | Competition evaluation (ctd.) |
| 77:11 | Effect of scaling on the two evaluation approaches (1) |
| 78:41 | Competition evaluation (ctd.) |
| 78:48 | Effect of scaling on the two evaluation approaches (1) |
| 78:54 | Effect of scaling on the two evaluation approaches (2) |
| 79:17 | KDD CUP 2007: Summary |