NATO Advanced Study Institute on Mining Massive Data Sets for Security

Mining Massive Data Sets

Description

Today, the amount of data coming from all possible sources is enormous and growing fast, due in large part to the ubiquitous Web and its increasing presence in our everyday life, but also to emails, cell phones, credit cards, retail, finance, and so on. These data serve all sorts of functions: query and search, information extraction, service provision, and security management. Many fields are involved: statistics, data mining, text mining, data streams, search, social networks, and others.

There is no lack of sophisticated techniques produced by academic activity, where challenges mostly concern the novelty, accuracy, and scalability of algorithms. In real-world applications, however, the challenges are quite different: scalability (usually one or two orders of magnitude larger than in academic publications), ease of use, and the ability to integrate efficient techniques into working systems in a transparent way, while always producing value for the customer. Real-world solutions are complex and usually need to integrate many technical components from the various fields mentioned above; it thus becomes important to assess how these fields can complement one another.

In the first part of the talk, I will present the challenges of real-world data mining applications. I will introduce the general Statistical Learning Theory framework and discuss some of the technical issues involved (high-dimensional data sets, missing data, outliers, non-i.i.d. structured data, unlabelled data, ...). In the second part, I will show, taking examples from the implementation in KXEN and from applications developed, how a theoretical framework (Structural Risk Minimization [1]) can be used to solve some of the challenges met in the real world. Finally, I will describe some open practical issues that will require further theoretical investigation.
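
As a rough illustration of the Structural Risk Minimization idea mentioned above, the sketch below picks, among nested model classes, the one that minimizes empirical risk plus a capacity term. It uses the classical VC confidence term from Vapnik's bound for classification; the function names (`vc_confidence`, `srm_select`) and all the numbers in the example are hypothetical, and nothing here is meant to describe how KXEN actually implements the principle.

```python
import math


def vc_confidence(h, n, eta=0.05):
    """Capacity term of Vapnik's bound for a class of VC dimension h,
    n training samples, holding with probability at least 1 - eta."""
    if not 0 < h <= n:
        raise ValueError("need 0 < h <= n")
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def srm_select(candidates, n, eta=0.05):
    """Return the candidate with the smallest guaranteed risk.

    candidates: list of (name, vc_dimension, empirical_risk) tuples,
    one per element of the nested structure (hypothetical values).
    """
    return min(candidates, key=lambda c: c[2] + vc_confidence(c[1], n, eta))


# Hypothetical structure: richer classes fit the training set better
# but pay a larger capacity penalty.
candidates = [
    ("small", 5, 0.20),
    ("medium", 50, 0.05),
    ("large", 2000, 0.02),
]
best = srm_select(candidates, n=10_000)
print(best[0])
```

With these made-up figures the intermediate class is selected: the largest class has the lowest training error, but its capacity term dominates the bound.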

Slides
0:00 Mining Massive Data Sets
0:07 Agenda
1:12 A little bit of history – Data mining & NATO
5:23 A little bit of history
6:30 A little bit of history (1)
7:41 A little bit of history (2)
10:03 Data
10:56 A little bit of history (2)
11:19 Data
14:25 Data (1)
16:57 Data (2)
17:43 Data (3)
19:37 Yahoo! Data – A league of its own …
21:36 Functions
25:11 Map of the workshop
26:30 Agenda
27:53 What are the issues in the real-world?
30:44 What are the issues in the real-world? (1)
35:11 Data mining in practice
37:20 Data mining in practice (1)
42:40 Data mining in practice (2)
46:11 Data mining in practice (1)
46:32 Data mining in practice (2)
49:00 Challenges for the real-world
58:34 Vapnik’s Statistical Learning Theory
60:21 Vapnik’s Statistical Learning Theory (1)
62:32 Vapnik’s Statistical Learning Theory (2)
62:43 Vapnik’s Statistical Learning Theory (3)
69:46 Vapnik’s Statistical Learning Theory (4)
71:11 Vapnik’s Statistical Learning Theory (5)
74:53 Vapnik’s Statistical Learning Theory (6)
77:58 Vapnik’s Statistical Learning Theory (7)
80:17 Vapnik’s Statistical Learning Theory (8)
82:36 Vapnik’s Statistical Learning Theory (9)
83:56 Vapnik’s Statistical Learning Theory (10)
85:51 Structural Risk Minimization
86:35 Structural Risk Minimization (1)
87:07 Structural Risk Minimization (2)
88:07 Structural Risk Minimization (3)
88:57 KXEN implementation
89:51 KXEN implementation (1)
90:26 Modelization process in KXEN
90:38 Modelization process in KXEN (1)
91:01 Modelization process in KXEN (2)
93:25 Modelization process in KXEN (3)
93:41 Modelization process in KXEN (4)
94:45 Modelization process in KXEN (5)
95:36 Modelization process in KXEN (6)
96:17 Modelization process in KXEN (7)
97:17 Agenda
97:20 Using textual variables
100:24 Using textual variables - DataMining Cup'06
113:57 Using textual variables - DataMining Cup'06 (1)
116:00 Large telco operator
119:34 Large telco operator (1)
130:04 Questions
