Mining Massive Data Sets
published: Nov. 26, 2007, recorded: September 2007, views: 9967
Today, the amount of data coming from all possible sources is enormous and growing fast, due in large part to the ubiquitous Web and its increasing presence in everyday life, but also to email, cell phones, credit cards, retail, finance, and more. These data serve all sorts of functions: from query and search to information extraction, service provision, and security management. Many fields are involved: statistics, data mining, text mining, data streams, search, social networks, and others.
There is no lack of sophisticated techniques produced by academic research, where the challenges mostly concern the novelty, accuracy, and scalability of algorithms. In real-world applications, however, the challenges are quite different: scalability (usually one or two orders of magnitude larger than in academic publications), ease of use, and the ability to integrate efficient techniques transparently into working systems, while always producing value for the customer. Real-world solutions are complex and usually need to integrate many technical components from the fields mentioned above, so it becomes important to assess how these fields can complement one another.
In the first part of the talk, I will present the challenges of real-world data mining applications. I will introduce the general Statistical Learning Theory framework and discuss some of the technical issues involved (high-dimensional data sets, missing data, outliers, non-i.i.d. structured data, unlabelled data, ...). In the second part, I will show, using examples from the implementation in KXEN and from applications developed, how a theoretical framework (Structural Risk Minimization) can be used to solve some of the challenges met in the real world. Finally, I will describe some open practical issues that will require further theoretical investigation.