Successes, Failures and Learning From Them

author: Haym Hirsh, Rutgers, The State University of New Jersey
published: Aug. 16, 2007,   recorded: August 2007,   views: 391

Description

Another topic of interest is to highlight some of the classic mistakes made in the field. These range from using non-representative training data to ignoring population drift when modeling time-varying data, from failing to account for errors in the data or labels to overreliance on a single technique for the task at hand, and from asking the wrong question in the context of the application driver to sampling without care. A related topic is the role of benchmark datasets and algorithms, and the general importance of, and requirement for, repeatable and reproducible results.

See Also:

Download slides: kdd07_hirsh_sfle_01.pdf (542.0 KB)


Reviews and comments:

Comment 1, David W. Aha, August 24, 2007 at 6:05 p.m.:

I really enjoyed seeing Haym's short presentation, and this so soon after the conference.

I have a minor quibble: it's incorrect to say that Pat Langley, as part of his work serving as inaugural editor of the Machine Learning journal, created the UCI Repository of ML Databases. Rather, what happened is that Jeff Schlimmer had collected a few databases and handed them over to me. I noticed others failing in their attempts to create a widely used repository. Given this, I collected many more datasets, and then announced and publicized this as the UCI Repository (probably in 1988). (I passed it on in 1990 after finishing at UCI, and it's been in the hands of many other caretakers since that time. A Google Scholar citation search reveals over 4000 citations, although this is so large as to seem incredible, and perhaps should be verified more systematically.) Pat was a strong supporter, and may have (I don't recall) encouraged some folks to send databases to me, although I recall mostly requesting these proactively. However, he did not create this repository, and it was not supported by any funding.

I completely agree with Haym's comments on the use of such datasets (i.e., they have some potential utility but also inherent limitations), having first heard this argument made strongly by Lorenza Saitta in her presentation at an ICML-95 workshop, although it was possibly made earlier, as attributed to members of the first-generation case-based reasoning community (e.g., Janet Kolodner).

David W. Aha
24 August 2007
