Handling noisy data
published: Feb. 25, 2007, recorded: July 2005, views: 6843
In the practice of machine learning, learning data typically contain errors. Imperfections in data can be due to various, often unavoidable causes: measurement errors, human mistakes, errors of expert judgement in classifying training examples, and so on. We refer to all of these as noise. Noise can also come from the treatment of missing values, when an example with an unknown attribute value is replaced by a set of weighted examples corresponding to the probability distribution of the missing value.

The typical consequences of noise in learning data are: (a) low prediction accuracy of induced hypotheses on new data, and (b) large hypotheses that are hard for the user to interpret and understand. For example, decision trees with hundreds or thousands of nodes are not suitable for interpretation by the domain expert. We say that such complex hypotheses overfit the data. Overfitting occurs when the hypothesis not only reflects the genuine regularities in the domain, but also traces noise in the data.

To alleviate the harmful effects of noise, we have to prevent overfitting. A common approach is to simplify induced hypotheses; in the learning of rules or decision trees, this leads to tree pruning or rule truncation. The main question in hypothesis simplification is: how can we know that our hypothesis is of "the right size", not too simple and not too complex? For example, in tree pruning, when should we stop pruning? The decision can be based on the estimated accuracy of a hypothesis before and after pruning, choosing whichever maximises the estimated accuracy. However, estimating the accuracy can be difficult, and involves the problem of estimating probabilities from small samples. Several methods for this will be discussed in this lecture, and the effects of simplification will be illustrated. A somewhat related approach to deciding on the "right size" of a hypothesis is based on the minimum description length (MDL) principle.
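The pruning decision described above can be sketched in a few lines. The snippet below (a simplified illustration for the binary-class case, not taken from the lecture itself) uses the Laplace correction — one standard way of estimating a probability from a small sample — to estimate the accuracy of a subtree's leaves versus the single leaf that would replace them, and prunes when the collapsed leaf is estimated to be at least as accurate:

```python
def laplace_accuracy(correct, total, classes=2):
    # Laplace-corrected estimate of a leaf's accuracy: (s + 1) / (n + k)
    # for s correct predictions out of n examples and k classes.
    # Avoids extreme estimates (0 or 1) from very small samples.
    return (correct + 1) / (total + classes)

def should_prune(leaves):
    # leaves: list of (positives, total) counts for the subtree's leaves
    # (binary case). Compare the Laplace-estimated accuracy of the
    # subtree (size-weighted over its leaves, each predicting its own
    # majority class) against that of one collapsed leaf.
    pos = sum(p for p, _ in leaves)
    n = sum(t for _, t in leaves)
    collapsed = laplace_accuracy(max(pos, n - pos), n)
    kept = sum(t * laplace_accuracy(max(p, t - p), t)
               for p, t in leaves) / n
    return collapsed >= kept

# Two tiny leaves with the same majority class: pruning wins.
print(should_prune([(2, 3), (1, 1)]))   # True

# Two leaves with opposite majorities: collapsing loses information.
print(should_prune([(3, 4), (1, 4)]))   # False
```

Replacing `laplace_accuracy` with the m-estimate, `(s + m * p) / (n + m)` for a prior probability `p` and weight `m`, gives a family of such estimators; the lecture discusses several.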
Another way of reducing the effects of noise is to use background or prior knowledge about the domain of learning. For example, when learning from numerical data, a useful idea is to make the learning algorithm respect the known qualitative properties of the target concept.
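As a concrete illustration of this idea (an example of ours, not necessarily the method used in the lecture): if the target is known to be monotonically non-decreasing in an input, noisy measurements can be fitted subject to that constraint. The pool-adjacent-violators algorithm for isotonic regression does exactly this, smoothing out noise that violates the known qualitative property:

```python
def isotonic_fit(y):
    # Pool Adjacent Violators: least-squares fit of a non-decreasing
    # sequence to y. Maintains a list of [sum, count] blocks; whenever
    # a new value makes the last block's mean fall below the previous
    # block's mean, the two blocks are pooled and share their mean.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)   # expand each block to its pooled mean
    return out

# A noisy dip at position 2 is averaged away; the trend survives.
print(isotonic_fit([1, 3, 2, 4]))   # [1, 2.5, 2.5, 4]
```

The fitted values honour the monotonicity constraint exactly, so random fluctuations that contradict the known property cannot be traced by the hypothesis.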