Detecting Anomalous Records in Categorical Datasets
published: Sept. 14, 2007, recorded: September 2007, views: 164
Description
We consider the problem of detecting anomalies in high-arity categorical datasets. In most applications, anomalies are defined as data points that are "abnormal". Quite often we have access to data that consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in unsupervised anomaly detection, where we use the unlabelled data for training and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data and compare test records against it. A probabilistic approach builds a likelihood model from the training data, and records are tested for anomalousness based on the complete-record likelihood under that model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it often tends to flag records simply because they contain rare attribute values. Sometimes detecting rare values of an attribute is not desired, and such outliers are not considered anomalies in that context. We present an alternative definition of anomalies and propose an approach that compares against marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies and that it performs better on semi-synthetic as well as real-world datasets.
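The contrast between scoring by complete-record likelihood and scoring against marginals of attribute subsets can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the description: the toy data, the helper names (`fit_counts`, `joint_likelihood`, `min_pair_ratio`), the restriction to attribute pairs, the Laplace smoothing, and the specific score, which is the ratio of a pair's joint probability to the product of its single-attribute marginals. The joint-count likelihood stands in for a learned Bayes net only because the toy data has two attributes.

```python
from collections import Counter
from itertools import combinations

def fit_counts(records):
    """Count single values and value pairs for every attribute position."""
    n_attrs = len(records[0])
    single = [Counter() for _ in range(n_attrs)]
    pair = {ij: Counter() for ij in combinations(range(n_attrs), 2)}
    for rec in records:
        for i, value in enumerate(rec):
            single[i][value] += 1
        for (i, j) in pair:
            pair[(i, j)][(rec[i], rec[j])] += 1
    return single, pair, len(records)

def joint_likelihood(rec, records, single, alpha=1.0):
    """Laplace-smoothed probability of the complete record (baseline score)."""
    cells = 1
    for counts in single:
        cells *= len(counts)  # number of possible value combinations seen
    matches = sum(1 for r in records if r == rec)
    return (matches + alpha) / (len(records) + alpha * cells)

def min_pair_ratio(rec, single, pair, n, alpha=1.0):
    """Smallest smoothed P(a, b) / (P(a) * P(b)) over all attribute pairs.

    Values well below 1 mean the two values co-occur far less often than
    their individual frequencies would suggest.
    """
    ratios = []
    for (i, j), counts in pair.items():
        k_i, k_j = len(single[i]), len(single[j])
        p_i = (single[i][rec[i]] + alpha) / (n + alpha * k_i)
        p_j = (single[j][rec[j]] + alpha) / (n + alpha * k_j)
        p_ij = (counts[(rec[i], rec[j])] + alpha) / (n + alpha * k_i * k_j)
        ratios.append(p_ij / (p_i * p_j))
    return min(ratios)

# Hypothetical training data: (country, product) records.
train = ([("US", "laptop")] * 49
         + [("DE", "phone")] * 49
         + [("IS", "laptop")] * 2)   # "IS" is a rare but consistent value
single, pair, n = fit_counts(train)

typical = ("US", "laptop")    # common values, common combination
rare_value = ("IS", "laptop") # rare value, combination seen before
odd_combo = ("US", "phone")   # common values, combination never seen

for name, rec in [("typical", typical),
                  ("rare_value", rare_value),
                  ("odd_combo", odd_combo)]:
    print(name,
          "likelihood=%.4f" % joint_likelihood(rec, train, single),
          "min_ratio=%.4f" % min_pair_ratio(rec, single, pair, n))
```

On this toy data the rare-value record receives a much lower likelihood than the typical record, yet its pair ratio is the same as the typical record's, so a ratio-based test does not single it out. The record that combines two common values in an unseen way is the one with a ratio far below 1.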
Download slides:
categorical_kdd_kustav_das.ppt (1.2 MB)