Statistical techniques for fraud detection, prevention, and evaluation
published: Dec. 3, 2007, recorded: September 2007, views: 8438
The talk begins by setting the context: fraud is defined and its breadth outlined; figures are given showing how significant fraud is; and different areas of fraud are examined, including health care fraud, banking fraud, and scientific fraud.
The particular data analytic challenges of banking fraud are described and illustrated in detail. These include the fact that the classes are highly unbalanced (with typically no more than 1 in 1,000 transactions being fraudulent), that class labels may often be incorrect, that there will typically be delays in discovering the true labels, that the transaction arrival times are random, that the data are dynamic, and, perhaps most challenging of all, that the distributions are reactive, changing in response to the implementation of fraud detection systems. The role of mechanistic and empirical models in tackling these problems is described. Both have been widely used, and both have a contribution to make.
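The effect of the class imbalance can be made concrete with a little arithmetic. The sketch below is illustrative only: the fraud rate is the upper bound quoted above, while the detector's sensitivity and specificity (both 0.99) are assumed values chosen for the example, not figures from the talk.

```python
def always_legit_accuracy(fraud_rate):
    """Accuracy of a trivial rule that never flags any transaction."""
    return 1.0 - fraud_rate


def precision(fraud_rate, sensitivity, specificity):
    """Share of flagged transactions that are truly fraudulent (Bayes' rule)."""
    true_pos = fraud_rate * sensitivity
    false_pos = (1.0 - fraud_rate) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


fraud_rate = 1 / 1000  # "no more than 1 in 1,000" from the abstract

# A rule that flags nothing is almost always "right" yet catches no fraud.
print(f"accuracy of flagging nothing: {always_legit_accuracy(fraud_rate):.3f}")

# Assumed detector: 99% sensitivity, 99% specificity (hypothetical values).
print(f"precision of assumed detector: {precision(fraud_rate, 0.99, 0.99):.3f}")
```

Under these assumptions, flagging nothing is 99.9% accurate, and even a detector that is right 99% of the time on each class produces alerts of which only about 9% are genuine fraud.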
Banking data, and in particular banking fraud data, are examined in detail. Raw credit card transaction data have 70-80 variables per transaction, and this can be multiplied many-fold for behavioural data, as in fraud detection problems. Questions arise as to how to aggregate the data: should one try to classify individual transactions or should activity records be constructed?
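One way to picture the aggregation question is to roll individual transactions up into per-account activity records over a fixed time window. The following is a minimal sketch under stated assumptions: the field names (`account`, `hour`, `amount`), the 24-hour window, and the three summary features are all hypothetical and are not taken from the talk or from any bank's data.

```python
from collections import defaultdict

# Hypothetical toy transactions; real records carry 70-80 variables each.
transactions = [
    {"account": "A", "hour": 1, "amount": 20.0},
    {"account": "A", "hour": 2, "amount": 35.0},
    {"account": "B", "hour": 2, "amount": 500.0},
    {"account": "A", "hour": 30, "amount": 15.0},
]


def activity_records(transactions, window_hours=24):
    """Summarise transactions per (account, time window)."""
    grouped = defaultdict(list)
    for t in transactions:
        window = t["hour"] // window_hours  # index of the window the hour falls in
        grouped[(t["account"], window)].append(t["amount"])
    return {
        key: {"count": len(amounts), "total": sum(amounts), "largest": max(amounts)}
        for key, amounts in grouped.items()
    }


records = activity_records(transactions)
for key, summary in sorted(records.items()):
    print(key, summary)
```

A classifier could then be trained either on the raw rows or on these summaries; the choice of window length and summary features is itself a modelling decision.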
A fundamental aspect of any predictive problem in data analysis is the choice of an appropriate criterion for estimation and performance assessment. In the case of fraud, one needs, in particular, to combine both classification accuracy and timeliness of classification. This means that standard measures of classification performance, such as error rate, AUC, KS statistic, information value, etc., are not sufficient. Suitable measures and performance curves are described which combine these aspects and which are now being adopted by the industry.
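For reference, three of the standard measures named above can be computed from labels and scores alone, which is exactly why they say nothing about timeliness. The sketch below uses textbook definitions and made-up toy data (the labels, scores, and 0.5 threshold are assumptions for illustration); it does not reproduce the combined measures described in the talk.

```python
def error_rate(labels, scores, threshold):
    """Fraction of cases misclassified when score >= threshold means 'fraud'."""
    wrong = sum((s >= threshold) != bool(y) for y, s in zip(labels, scores))
    return wrong / len(labels)


def auc(labels, scores):
    """Probability that a random fraud case outscores a random legitimate one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """Largest gap between the score distributions of the two classes."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(pos, t) - ecdf(neg, t)) for t in set(scores))


# Toy data: 1 = fraud, 0 = legitimate; scores are hypothetical model outputs.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.90]

print(error_rate(labels, scores, threshold=0.5))
print(auc(labels, scores))
print(ks_statistic(labels, scores))
```

None of these functions takes a timestamp as input, so two detectors with identical scores but very different detection delays would be rated the same.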
Various statistical (used here in John Chambers’s sense of ‘greater statistics’) approaches have been developed for fraud detection problems, and some are described and illustrated, using data from some of the banks which have been collaborating with us. In particular, we look at supervised classification and anomaly detection methods. Finally, in the context of banking fraud, some of the deeper but very important conceptual issues are outlined, including the economic imperative, whether fraud is now becoming ‘acceptable’, and what exactly we learn from empirical comparisons.
Scientific fraud is contrasted with banking fraud. They have rather different drivers. In particular, financial gain is generally irrelevant to scientific fraud, which makes it an unusual kind of fraud, although, of course, the impact can be even more serious. Several examples are given, from a range of disciplines. The role of data analytic tools in detecting scientific fraud, and the nature of such tools, is described.