NATO Advanced Study Institute on Mining Massive Data Sets for Security

Statistical techniques for fraud detection, prevention, and evaluation

author: David Hand, Imperial College London

Description

The talk begins by setting the context: fraud is defined and its breadth outlined; figures are given showing how significant fraud is; and different areas of fraud are examined, including health care fraud, banking fraud, and scientific fraud.

The particular data-analytic challenges of banking fraud are described and illustrated in detail. These include the fact that the classes are highly unbalanced (typically no more than 1 in 1,000 transactions is fraudulent), that class labels may often be incorrect, that there will typically be delays in discovering the true labels, that transaction arrival times are random, that the data are dynamic, and, perhaps most challenging of all, that the distributions are reactive, changing in response to the implementation of fraud detection systems. The role of mechanistic and empirical models in tackling these problems is described. Both have been widely used, and both have a contribution to make.
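To make the class-imbalance point concrete, here is a minimal sketch of the base-rate arithmetic. The prevalence of 1 in 1,000 comes from the description above; the 99% sensitivity and 99% specificity are illustrative assumptions, not figures taken from the talk, and `flagged_precision` is a hypothetical helper name.

```python
def flagged_precision(prevalence, sensitivity, specificity):
    """Fraction of flagged transactions that are truly fraudulent (Bayes' rule)."""
    true_pos = prevalence * sensitivity                   # frauds correctly flagged
    false_pos = (1.0 - prevalence) * (1.0 - specificity)  # legitimate cases flagged
    return true_pos / (true_pos + false_pos)


# Assumed detector quality: 99% of frauds caught, 99% of legitimate cases passed.
precision = flagged_precision(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"fraud among flagged:      {precision:.1%}")      # 9.0%
print(f"legitimate among flagged: {1 - precision:.1%}")  # 91.0%
```

Under these assumptions, even a detector that is right 99% of the time on each class flags mostly legitimate transactions, simply because legitimate transactions vastly outnumber fraudulent ones.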

Banking data, and in particular banking fraud data, are examined in detail. Raw credit card transaction data have 70-80 variables per transaction, and this number can be multiplied many-fold for behavioural data, as in fraud detection problems. Questions arise as to how to aggregate the data: should one try to classify individual transactions, or should activity records be constructed?
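One way to picture the aggregation question is to turn each raw transaction into an activity record that summarises the account's recent behaviour. The sketch below is only an illustration: the three-field transaction layout, the 24-hour window, and the summary variables are all assumptions made here, not the construction used in the talk.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical transactions: (account_id, timestamp, amount).
transactions = [
    ("A", datetime(2008, 1, 1, 9, 0), 20.0),
    ("A", datetime(2008, 1, 1, 9, 5), 35.0),
    ("A", datetime(2008, 1, 3, 14, 0), 500.0),
    ("B", datetime(2008, 1, 2, 12, 0), 12.5),
]


def activity_records(txns, window=timedelta(hours=24)):
    """For each transaction, summarise the same account's activity in the trailing window."""
    by_account = defaultdict(list)
    for account, ts, amount in sorted(txns, key=lambda t: (t[0], t[1])):
        by_account[account].append((ts, amount))

    records = []
    for account, history in by_account.items():
        for i, (ts, amount) in enumerate(history):
            # Amounts of this and earlier transactions that fall inside the window.
            recent = [a for t, a in history[: i + 1] if ts - t <= window]
            records.append({
                "account": account,
                "timestamp": ts,
                "amount": amount,
                "n_in_window": len(recent),
                "total_in_window": sum(recent),
                "max_in_window": max(recent),
            })
    return records


for record in activity_records(transactions):
    print(record["account"], record["amount"], record["n_in_window"], record["total_in_window"])
```

Each derived variable adds a column per window length and per summary, which is one way a few dozen raw variables can grow many-fold once behaviour is described.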

A fundamental aspect of any predictive problem in data analysis is the choice of an appropriate criterion for estimation and performance assessment. In the case of fraud, one needs, in particular, to combine classification accuracy with timeliness of classification. This means that standard measures of classification performance, such as error rate, AUC, the KS statistic, and information value, are not sufficient on their own. Suitable measures and performance curves are described which combine these aspects and which are now being adopted by the industry.
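For reference, three of the standard measures named above can be computed from suspicion scores and true labels alone, as in the following sketch. The scores, labels, and threshold are made-up illustrative values, and the function names are chosen here for the example. The measures proposed in the talk are not reproduced.

```python
def _split(scores, labels):
    """Separate scores into fraud (label 1) and legitimate (label 0) groups."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg


def error_rate(scores, labels, threshold):
    """Proportion misclassified when scores at or above the threshold are flagged."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


def auc(scores, labels):
    """Chance that a random fraud outscores a random legitimate case (ties count 1/2)."""
    pos, neg = _split(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(scores, labels):
    """Largest gap between the two classes' empirical score distributions."""
    pos, neg = _split(scores, labels)
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        gaps.append(abs(cdf_neg - cdf_pos))
    return max(gaps)


# Illustrative data: higher score means more suspicious; label 1 means fraud.
scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1]

print(error_rate(scores, labels, threshold=0.5))
print(auc(scores, labels))           # 0.875
print(ks_statistic(scores, labels))  # 0.75
```

None of these functions takes a time argument: a fraud flagged on its first transaction and one flagged after the account has been emptied contribute identically, which is the gap the description points to.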

Various statistical approaches (using ‘statistical’ in John Chambers’s sense of ‘greater statistics’) have been developed for fraud detection problems. Some are described and illustrated using data from some of the banks which have been collaborating with us. In particular, we look at supervised classification and anomaly detection methods. Finally, in the context of banking fraud, some of the deeper but very important conceptual issues are outlined, including the economic imperative, whether fraud is now becoming ‘acceptable’, and what exactly we learn from empirical comparisons.

Scientific fraud is contrasted with banking fraud. They have rather different drivers. In particular, financial gain is generally irrelevant to scientific fraud, which makes it an unusual kind of fraud - although, of course, the impact can be even more serious. Several examples are given, from a range of disciplines. The role of data analytic tools in detecting scientific fraud, and the nature of such tools, is described.

Slides
0:00 Statistical techniques for fraud detection, prevention, and evaluation
0:21 Research group:
1:50 Outline
2:31 Context
3:02 I: Background
3:34 Fraud occurs in all areas of human endeavour
4:55 Social aspects of fraud management:
5:47 The economic imperative
7:58 If we cannot outspend the fraudsters we must out-think them
8:12 General problems in fraud detection (1)
10:17 General problems in fraud detection (2)
11:33 II: How big is fraud? (1)
12:12 II: How big is fraud? (2)
12:38 Cost of fraud
14:10 Does this matter to you personally? - Example 1: Identity theft
14:56 Identity theft in the USA
16:01 Example 2: Advance fee fraud (the 419 scam)
16:54 How large are fraud datasets?
17:40 III: Fraud in banking
21:29 My main focus here is retail or consumer banking fraud
21:59 Nature of plastic card fraud data
24:18 Credit card data (70-80 variables per transaction):
25:18 A commercial example of fraud data
26:39 “Additional fraud-related variables which may also be considered are listed below” (1)
26:47 “Additional fraud-related variables which may also be considered are listed below” (2)
26:48 “Additional fraud-related variables which may also be considered are listed below” (3)
26:50 “Additional fraud-related variables which may also be considered are listed below” (4)
26:54 Unbalanced classes (1)
27:45 Unbalanced classes (2)
28:38 91% of suspected frauds are in fact legitimate
29:57 Delay in learning class labels
31:46 Mislabelled classes
33:14 Reactive population drift
35:31 e.g. variants of the 419 scam
35:53 Recall: Plastic card fraud in the UK (Gordon Blunt)
37:24 Reactive population drift example 1: Chip and PIN
39:47 Reactive population drift example 2: passwords
40:26 So they invented one-time passwords:
41:28 Our project:
42:47 What is a good system?
48:40 Different performance criteria may lead to different models
50:42 Distinguish between (1)
51:29 Distinguish between (2)
53:16 In itself, this would appear to be fine
53:27 Distinguish between (2)
53:34 In itself, this would appear to be fine
53:58 Distinguish between (2)
54:05 In itself, this would appear to be fine
54:39 A superior measure (1)
56:29 A superior measure (2)
57:44 A superior measure (3)
62:25 Performance plots
63:57 Constructing suspicion scores
66:16 Different approaches have different strengths and weaknesses
67:28 Some evidence for these things, but should be careful of generalising too freely
68:17 Rule-based methods
69:43 Supervised classification
70:26 Methods developed in several areas, including statistics, pattern recognition, machine learning, data mining
70:58 Example: Bank A: (Chris Whitrow)
72:53 Classification methods used in this study:
73:10 Two explorations:
76:04 Random performance
77:55 Limitations of such comparative studies
80:44 It is meaningless to evaluate methods out of context
80:57 One class modelling: outliers
83:22 Modelling the norm (1)
84:14 Modelling the norm (2)
85:07 There can be subtle complications
86:07 Example: Bank B: (Piotr Juszczak)
86:53 Preprocessing the categorical variables (MCC and ATM)
87:46 Similar for MCCs
88:49 Used several methods for building the pdfs:
89:04 Other, deeper questions
89:08 Like the poor, fraud is always with us
89:17 End
98:10 - Questions

Reviews and comments:

Comment1 Atif Abdul-Rahman, February 10, 2008 at 8:13 p.m.:

This is a very good presentation with good balance between breadth of coverage and specificity in issues like issues faced when building a model evaluation framework.

Mr. Hand's paper, Statistical Review of Fraud Detection, 2002 is also worth referencing.
