Statistical techniques for fraud detection, prevention, and evaluation
Description
The talk begins by setting the context: fraud is defined and its breadth outlined; figures
are given showing how significant fraud is; and different areas of fraud are examined,
including health care fraud, banking fraud, and scientific fraud.

The particular data analytic challenges of banking fraud are described and illustrated
in detail. These include the fact that the classes are highly unbalanced (with typically
no more than 1 in 1,000 transactions being fraudulent), that class labels may
often be incorrect, that there will typically be delays in discovering the true labels, that
the transaction arrival times are random, that the data are dynamic, and, perhaps most
challenging of all, that the distributions are reactive, changing in response to the implementation
of fraud detection systems. The role of mechanistic and empirical models
in tackling these problems is described. Both have been widely used, and both have a
contribution to make.

Banking data, and in particular banking fraud data, are examined in detail. Raw
credit card transaction data have 70-80 variables per transaction, and this can be multiplied
many-fold for behavioural data, as in fraud detection problems. Questions arise as
to how to aggregate the data: should one try to classify individual transactions or should
activity records be constructed?

A fundamental aspect of any predictive problem in data analysis is the choice of an
appropriate criterion for estimation and performance assessment. In the case of fraud,
one needs, in particular, to combine both classification accuracy and timeliness of classification.
This means that standard measures of classification performance, such as
error rate, AUC, the KS statistic, and information value, are not sufficient. Suitable measures
and performance curves are described which combine these aspects and which are
now being adopted by the industry.

Various statistical (used here in John Chambers’s sense of ‘greater statistics’) approaches
have been developed for fraud detection problems, and some are described
and illustrated, using data from some of the banks which have been collaborating with
us. In particular, we look at supervised classification and anomaly detection methods.
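
To make the class-imbalance point above concrete, here is a minimal sketch of the arithmetic, not the speaker's own calculation. Only the prevalence of 1 in 1,000 comes from the abstract; the detector's sensitivity and false positive rate, and the function name `alert_precision`, are hypothetical values and names chosen purely for illustration.

```python
def alert_precision(prevalence, sensitivity, false_positive_rate):
    """Fraction of flagged transactions that are truly fraudulent (Bayes' rule)."""
    true_alerts = prevalence * sensitivity
    false_alerts = (1.0 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# From the abstract: at most about 1 in 1,000 transactions is fraudulent.
prevalence = 1 / 1000

# Assumptions for illustration only (not figures from the talk).
sensitivity = 0.90          # share of frauds the detector flags
false_positive_rate = 0.01  # share of legitimate transactions it flags

# A "classifier" that never flags anything is wrong only on the frauds,
# so its error rate equals the prevalence while it detects no fraud at all.
trivial_error_rate = prevalence

precision = alert_precision(prevalence, sensitivity, false_positive_rate)

print(f"Error rate of never flagging anything: {trivial_error_rate:.3%}")
print(f"Share of alerts that are real fraud:   {precision:.1%}")
print(f"Share of alerts that are legitimate:   {1 - precision:.1%}")
```

Under these assumed rates, most alerts concern legitimate transactions even though the detector looks accurate on paper, which is one reason error rate alone is a poor criterion for such heavily unbalanced classes.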

Finally, in the context of banking fraud, some of the deeper but very important conceptual
issues are outlined, including the economic imperative, whether fraud is now
becoming ‘acceptable’, and what exactly we learn from empirical comparisons.

Scientific fraud is contrasted with banking fraud. They have rather different drivers.
In particular, financial gain is generally irrelevant to scientific fraud, which makes it
an unusual kind of fraud, although, of course, the impact can be even more serious.
Several examples are given, from a range of disciplines. The role of data analytic tools
in detecting scientific fraud, and the nature of such tools, is described.
| Slides | |
| 0:00 | Statistical techniques for fraud detection, prevention, and evaluation |
| 0:21 | Research group: |
| 1:50 | Outline |
| 2:31 | Context |
| 3:02 | I: Background |
| 3:34 | Fraud occurs in all areas of human endeavour |
| 4:55 | Social aspects of fraud management: |
| 5:47 | The economic imperative |
| 7:58 | If we cannot outspend the fraudsters we must out-think them |
| 8:12 | General problems in fraud detection (1) |
| 10:17 | General problems in fraud detection (2) |
| 11:33 | II: How big is fraud? (1) |
| 12:12 | II: How big is fraud? (2) |
| 12:38 | Cost of fraud |
| 14:10 | Does this matter to you personally? - Example 1: Identity theft |
| 14:56 | Identity theft in the USA |
| 16:01 | Example 2: Advance fee fraud (the 419 scam) |
| 16:54 | How large are fraud datasets? |
| 17:40 | III: Fraud in banking |
| 21:29 | My main focus here is retail or consumer banking fraud |
| 21:59 | Nature of plastic card fraud data |
| 24:18 | Credit card data (70-80 variables per transaction): |
| 25:18 | A commercial example of fraud data |
| 26:39 | “Additional fraud-related variables which may also be considered are listed below” (1) |
| 26:47 | “Additional fraud-related variables which may also be considered are listed below” (2) |
| 26:48 | “Additional fraud-related variables which may also be considered are listed below” (3) |
| 26:50 | “Additional fraud-related variables which may also be considered are listed below” (4) |
| 26:54 | Unbalanced classes (1) |
| 27:45 | Unbalanced classes (2) |
| 28:38 | 91% of suspected frauds are in fact legitimate |
| 29:57 | Delay in learning class labels |
| 31:46 | Mislabelled classes |
| 33:14 | Reactive population drift |
| 35:31 | e.g. variants of the 419 scam |
| 35:53 | Recall: Plastic card fraud in the UK (Gordon Blunt) |
| 37:24 | Reactive population drift example 1: Chip and PIN |
| 39:47 | Reactive population drift example 2: passwords |
| 40:26 | So they invented one-time passwords: |
| 41:28 | Our project: |
| 42:47 | What is a good system? |
| 48:40 | Different performance criteria may lead to different models |
| 50:42 | Distinguish between (1) |
| 51:29 | Distinguish between (2) |
| 53:16 | In itself, this would appear to be fine |
| 53:27 | Distinguish between (2) |
| 53:34 | In itself, this would appear to be fine |
| 53:58 | Distinguish between (2) |
| 54:05 | In itself, this would appear to be fine |
| 54:39 | A superior measure (1) |
| 56:29 | A superior measure (2) |
| 57:44 | A superior measure (3) |
| 62:25 | Performance plots |
| 63:57 | Constructing suspicion scores |
| 66:16 | Different approaches have different strengths and weaknesses |
| 67:28 | Some evidence for these things, but should be careful of generalising too freely |
| 68:17 | Rule-based methods |
| 69:43 | Supervised classification |
| 70:26 | Methods developed in several areas, including statistics, pattern recognition, machine learning, data mining |
| 70:58 | Example: Bank A: (Chris Whitrow) |
| 72:53 | Classification methods used in this study: |
| 73:10 | Two explorations: |
| 76:04 | Random performance |
| 77:55 | Limitations of such comparative studies |
| 80:44 | It is meaningless to evaluate methods out of context |
| 80:57 | One class modelling: outliers |
| 83:22 | Modelling the norm (1) |
| 84:14 | Modelling the norm (2) |
| 85:07 | There can be subtle complications |
| 86:07 | Example: Bank B: (Piotr Juszczak) |
| 86:53 | Preprocessing the categorical variables (MCC and ATM) |
| 87:46 | Similar for MCCs |
| 88:49 | Used several methods for building the pdfs: |
| 89:04 | Other, deeper questions |
| 89:08 | Like the poor, fraud is always with us |
| 89:17 | End |
| 98:10 | Questions |
This is a very good presentation, with a good balance between breadth of coverage and specificity on topics such as the issues faced when building a model evaluation framework.
Mr. Hand's 2002 paper, "Statistical Fraud Detection: A Review", is also worth referencing.