topic Re: Distance / Similarity with Events in Statistical Procedures

Distance / Similarity with Events

Ujjawal — Wed, 22 Nov 2017 21:50:53 GMT

I am building a fraud model for an insurance company. I have close to 2000 fraud claims and more than 1M non-fraud claims. Some of the "non-fraud" claims are actually fraud, because some claims were incorrectly captured and tagged as "non-fraud" in the data. I need to identify the claims that are likely to be fraud but are tagged as non-fraud. I was thinking of measuring the similarity (distance) between the "fraud" claims and the "non-fraud" claims: if a non-fraud claim has low similarity to the fraud claims, it is probably a genuine non-fraud claim. Can clustering (k-means) solve this problem? If I take k=2 and run k-means clustering, ideally all of my fraud plus "could-be fraud" claims should fall in one cluster and the non-fraud claims in the other. However, I have mixed variables, so k-means won't work properly. Is there any other algorithm that can solve this problem?

Re: Distance / Similarity with Events

Reeza — Wed, 22 Nov 2017 22:11:38 GMT

Try PROC DISCRIM and/or logistic regression (PROC LOGISTIC).

You have a small event rate, though (roughly 2000 frauds in over 1M claims, about 0.2%), so you also need to account for that.
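The post does not say which adjustment is meant. One common option is to fit the model on a sample in which non-events are down-sampled, then rescale the predicted probabilities back to the population event rate. The sketch below is a language-agnostic illustration in Python, not SAS code. The function names and the 18,000-row non-fraud sample are hypothetical, and the claim counts are the approximate figures from the question.

```python
def event_rate(n_events, n_nonevents):
    """Proportion of events in a data set."""
    return n_events / (n_events + n_nonevents)


def correct_for_sampling(p_sample, true_rate, sample_rate):
    """Rescale a probability from a model fit on a rebalanced sample
    so that it refers to the full population again.

    p_sample    -- predicted event probability from the rebalanced model
    true_rate   -- event rate in the full population
    sample_rate -- event rate in the rebalanced training sample
    """
    if not (0 < true_rate < 1 and 0 < sample_rate < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    # Weight each class by (population share / sample share).
    event_term = p_sample * true_rate / sample_rate
    nonevent_term = (1 - p_sample) * (1 - true_rate) / (1 - sample_rate)
    return event_term / (event_term + nonevent_term)


# Approximate counts from the question: ~2000 frauds, ~1M non-frauds.
true_rate = event_rate(2000, 1_000_000)

# Hypothetical training sample: all frauds plus 18,000 sampled non-frauds.
sample_rate = event_rate(2000, 18_000)

# A score of 0.50 on the rebalanced sample is far smaller on the population scale.
adjusted = correct_for_sampling(0.50, true_rate, sample_rate)
print(f"population event rate: {true_rate:.4%}")
print(f"sample event rate:     {sample_rate:.1%}")
print(f"adjusted probability:  {adjusted:.4f}")
```

The rescaling does not change the ranking of claims, only the scale of the probabilities, so it matters mainly when a probability cutoff is used rather than a ranked list.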

FYI: fraud analytics is essentially an unsupervised problem, because we don’t know exactly what the true categories are. In many respects it is still an unsolved problem, and SAS has a Fraud Analytics tool built specifically for it.

Are you using EM or Base SAS?

Re: Distance / Similarity with Events

PGStats — Wed, 22 Nov 2017 22:54:51 GMT

Look at PROC HPSPLIT to build a simple decision-tree model for fraud classification. Then look at the non-fraud cases that the model misclassifies as fraud. If you are right (and somewhat lucky), some of those should be overlooked fraud cases.
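The review step described above can be sketched as follows, assuming each claim already carries a model score. This is a Python illustration of the logic, not HPSPLIT output: the field names (`claim_id`, `label`, `score`), the toy data, and the 0.5 cutoff are all hypothetical. With an event rate this low, a lower cutoff or a fixed-size top-N list may be more practical.

```python
def review_candidates(claims, threshold=0.5):
    """Return claims labelled non-fraud that the model scores as fraud,
    ordered from highest to lowest score.

    claims    -- iterable of dicts with keys 'claim_id', 'label', 'score'
                 (label 1 = tagged fraud, 0 = tagged non-fraud)
    threshold -- score at or above which a claim counts as predicted fraud
    """
    flagged = [
        c for c in claims
        if c["label"] == 0 and c["score"] >= threshold
    ]
    return sorted(flagged, key=lambda c: c["score"], reverse=True)


# Toy data for illustration only; scores would come from the fitted model.
claims = [
    {"claim_id": "A", "label": 1, "score": 0.91},  # tagged fraud, scored fraud
    {"claim_id": "B", "label": 0, "score": 0.08},  # tagged non-fraud, scored non-fraud
    {"claim_id": "C", "label": 0, "score": 0.77},  # tagged non-fraud, scored fraud
    {"claim_id": "D", "label": 0, "score": 0.62},  # tagged non-fraud, scored fraud
    {"claim_id": "E", "label": 1, "score": 0.30},  # tagged fraud, scored non-fraud
]

for claim in review_candidates(claims):
    print(claim["claim_id"], claim["score"])
```

Only claims C and D come back here: they are tagged non-fraud but look like fraud to the model, so they are the ones worth a manual review.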