Re: Distance / Similarity with Events

Ujjawal · Posted 11-22-2017 04:50 PM

I am building a fraud model for an insurance company. I have close to 2,000 fraud claims and more than 1M non-fraud claims. Some of the "non-fraud" claims are actually fraud, because some claims were incorrectly captured and tagged as "non-fraud" in the data. I need to identify the claims that are likely to be fraud but are tagged as non-fraud. I was thinking of measuring the similarity (distance) between the fraud claims and the non-fraud claims: if a non-fraud claim has low similarity to the fraud claims, it is probably a genuine non-fraud claim. Can clustering (k-means) solve this problem? If I take k=2 and run k-means, ideally the fraud plus "could-be fraud" claims would fall into one cluster and the genuine non-fraud claims into the other. However, I have mixed variables (numeric and categorical), so k-means won't work properly. Is there another algorithm that can solve this problem?
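To make the distance idea concrete for mixed variables, here is a minimal sketch using Gower distance, which is one common choice for mixed data and is not something named in this thread. It is written in Python rather than SAS purely for illustration. The field names (`amount`, `channel`), the sample records, and the numeric range are all hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records stored as dicts.

    numeric_ranges maps each numeric field to its (max - min) range in the
    data; every field not listed there is treated as categorical.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            rng = numeric_ranges[key]
            # Numeric field: absolute difference scaled to [0, 1] by the range.
            total += abs(a[key] - b[key]) / rng if rng else 0.0
        else:
            # Categorical field: 0 if the values match, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def nearest_fraud_distance(claim, fraud_claims, numeric_ranges):
    """Distance from one claim to its closest known fraud claim."""
    return min(gower_distance(claim, f, numeric_ranges) for f in fraud_claims)


# Hypothetical data: one known fraud and two claims tagged as non-fraud.
fraud_claims = [{"amount": 9000.0, "channel": "agent"}]
non_fraud_claims = [
    {"amount": 8800.0, "channel": "agent"},
    {"amount": 500.0, "channel": "online"},
]
numeric_ranges = {"amount": 10000.0}  # assumed max - min of amount

distances = [
    nearest_fraud_distance(c, fraud_claims, numeric_ranges)
    for c in non_fraud_claims
]
```

In this toy example the first non-fraud claim sits much closer to the known fraud than the second, so it would be the one to review. Note that comparing every non-fraud claim against every fraud claim at the volumes described (over 1M by roughly 2,000) is expensive, so some sampling or blocking would likely be needed.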

Reeza · Posted 11-22-2017 05:11 PM

PROC DISCRIM and/or LOGISTIC REGRESSION.

You have a small event rate though so you also need to account for that.
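One common way to account for a small event rate, offered here as an assumption rather than as what was necessarily meant, is to train on a rebalanced sample (for example, all frauds plus a subsample of non-frauds) and then map the predicted probabilities back to the population scale. The sketch below is in Python rather than SAS for illustration only. The 50% sample rate is hypothetical, and the population rate uses the approximate counts from the question.

```python
def correct_for_oversampling(p_sample, sample_rate, population_rate):
    """Rescale a probability from a model fitted on a rebalanced sample.

    p_sample        -- predicted event probability from the rebalanced model
    sample_rate     -- event proportion in the training sample
    population_rate -- event proportion in the full population
    """
    # Reweight the event and non-event odds by population/sample proportions.
    event = p_sample * population_rate / sample_rate
    non_event = (1.0 - p_sample) * (1.0 - population_rate) / (1.0 - sample_rate)
    return event / (event + non_event)


# Approximate counts from the question: ~2,000 frauds, ~1M non-frauds.
population_rate = 2000 / 1_002_000
sample_rate = 0.5  # hypothetical: training sample balanced 50/50

# A score of 0.5 on the balanced sample maps back to the base rate.
adjusted = correct_for_oversampling(0.5, sample_rate, population_rate)
```

The correction only rescales the probabilities. It does not change the ranking of claims, so it matters mainly when the scores are compared against a fixed cutoff or interpreted as actual fraud probabilities.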

FYI - Fraud analytics is essentially an unsupervised problem, since we don’t know exactly what the categories are. In a lot of respects it’s still an unsolved problem, and SAS has a tool built specifically for fraud analytics.

Are you using EM or Base SAS?


PGStats · Posted 11-22-2017 05:54 PM

Look at HPSPLIT to build a simple model for fraud classification. Then look at the non-fraud cases misclassified as fraud. If you are right (and somewhat lucky), some of those should be overlooked fraud cases.
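The review step described above can be sketched as follows, assuming each claim has already been scored by whatever classifier was fitted. This is Python rather than SAS and does not reproduce HPSPLIT output. The `id`, `label`, and `score` fields, the sample records, and the 0.5 threshold are all hypothetical.

```python
def candidate_overlooked_frauds(claims, threshold=0.5):
    """Return claims tagged non-fraud (label 0) that the model scores as fraud.

    Results are sorted so the highest-scoring claims come first, which gives
    a natural order for manual review.
    """
    flagged = [c for c in claims if c["label"] == 0 and c["score"] >= threshold]
    return sorted(flagged, key=lambda c: c["score"], reverse=True)


# Hypothetical scored claims: label is the recorded tag, score is the
# model's predicted fraud probability.
scored_claims = [
    {"id": "C1", "label": 0, "score": 0.62},
    {"id": "C2", "label": 1, "score": 0.91},
    {"id": "C3", "label": 0, "score": 0.88},
    {"id": "C4", "label": 0, "score": 0.05},
]

review_queue = candidate_overlooked_frauds(scored_claims)
review_ids = [c["id"] for c in review_queue]
```

Here C3 and C1 come back for review, in that order, while the correctly tagged fraud C2 and the low-scoring C4 are left out. Since some of the training labels are themselves wrong, these flagged cases are candidates for investigation, not confirmed frauds.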

PG
