Ujjawal
Quartz | Level 8

I am building a fraud model for an insurance company. I have close to 2,000 fraud claims and more than 1M non-fraud claims. Some of the "non-fraud" claims are actually fraud, because some claims were captured incorrectly and tagged as "non-fraud" in the data. I need to identify the claims that are likely to be fraud but are tagged as non-fraud. My idea was to measure the similarity (distance) between the fraud claims and the non-fraud claims: a non-fraud claim with low similarity to the fraud claims is probably a genuine non-fraud, while one with high similarity is a candidate for mislabeled fraud. Can clustering (k-means) solve this problem? If I take k=2 and run k-means clustering, ideally all of my fraud plus "could-be fraud" claims should fall into one cluster and the genuine non-fraud claims into the other. However, I have mixed (numeric and categorical) variables, so k-means won't work properly. Is there another algorithm that can solve this problem?
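To make the distance idea concrete for mixed variables, here is a minimal sketch of a Gower-style dissimilarity (range-scaled difference for numeric fields, simple matching for categorical ones). It is only an illustration, not a recommended solution: the field names, the ranges, and the helper names are all hypothetical, and it does not address the 1M-row scale.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity between two records (dicts) with mixed variables.

    numeric_ranges maps each numeric field to its (max - min) over the data;
    every other field is treated as categorical.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            rng = numeric_ranges[key]
            # Range-scaled absolute difference, between 0 and 1
            total += abs(a[key] - b[key]) / rng if rng else 0.0
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def nearest_fraud_distance(claim, known_frauds, numeric_ranges):
    """Distance from one claim to its closest known fraud claim."""
    return min(gower_distance(claim, f, numeric_ranges) for f in known_frauds)


# Hypothetical toy data, for illustration only
ranges = {"amount": 10000.0, "prior_claims": 5.0}
known_frauds = [{"amount": 9000.0, "channel": "agent", "prior_claims": 3}]
claim_1 = {"amount": 8500.0, "channel": "agent", "prior_claims": 3}
claim_2 = {"amount": 500.0, "channel": "online", "prior_claims": 0}

d1 = nearest_fraud_distance(claim_1, known_frauds, ranges)
d2 = nearest_fraud_distance(claim_2, known_frauds, ranges)
```

In this toy data claim_1 ends up much closer to the known fraud than claim_2, so it would be the one to look at first.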

Reeza
Super User

Try PROC DISCRIM and/or logistic regression (PROC LOGISTIC).

 

Your event rate is very low, though (close to 2,000 frauds against more than 1M non-fraud claims), so you also need to account for that.
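One common way to account for a low event rate is to fit the model on a rebalanced sample (for example, all frauds plus a subsample of non-frauds) and then map the predicted probabilities back to the population scale. The sketch below shows that prior correction. It assumes a model fit on a 50/50 sample and uses the approximate counts from the question; the function name is hypothetical.

```python
def correct_for_sampling(p_sample, sample_rate, population_rate):
    """Map a probability from a model fit on a rebalanced sample
    back to the population event rate (standard prior correction)."""
    event = p_sample * population_rate / sample_rate
    non_event = (1.0 - p_sample) * (1.0 - population_rate) / (1.0 - sample_rate)
    return event / (event + non_event)


# Approximate figures from the question: ~2,000 frauds, ~1M non-frauds
population_rate = 2000 / 1_002_000
sample_rate = 0.5  # assumed: model fit on a 50/50 rebalanced sample

# A score of 0.5 on the balanced sample maps back to the population rate
adjusted = correct_for_sampling(0.5, sample_rate, population_rate)
```

The correction changes the scale of the scores but not the ranking of the claims, so it matters mainly when you need calibrated probabilities rather than an ordered review list.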

 

FYI: fraud analytics is essentially an unsupervised problem, because we don't know exactly what the true categories are. In many respects it is still an unsolved problem, and SAS has a tool specifically focused on fraud analytics.

 

Are you using Enterprise Miner (EM) or Base SAS?

 




 

PGStats
Opal | Level 21

Look at PROC HPSPLIT to build a simple model for fraud classification. Then look at the non-fraud cases that the model misclassifies as fraud. If you are right (and somewhat lucky), some of those should be overlooked fraud cases.
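The review step described above can be sketched as follows, independent of which procedure produced the scores. This is a minimal illustration with made-up data: the record layout (`claim_id`, `label`, `score`), the 0.5 cutoff, and the function name are assumptions, not output from HPSPLIT.

```python
def review_candidates(claims, threshold=0.5):
    """Return claims tagged non-fraud (label 0) that the model scores as fraud,
    ordered from most to least suspicious."""
    flagged = [c for c in claims if c["label"] == 0 and c["score"] >= threshold]
    return sorted(flagged, key=lambda c: c["score"], reverse=True)


# Hypothetical scored claims: label is the recorded tag, score is the model's
# predicted fraud probability
scored = [
    {"claim_id": "A", "label": 1, "score": 0.91},
    {"claim_id": "B", "label": 0, "score": 0.07},
    {"claim_id": "C", "label": 0, "score": 0.64},
    {"claim_id": "D", "label": 0, "score": 0.88},
]

to_review = [c["claim_id"] for c in review_candidates(scored)]
```

Here claims D and C would go to manual review, in that order. It probably helps to score each claim with a model that did not see that claim during training (a holdout or cross-validated score), so the model cannot simply memorize the recorded tag.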

PG
