hakonstrand
Calcite | Level 5

Hello!

I am a student working on a project to identify fraud in e-commerce transactions.

Fraud (the target) is a rare event, around 5% of the observations, which leads the model to classify everything as not fraud.

I am looking for a way to oversample without losing observations. When I use sampling to make the data 50/50 fraud and not fraud, the program keeps only the fraud cases (2000 observations) plus 2000 random non-fraud observations, so I lose the rest of the dataset, which is almost 25000 observations.

Is there a way of making the dataset 50/50 (fraud, not fraud) without removing observations? My thought is to duplicate fraud observations in the training dataset only (making it 50%/50%), leaving the test dataset as it was (95% not fraud, 5% fraud). Is there a step-by-step method of doing this in SAS?

I also have a lot of variables, almost 600. I am thinking of using PCA to get the most relevant of these. How can I do this using SAS?

Ksharp
Super User

Use PROC PLS or PROC HPGENSELECT to select variables.
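
As a rough illustration of the PROC HPGENSELECT route, here is a minimal sketch of stepwise selection for a binary target. The data set name work.train, the target name fraud (assumed numeric 0/1), and the inputs x1-x600 are hypothetical placeholders, not names from the original question, and the default selection settings are used rather than tuned.

```sas
/* Hypothetical names: work.train, fraud (0/1 target), x1-x600 (numeric inputs). */
proc hpgenselect data=work.train;
   /* Model the probability that fraud = 1 with a binary distribution. */
   model fraud(event='1') = x1-x600 / dist=binary;
   /* Stepwise selection with the procedure's default criteria. */
   selection method=stepwise;
run;
```

The selection summary in the output lists which effects entered the model; those can then be used as the reduced input set.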

MBRACH
Calcite | Level 5

Here are some ideas on how to deal with rare cases:

http://support.sas.com/documentation/cdl/en/emxndg/67980/HTML/default/viewer.htm#p1w6fewo0jhzxdn1ryt...

Here is a paper about a nice method called SMOTE, but unfortunately this version works only on continuous variables:

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/3604-2018.pdf
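
If plain duplication is enough (rather than SMOTE), the idea from the question can be sketched in Base SAS roughly as follows: split first, then replicate fraud rows in the training part only. This is a minimal sketch, not a tested recipe. The names work.transactions and fraud (assumed numeric, 1 = fraud, 0 = not fraud), the 70/30 split, and the seed are all assumptions made for illustration.

```sas
/* Hypothetical names: work.transactions, fraud (numeric 0/1). */

/* 1. STRATA requires the input to be sorted by the stratum variable. */
proc sort data=work.transactions out=work.sorted;
   by fraud;
run;

/* 2. Stratified 70/30 split; OUTALL keeps every row and adds Selected (1/0). */
proc surveyselect data=work.sorted out=work.split
                  method=srs samprate=0.7 seed=12345 outall;
   strata fraud;
run;

data work.train work.test;
   set work.split;
   if Selected = 1 then output work.train;
   else output work.test;
   drop Selected;
run;

/* 3. Number of copies per fraud row needed for roughly 50/50 in training. */
proc sql noprint;
   select round(sum(fraud = 0) / sum(fraud = 1))
      into :n_copies trimmed
      from work.train;
quit;

/* 4. Write each fraud row n_copies times; non-fraud rows once. */
data work.train_balanced;
   set work.train;
   if fraud = 1 then do _copy = 1 to &n_copies;
      output;
   end;
   else output;
   drop _copy;
run;
```

With about 5% fraud the ratio works out to about 19 copies per fraud row. No observation is discarded, and work.test keeps the original fraud rate. Keep in mind that a model fit on work.train_balanced produces predicted probabilities that reflect the 50/50 mix, not the original 5% rate.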
