hakonstrand
Calcite | Level 5

Hello!

 

I am a student working on a project trying to identify fraud in E-Commerce transactions.

Fraud (the target) is a rare event, around 5% of the observations, which leads the model to classify everything as not fraud.

I am looking for a way to oversample without losing most of the observations. When I use sample to make fraud and non-fraud 50/50, the program keeps only the fraud cases (2000 observations) and 2000 random non-fraud observations, so I lose the rest of the dataset, which is almost 25000 observations.

Is there a way to make the dataset 50/50 (fraud, not fraud) without removing observations? My idea is to duplicate fraud observations in the training dataset only (making it 50/50), leaving the test dataset as it was (95% not fraud, 5% fraud). Is there a step-by-step method for doing this in SAS?

 

I also have a lot of variables, almost 600. I am thinking of using PCA to reduce these to the most relevant ones. How can I do this in SAS?

2 REPLIES
Ksharp
Super User

Use PROC PLS or PROC HPGENSELECT to pick up variables.
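
As a rough illustration of that suggestion, here is a minimal sketch of stepwise selection with PROC HPGENSELECT, followed by PROC PRINCOMP for the PCA mentioned in the question. The dataset name work.train, the 1/0 target fraud, the input names x1-x600, and the choice of 20 components are all assumptions made for the example, not details from the original post. Note that PCA builds new combined components without looking at the target, so it reduces dimensions rather than picking the most relevant original variables.

```sas
/* Assumed names: work.train (training data), fraud = 1/0 (target),
   x1-x600 (numeric candidate inputs). Adjust to the real dataset. */

/* Supervised selection: a stepwise logistic model keeps the inputs
   that help predict fraud. */
proc hpgenselect data=work.train;
    model fraud(event='1') = x1-x600 / distribution=binary;
    selection method=stepwise;
run;

/* Unsupervised reduction: keep the first 20 principal components
   (20 is an arbitrary choice for illustration). PROC PRINCOMP uses
   the correlation matrix by default, so inputs are standardized.
   OUT= holds the original data plus the scores Prin1-Prin20. */
proc princomp data=work.train n=20 out=work.train_pc;
    var x1-x600;
run;
```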

MBRACH
Calcite | Level 5

Here are some ideas on how to deal with rare cases:

http://support.sas.com/documentation/cdl/en/emxndg/67980/HTML/default/viewer.htm#p1w6fewo0jhzxdn1ryt...

 

Here is a paper about a nice method called SMOTE, but unfortunately this version works only on continuous variables:

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/3604-2018.pdf
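
If SMOTE does not fit because of categorical variables, plain duplication of the fraud rows in the training partition only, as proposed in the question, is a simpler option. Below is a minimal sketch of that idea, not a tested recipe. It assumes an input dataset work.transactions with a target fraud coded 1/0; those names, the 70/30 split, and the seed are illustrative choices.

```sas
/* Assumed names: work.transactions (input), fraud = 1/0 (target). */

/* 1. Stratified 70/30 split so both parts keep the ~5% fraud rate.
      STRATA needs the data sorted by the strata variable. */
proc sort data=work.transactions out=work.sorted;
    by fraud;
run;

proc surveyselect data=work.sorted out=work.split
        method=srs samprate=0.7 seed=12345 outall;
    strata fraud;
run;

/* OUTALL adds the flag Selected (1 = sampled, 0 = not sampled).
   If SelectionProb or SamplingWeight appear in work.split,
   drop them as well so they are not used as model inputs. */
data work.train work.test;
    set work.split;
    if selected = 1 then output work.train;
    else output work.test;
    drop selected;
run;

/* 2. Count each class in the training part. */
proc sql noprint;
    select sum(fraud = 1), sum(fraud = 0)
        into :n_fraud trimmed, :n_ok trimmed
    from work.train;
quit;

/* 3. Write each fraud row round(n_ok / n_fraud) times and every
      other row once, giving roughly a 50/50 training set. */
data work.train_balanced;
    set work.train;
    if fraud = 1 then do _copy = 1 to round(&n_ok / &n_fraud);
        output;
    end;
    else output;
    drop _copy;
run;

/* 4. Check the new class mix; work.test is left untouched. */
proc freq data=work.train_balanced;
    tables fraud;
run;
```

A model fitted on work.train_balanced will report fraud probabilities that are too high relative to the real 5% rate, so judge it on work.test, which keeps the original class mix.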

