Hello!
I am a student working on a project trying to identify fraud in E-Commerce transactions.
The fraud cases (the target) are rare events (around 5% of the observations), which leads the model to classify everything as not fraud.
I am looking for a way to oversample without losing observations. When I use sampling to make the fraud and non-fraud observations 50/50, the program keeps only the 2000 fraud cases plus 2000 random non-fraud observations, so I lose the rest of the dataset, which is almost 25000 observations.
Is there a way of making the dataset 50/50 (fraud, not fraud) without removing observations? My thought is that it might just duplicate fraud observations in the training dataset (making it 50%/50%), leaving the test dataset as it was (95% not fraud, 5% fraud). Any step-by-step method of doing this in SAS?
I also have a lot of variables, almost 600. I am thinking of using PCA to get the most relevant of these. How can I do this using SAS?
Use PROC PLS or PROC HPGENSELECT to select variables.
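A minimal sketch of both routes, in case it helps. The dataset name work.train, the 0/1 target is_fraud, the predictor names x1-x600, and the choice of 20 components are all assumptions for illustration, so substitute your own. Note that PCA (PROC PRINCOMP) does not pick the most relevant original variables: it builds new components that explain the variance of the inputs without looking at the target. PROC HPGENSELECT, by contrast, selects among the original variables using the target.

```sas
/* Assumed names: work.train (training data), is_fraud (0/1 target),
   x1-x600 (numeric candidate predictors). */

/* Route 1: PCA. Adds the first 20 component scores (Prin1-Prin20)
   to the output dataset; the original variables are kept as well. */
proc princomp data=work.train out=work.train_pca n=20;
   var x1-x600;
run;

/* Route 2: supervised variable selection with a binary (logistic) model. */
proc hpgenselect data=work.train;
   model is_fraud(event='1') = x1-x600 / dist=binary;
   selection method=stepwise;
run;
```

If you go the PCA route, fit it on the training data only and score the test data with the same components, otherwise information from the test set leaks into the model.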
Here are some ideas on how to deal with rare cases:
Here is a paper about a nice method called SMOTE. Unfortunately, this version works only on continuous variables:
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/3604-2018.pdf
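For the plain duplication approach described in the question (balance the training data, leave the test data at its natural fraud rate), something along these lines should work. It is only a sketch: work.transactions, is_fraud, the 70/30 split, and the seed are assumed names and values, not anything from your data.

```sas
/* Assumed names: work.transactions (full data), is_fraud (0/1 target). */

/* 1. Stratified 70/30 split. STRATA needs the data sorted by the
      stratum variable; OUTALL keeps every row and adds Selected = 1/0. */
proc sort data=work.transactions out=work.sorted;
   by is_fraud;
run;

proc surveyselect data=work.sorted out=work.split
                  method=srs samprate=0.7 seed=12345 outall;
   strata is_fraud;
run;

data work.train work.test;
   set work.split;
   if Selected = 1 then output work.train;
   else output work.test;
run;

/* 2. Work out how many copies of each fraud row are needed
      (about 19 when fraud is 5% of the training data). */
proc sql noprint;
   select round(sum(is_fraud = 0) / sum(is_fraud = 1))
      into :n_copies trimmed
   from work.train;
quit;

/* 3. Duplicate fraud rows in the training set only. */
data work.train_balanced;
   set work.train;
   if is_fraud = 1 then do copy = 1 to &n_copies;
      output;
   end;
   else output;
   drop copy;
run;
```

Fit the model on work.train_balanced and evaluate it on work.test, which still has the original class mix. Because the training data no longer reflects the real fraud rate, the predicted probabilities will be too high, so adjust them (or the classification cutoff) before using them as actual fraud probabilities.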