Hello!
I am a student working on a project trying to identify fraud in E-Commerce transactions.
The target (fraud) is a rare event, around 5% of the observations, which leads the model to classify everything as not fraud.
I am looking for a way to oversample without losing observations. When I sample to get a 50/50 split of fraud and non-fraud, the program keeps only the 2000 fraud cases plus 2000 random non-fraud observations, so I lose the rest of the dataset, which is almost 25000 observations.
Is there a way of making the dataset 50/50 (fraud, not fraud) without removing observations? My thought is to duplicate the fraud observations in the training dataset (making it 50%/50%) while leaving the test dataset as it was (95% not fraud, 5% fraud). Is there a step-by-step method for doing this in SAS?
I also have a lot of variables, almost 600. I am thinking of using PCA to get the most relevant of these. How can I do this using SAS?
Use PROC PLS or PROC HPGENSELECT to select variables.
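As a rough illustration of the PROC HPGENSELECT route, here is a minimal sketch. The dataset name work.train, the 0/1 target fraud, and the inputs x1-x600 are placeholder names, not taken from the original post, and stepwise selection with the SBC criterion is only one reasonable choice. Any categorical inputs would also need to be listed in a CLASS statement.

```sas
/* Sketch only: work.train, fraud and x1-x600 are placeholder names. */
proc hpgenselect data=work.train;
   /* Binary logistic model; event='1' models the probability of fraud */
   model fraud(event='1') = x1-x600 / dist=binary link=logit;
   /* Stepwise selection; the final model is the step with the best SBC */
   selection method=stepwise(choose=sbc);
run;
```

The selection summary in the output lists which effects entered the model, and that list can then be used as the reduced input set.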
Here are some ideas on how to deal with rare events:
Here is a paper about a nice method called SMOTE; unfortunately, this version works only on continuous variables:
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/3604-2018.pdf
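If plain duplication of the fraud rows (as described in the question) is enough, it could be sketched roughly as follows. This is not SMOTE, just simple replication. The names work.transactions, work.train, work.test, and fraud (coded 0/1) are assumptions, as are the 70/30 split and the seed; the sketch also assumes the training data contains at least one fraud row.

```sas
/* Sketch only: dataset and variable names are placeholders. */

/* 1. Random 70/30 split; the test set keeps its natural fraud rate */
data work.train work.test;
   set work.transactions;
   if _n_ = 1 then call streaminit(12345);   /* reproducible split */
   if rand('uniform') < 0.7 then output work.train;
   else output work.test;
run;

/* 2. Count fraud and non-fraud rows in the training data only */
proc sql noprint;
   select sum(fraud = 0), sum(fraud = 1)
      into :n_nonfraud trimmed, :n_fraud trimmed
   from work.train;
quit;

/* 3. Write each fraud row several times, each non-fraud row once */
data work.train_balanced;
   set work.train;
   if fraud = 1 then do _copy = 1 to round(&n_nonfraud / &n_fraud);
      output;
   end;
   else output;
   drop _copy;
run;
```

With about 5% fraud, each fraud row is written roughly 19 times, so work.train_balanced ends up close to 50/50 while no non-fraud row is discarded. Because the model is trained on an inflated fraud rate, its predicted probabilities will be too high for the real population and should be adjusted or re-calibrated before use on work.test.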