Fraud Detection Using Supervised Machine Learning

hakonstrand · Posted 03-15-2019 05:29 AM

Hello!

I am a student working on a project to identify fraud in e-commerce transactions.

The target (fraud) is a rare event, around 5% of the observations, which leads the model to classify everything as not fraud.

I am looking for a method of oversampling that does not throw away observations. When I use sample to make the fraud and non-fraud observations 50/50, the program keeps only the fraud cases (2,000 observations) plus 2,000 random non-fraud observations, so I lose the rest of the dataset, which is almost 25,000 observations.

Is there a way of making the dataset 50/50 (fraud, not fraud) without removing observations? My thought is to duplicate fraud observations in the training dataset only (making it 50%/50%) and leave the test dataset as it was (95% not fraud, 5% fraud). Is there a step-by-step method of doing this in SAS?
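
To make the idea concrete, here is a minimal sketch of what I mean, written in plain Python rather than SAS just to show the logic. The function name `split_then_oversample`, the `fraud` key (1 = fraud, 0 = not fraud), and the 30% test share are all assumptions for illustration, not anything from my actual project. The important part is the order: split first, then duplicate fraud rows in the training part only.

```python
import random


def split_then_oversample(rows, label_key="fraud", test_fraction=0.3, seed=0):
    """Split rows into train/test, then balance ONLY the training part.

    rows is a list of dicts; rows[i][label_key] is 1 for fraud, 0 otherwise.
    (Hypothetical layout, chosen only for this illustration.)
    """
    rng = random.Random(seed)

    # 1. Shuffle a copy and hold out the test set before any duplication,
    #    so the test set keeps the original fraud rate.
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    train = shuffled[n_test:]

    # 2. Separate the two classes inside the training part.
    fraud = [r for r in train if r[label_key] == 1]
    legit = [r for r in train if r[label_key] == 0]
    if not fraud:
        raise ValueError("no fraud rows in the training part")

    # 3. Draw fraud rows with replacement until both classes have the
    #    same count. No original observation is removed.
    n_extra = max(0, len(legit) - len(fraud))
    extra = [rng.choice(fraud) for _ in range(n_extra)]

    balanced_train = train + extra
    rng.shuffle(balanced_train)
    return balanced_train, test
```

Because the duplicates are added after the split, no copy of a training fraud row can end up in the test set, and every non-fraud observation stays in the data.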

I also have a lot of variables, almost 600. I am thinking of using PCA to reduce these to the most relevant ones. How can I do this using SAS?
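
For reference, this is my understanding of what PCA computes, as a small sketch in plain Python (again not SAS, and only for the first component). The function name `first_principal_component` and the power-iteration approach are my own choices for illustration. It assumes every column is numeric and has non-zero variance. Note that the result is a weighted combination of all the variables rather than a subset of them, and it does not use the fraud target at all.

```python
import math
import random


def first_principal_component(data, iterations=200, seed=0):
    """Return (weights, share) for the first principal component.

    data is a list of equal-length numeric rows. Columns are standardized
    first, so this is PCA on the correlation matrix. share is the fraction
    of total variance explained by the component.
    """
    n, p = len(data), len(data[0])

    # Standardize every column to mean 0 and standard deviation 1.
    means = [sum(row[j] for row in data) / n for j in range(p)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (n - 1))
        for j in range(p)
    ]
    z = [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in data]

    # Correlation matrix of the standardized columns.
    corr = [
        [sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
        for a in range(p)
    ]

    # Power iteration: repeatedly multiply by the matrix and renormalize;
    # the vector converges to the eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.5) for _ in range(p)]
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Rayleigh quotient gives the eigenvalue; the eigenvalues of a
    # correlation matrix sum to p, so eigenvalue / p is the variance share.
    eigenvalue = sum(
        v[a] * sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)
    )
    return v, eigenvalue / p
```

With 600 variables I would want many components, not just the first, so I assume a built-in procedure is the right tool rather than anything hand-written like this.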