Hello! I am a student working on a project trying to identify fraud in E-Commerce transactions. The fraud(target) are rare events (around 5 % of the observations) which leads the model to classify everything as not fraud. I am thinking of a method of oversampling without losing all the observatoions. When I'm using sample to make observations of fraud and not fraud 50/50, the program takes only the fraudcases, 2000 observations, and 2000 random non fraud observations, making me lose the rest of the dataset, which is almost 25000 observations. Is there a way of making the dataset 50/50 (fraud, not fraud) without removing observations? My thought is that it might just duplicate fraud observations in the training dataset (making it 50%/50%), leaving the test dataset as it was (95% not fraud, 5% fraud). Any step-by-step method of doing this in SAS? I also have a lot of variables, almost 600. I am thinking of using PCA to get the most relevant of these. How can I do this using SAS?
... View more