05-27-2015 01:25 PM
I am working on an online fraud classification case where I have 364 variables and 50,000 observations. 300 of these variables are binary product variables, each indicating whether the purchase was of a specific product or not. I think there must be some information hidden in these variables, but I can't figure out a good way of dealing with them. Does anyone have an idea?
05-27-2015 02:26 PM
MBA - market basket analysis - which products are likely to be purchased together?
If only one of the 300 is set for every observation, then change the data structure to have a single product variable instead?
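A rough sketch of both ideas in Python, in case it helps to see them side by side. The product names and the toy rows are made up for illustration, and the helper functions (collapse_if_one_hot, pair_counts) are hypothetical names, not anything from SAS. The first helper checks whether every row has exactly one flag set and, if so, collapses the flags into one product variable; the second counts how often two products appear in the same purchase, which is the starting point for a market basket analysis.

```python
from collections import Counter
from itertools import combinations


def collapse_if_one_hot(rows, product_names):
    """Return one product label per row if every row has exactly one
    flag set; return None as soon as a row breaks that assumption."""
    labels = []
    for row in rows:
        if sum(row) != 1:
            return None
        labels.append(product_names[row.index(1)])
    return labels


def pair_counts(rows, product_names):
    """Count how often each pair of products occurs in the same row."""
    counts = Counter()
    for row in rows:
        bought = [name for name, flag in zip(product_names, row) if flag]
        counts.update(combinations(bought, 2))
    return counts


# Toy data (assumed): 3 products instead of 300, 4 purchases.
products = ["prod_a", "prod_b", "prod_c"]
baskets = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

labels = collapse_if_one_hot(baskets, products)  # None here: some rows have several flags
top_pairs = pair_counts(baskets, products).most_common(1)
```

If collapse_if_one_hot returns None, the single-variable restructuring does not apply and the pair counts are the more useful view.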
05-28-2015 10:36 AM
As Reeza pointed out, maybe you could encode that categorical variable as a numeric variable with PROC GLMSELECT, then fit it in the model.
05-29-2015 10:58 AM
There is a data mining approach for rare events, often used to flag fraud. Give it a try without transforming or rejecting variables just yet. Try clustering your data, and if you have a few flagged or confirmed fraud cases, you can train a predictive model for each cluster. The hope is that your fraudsters follow different patterns than the rest of your customers, so you would see a higher concentration of fraudsters in certain clusters.
Make sure your clusters make sense and decide whether you need to standardize or tweak your clustering. For your 300 binary variables you do not need to standardize, but do standardize any other inputs that are on very different scales.
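To make the clustering idea concrete, here is a minimal sketch in Python. The post above does not name a clustering algorithm, so plain k-means with a deterministic farthest-point start is my assumption, and the data (three product flags, an "amount" input, and fraud labels) is invented. Only the amount column is standardized; the binary flags are left as they are. After clustering, the fraud rate per cluster shows whether the flagged cases concentrate anywhere.

```python
def standardize(column):
    """Z-score a numeric column (population standard deviation)."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std if std > 0 else 0.0 for x in column]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means. Start from the first point, then repeatedly add
    the point farthest from the centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def fraud_rate_by_cluster(assign, is_fraud):
    """Share of flagged fraud cases within each cluster."""
    totals, frauds = {}, {}
    for a, f in zip(assign, is_fraud):
        totals[a] = totals.get(a, 0) + 1
        frauds[a] = frauds.get(a, 0) + f
    return {c: frauds[c] / totals[c] for c in totals}


# Toy data (assumed): binary product flags plus one input on a different scale.
flags = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 1]]
amount = [20.0, 25.0, 30.0, 900.0, 950.0, 1000.0]
is_fraud = [0, 0, 0, 1, 0, 1]

amount_z = standardize(amount)
points = [row + [z] for row, z in zip(flags, amount_z)]
assign = kmeans(points, k=2)
rates = fraud_rate_by_cluster(assign, is_fraud)
```

Clusters with a clearly higher rate are the candidates for their own predictive model. With only a handful of confirmed cases per cluster the rates will be noisy, so treat them as a first look rather than a result.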
SAS® Does Data Science: How to Succeed in a Data Science Competition
Compare this approach to Reeza's and Xia's suggestions.