I am trying to build a logistic regression model for campaign scoring.
This is for retail grocery.
What I have is two types of data:
Target event: customer_id, response(Y/N)
Customer Transaction data: customer_id, transaction_Date, channel, product_type, price
Here channel and product_type are categorical variables.
Usually I convert categorical variables into dummy variables, but in this case there are 20+ channels and 100+ product types, so I am not sure what to do. Should I run a cluster analysis on the categorical variables before I aggregate them to the customer level, or is there a better way? Please help. Thanks.
How big is your data set? One option would be to combine the data by categories and then create multiple variables.
i.e.
TotSpend_Sports, TotSpend_Food, TotSpend_Kids, TotSpend_Housewares, etc
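A minimal sketch of that aggregation, assuming the transactions sit in a data set called work.transactions (a hypothetical name) and that product_type actually contains values such as 'Sports' and 'Food' (also an assumption, taken from the variable names above):

```sas
/* Assumed input: work.transactions with customer_id, product_type, price. */
/* The product_type values used here are illustrative only.               */
proc sql;
    create table work.cust_spend as
    select customer_id,
           sum(case when product_type = 'Sports'     then price else 0 end) as TotSpend_Sports,
           sum(case when product_type = 'Food'       then price else 0 end) as TotSpend_Food,
           sum(case when product_type = 'Kids'       then price else 0 end) as TotSpend_Kids,
           sum(case when product_type = 'Housewares' then price else 0 end) as TotSpend_Housewares,
           sum(price) as TotSpend_All,
           count(*)   as n_transactions
    from work.transactions
    group by customer_id;   /* one output row per customer */
quit;
```

This gives one row per customer, which can then be joined to the target data on customer_id. Writing the CASE expressions by hand is only practical for a handful of categories.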
There's no need to create dummy variables for PROC LOGISTIC. The CLASS statement in PROC LOGISTIC will handle that for you.
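As a rough illustration, assuming a customer-level data set work.model_data with the Y/N response, a categorical predictor preferred_channel, and two spend totals (all of these names are hypothetical), the call could look like this:

```sas
/* Hypothetical customer-level data set: one row per customer_id. */
proc logistic data=work.model_data;
    /* CLASS builds the design (dummy) columns internally;        */
    /* PARAM=REF requests reference-cell coding.                  */
    class preferred_channel (param=ref ref=first);
    /* EVENT='Y' models the probability of a response.            */
    model response(event='Y') = preferred_channel TotSpend_Food TotSpend_Sports;
run;
```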
If you have 100+ product types, this could be a problem for the analysis. If there is some logical way to group some of these product types together, I would give that a try.
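One way to try such a grouping is a user-defined format. The sketch below assumes product_type is a character variable; the product values, the group names, and the format name $prodgrp are all made up for illustration:

```sas
/* Map detailed product types to a smaller set of groups. */
proc format;
    value $prodgrp
        'Apples', 'Bananas', 'Lettuce' = 'Produce'
        'Milk', 'Cheese', 'Yogurt'     = 'Dairy'
        other                          = 'Other';
run;

/* Store the grouped value as a new variable on each transaction. */
data work.transactions_grouped;
    set work.transactions;
    length product_group $ 20;
    product_group = put(product_type, $prodgrp.);
run;
```

The new product_group variable can then be used in place of product_type when aggregating to the customer level.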
Thanks very much for your help.
To make sure I am not making a mistake: for PROC LOGISTIC, I can't have multiple data records for each customer, so I need to prepare the data as below, right?
Customer1: Response, Predictor variable 1, ....., predictor variable N,
Customer2: Response, Predictor variable 1, ....., predictor variable N,
...
CustomerN: Response, Predictor variable 1, ....., predictor variable N,
The biggest issue I am facing is that, since a customer can have multiple transactions, I don't know how to aggregate them to the customer level without losing behavioral information such as channel and product type.
I can easily aggregate the amount spent by summing it, but what should I do with a categorical variable?
Thanks very much.
You could either
1) Generate your predictor matrix with PROC TRANSPOSE. This requires replacing missing values with zeros.
or
2) Generate your predictor matrix with PROC LOGISTIC using the OUTDESIGN= and OUTDESIGNONLY options. This requires that you sum the predictor values for each customer.
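A sketch of option 1, assuming hypothetical data sets work.transactions (customer_id, product_type, price) and work.target (customer_id, response). It also assumes the product_type values can serve as variable-name suffixes; depending on the VALIDVARNAME= setting, SAS may alter values that are not valid names:

```sas
/* Step 1: total spend per customer and product type (long format). */
proc summary data=work.transactions nway;
    class customer_id product_type;
    var price;
    output out=work.spend_long (drop=_type_ _freq_) sum=total_spend;
run;

/* Step 2: one row per customer, one TotSpend_ column per product type. */
/* With NWAY, the output of step 1 is already ordered by customer_id.   */
proc transpose data=work.spend_long
               out=work.spend_wide (drop=_name_)
               prefix=TotSpend_;
    by customer_id;
    id product_type;
    var total_spend;
run;

/* Step 3: attach the response and replace missing spend with zero. */
proc sort data=work.target;
    by customer_id;
run;

data work.model_data;
    merge work.target (in=in_target) work.spend_wide;
    by customer_id;
    if in_target;                    /* keep only customers with a target row */
    array spend {*} TotSpend_:;      /* all transposed spend columns          */
    do i = 1 to dim(spend);
        if missing(spend{i}) then spend{i} = 0;
    end;
    drop i;
run;
```

The same pattern works for channel: swap product_type for channel and use a different prefix.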
If you have repeated measurements on your customers, then you can use PROC GENMOD to fit the logistic model (use the DIST=BIN option) and the REPEATED statement with your subject variable in the SUBJECT= option. That will fit a GEE model that adjusts for the correlation within subjects.
If you have 100+ levels of a categorical predictor in the CLASS statement, this will probably cause model fitting problems. If there is any logical grouping of these levels into a smaller set of levels, then that is more likely to work. You could either use the DATA step with IF THEN ELSE statements to create a new grouping variable based on the old one, or you could use PROC FORMAT to create a format that groups the levels as desired.
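A minimal sketch of the GEE approach, assuming a transaction-level data set work.trans_resp (a hypothetical name) in which the customer's Y/N response has already been merged onto every transaction row. The exchangeable working correlation is an assumption for illustration, not a recommendation:

```sas
/* Hypothetical input: one row per transaction, with response repeated per customer. */
proc genmod data=work.trans_resp;
    /* The SUBJECT= variable must also be listed in the CLASS statement. */
    class customer_id channel;
    /* Binomial distribution with logit link gives a logistic model;     */
    /* EVENT='Y' models the probability of a response.                    */
    model response(event='Y') = channel price / dist=bin link=logit;
    /* REPEATED requests GEE estimation with within-customer correlation. */
    repeated subject=customer_id / type=exch;
run;
```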