Solved: Re: Preparing categorical variable for logistic regression

Fae · Posted 03-26-2018 01:45 PM

I am trying to build a logistic regression model for campaign scoring.

This is for retail grocery.

I have two types of data.

Target event: customer_id, response(Y/N)

Customer Transaction data: customer_id, transaction_Date, channel, product_type, price

Here channel and product_type are categorical variables.

Usually I convert categorical variables into dummy variables, but in this case there are 20+ channels and 100+ product types, so I am not sure what to do. Should I run a cluster analysis on the categorical variables before I aggregate them to the customer level, or is there a better way? Please help. Thanks.

PaigeMiller · Posted 03-26-2018 01:57 PM

There's no need to create dummy variables for PROC LOGISTIC. The CLASS statement in PROC LOGISTIC will handle that for you.

If you have 100+ product types, this could be a problem for the analysis. If there is some logical way to group some of these product types together, I would give that a try.
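As a minimal sketch of the CLASS statement in use: the data set cust_level and the variables total_spend and main_channel are hypothetical names for a table that already has one row per customer; only response comes from the original question.

```sas
/* Sketch only. cust_level is a hypothetical one-row-per-customer data set */
/* holding response (Y/N), total_spend (numeric), main_channel (character). */
proc logistic data=cust_level;
   /* CLASS builds the design (dummy) columns internally */
   class main_channel / param=ref;
   /* model the probability that response = 'Y' */
   model response(event='Y') = total_spend main_channel;
run;
```

With PARAM=REF, each channel level is compared against a reference level, which is the same thing hand-made dummy variables would give you.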

--
Paige Miller

Fae · Posted 03-26-2018 02:30 PM

Thanks very much for your help.

To make sure I am not making a mistake: for PROC LOGISTIC, I can't have multiple data records for each customer, so I need to prepare the data as below, right?

Customer1: Response, Predictor variable 1, ....., predictor variable N,

Customer2: Response, Predictor variable 1, ....., predictor variable N,

..

CustomerN: Response, Predictor variable 1, ....., predictor variable N,

The biggest issue I am facing is that a customer can have multiple transactions, and I don't know how to aggregate them to the customer level without losing behavioral information such as channel and product type.

I can easily aggregate the amount spent by summing it, but what should I do with a categorical variable?

Thanks very much.

Reeza · Posted 03-26-2018 03:49 PM

How big is your data set? One option would be to combine the data by category and then create one variable per category.

i.e.

TotSpend_Sports, TotSpend_Food, TotSpend_Kids, TotSpend_Housewares, etc
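A rough sketch of what that could look like with PROC SQL. It assumes a transaction data set named trans (hypothetical name) and a product_group variable (also hypothetical) that already maps each product_type to a broader group; customer_id, channel, and price are from the original question.

```sas
/* Sketch only: one row per customer, one spend column per product group. */
/* trans and product_group are assumed names, not from the original post.  */
proc sql;
   create table cust_spend as
   select customer_id,
          sum(case when product_group = 'Sports' then price else 0 end)
             as TotSpend_Sports,
          sum(case when product_group = 'Food' then price else 0 end)
             as TotSpend_Food,
          sum(price)              as TotSpend_All,
          count(distinct channel) as n_channels   /* channels used */
   from trans
   group by customer_id;
quit;
```

The result can then be joined to the target table on customer_id to get the one-row-per-customer layout PROC LOGISTIC expects.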

PGStats · Posted 03-26-2018 04:12 PM

You could either

1) Generate your predictor matrix with PROC TRANSPOSE. This requires replacing the resulting missing values with zeros.

or

2) Generate your predictor matrix with PROC LOGISTIC using the OUTDESIGNONLY and OUTDESIGN= options. This requires that you then sum the predictor values for each customer.

PG
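Option 1 might be sketched as follows. The data set name trans and the TotSpend_ prefix are assumptions for illustration; the other variable names come from the original question. Option 2 is not shown here.

```sas
/* 1) Total spend per customer and product type (long layout). */
proc summary data=trans nway;
   class customer_id product_type;
   var price;
   output out=spend_long (drop=_type_ _freq_) sum=spend;
run;

/* 2) One row per customer, one column per product type.            */
/*    Column names are built from the product_type values, so any   */
/*    characters not valid in a SAS name are replaced by default.   */
proc transpose data=spend_long out=spend_wide (drop=_name_)
               prefix=TotSpend_;
   by customer_id;
   id product_type;
   var spend;
run;

/* 3) Product types a customer never bought are missing; set to 0. */
data spend_wide;
   set spend_wide;
   array ts {*} TotSpend_:;
   do i = 1 to dim(ts);
      if missing(ts{i}) then ts{i} = 0;
   end;
   drop i;
run;
```

PROC SUMMARY with NWAY writes its output sorted by the CLASS variables, so the BY customer_id step works without a separate PROC SORT.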

StatDave · Posted 03-27-2018 01:51 PM

If you have repeated measurements on your customers, then you can use PROC GENMOD to fit the logistic model (use DIST=BIN option) and the REPEATED statement with your subject variable in the SUBJECT= option. That will fit a GEE model that adjusts for the correlation within subjects. If you have 100+ levels of a categorical predictor in the CLASS statement, this will probably cause model fitting problems. If there is any logical grouping of these levels into a smaller set of levels, then that is more likely to work. You could either use the DATA step with IF THEN ELSE statements to create a new grouping variable based on the old one, or you could use PROC FORMAT to create a format that groups the levels as desired.
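A sketch of the PROC FORMAT grouping combined with the GEE fit, under some loud assumptions: the product type values and group labels below are made up for illustration, and trans_resp is a hypothetical transaction-level data set with each customer's response already merged in.

```sas
/* Hypothetical grouping: replace the values with your real product types. */
proc format;
   value $prodgrp
      'Football', 'Tennis' = 'Sports'
      'Bread', 'Milk'      = 'Food'
      other                = 'Other';
run;

/* GEE logistic model; trans_resp is an assumed data set name. */
proc genmod data=trans_resp;
   /* CLASS levels are defined by the formatted values, so the format */
   /* collapses product_type to the groups without a new variable.    */
   format product_type $prodgrp.;
   class customer_id channel product_type;
   model response(event='Y') = channel product_type price
         / dist=bin link=logit;
   /* exchangeable working correlation within each customer */
   repeated subject=customer_id / type=exch;
run;
```

Using the format inside the procedure avoids rewriting the data; a DATA step with IF THEN ELSE would do the same job by creating an explicit grouping variable.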