Programming the statistical procedures from SAS

Preparing categorical variable for logistic regression

Occasional Contributor Fae
Posts: 14

I am trying to build a logistic regression model for campaign scoring in retail grocery.

I have two types of data:

Target event:               customer_id, response (Y/N)
Customer transaction data:  customer_id, transaction_Date, channel, product_type, price

Here channel and product_type are categorical variables.

Usually I convert categorical variables into dummy variables, but in this case there are 20+ channels and 100+ product types, so I am not sure what to do. Should I run a cluster analysis on the categorical variables before I aggregate them to the customer level, or is there a better way? Please help. Thanks.


Accepted Solutions
Solution
‎05-09-2018 11:31 AM
Super User
Posts: 23,677

Re: Preparing categorical variable for logistic regression

How big is your data set? One option would be to aggregate the transactions by category for each customer and then create one variable per category, for example:

TotSpend_Sports, TotSpend_Food, TotSpend_Kids, TotSpend_Housewares, etc.
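
As a rough illustration of this idea, the sketch below sums spend per customer for a few hand-picked categories with PROC SQL. It assumes a transaction table named work.transactions with the columns listed in the question (customer_id, product_type, price). The table name, the category values 'Sports', 'Food' and 'Kids', and the TotSpend_All column are assumptions made for the example, not details from the thread.

```sas
/* Assumed input: work.transactions(customer_id, product_type, price).   */
/* The category values below are illustrative; substitute real levels.   */
proc sql;
   create table work.cust_spend as
   select customer_id,
          /* spend in each selected category; 0 when the customer bought nothing there */
          sum(case when product_type = 'Sports' then price else 0 end) as TotSpend_Sports,
          sum(case when product_type = 'Food'   then price else 0 end) as TotSpend_Food,
          sum(case when product_type = 'Kids'   then price else 0 end) as TotSpend_Kids,
          /* overall spend across all product types */
          sum(price) as TotSpend_All
   from work.transactions
   group by customer_id;
quit;
```

The result has one row per customer, which can then be joined to the target table on customer_id. Writing one CASE expression per category is only practical for a small number of groups, so with 100+ product types it may make sense to map them to broader groups first.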



All Replies
Respected Advisor
Posts: 2,985

Re: Preparing categorical variable for logistic regression

There's no need to create dummy variables for PROC LOGISTIC. The CLASS statement in PROC LOGISTIC will handle that for you.

If you have 100+ product types, this could be a problem for the analysis. If there is some logical way to group some of these product types together, I would give that a try.
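
A minimal sketch of the CLASS statement in use, assuming a customer-level table work.model_data with one row per customer. The table name and the predictors main_channel (for instance, the customer's most-used channel) and TotSpend_All are hypothetical names chosen for the example.

```sas
/* Assumed input: work.model_data, one row per customer, with                */
/* response ('Y'/'N'), main_channel (character) and TotSpend_All (numeric).  */
proc logistic data=work.model_data;
   /* CLASS builds the design columns for main_channel, so no manual dummies */
   /* are needed. PARAM=REF requests reference (dummy-style) coding; the     */
   /* PROC LOGISTIC default is effect coding.                                */
   class main_channel / param=ref;
   /* EVENT='Y' models the probability that response is 'Y' */
   model response(event='Y') = main_channel TotSpend_All;
run;
```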

--
Paige Miller
Occasional Contributor Fae
Posts: 14

Re: Preparing categorical variable for logistic regression

Posted in reply to PaigeMiller

Thanks very much for your help.

To make sure I am not making a mistake: for PROC LOGISTIC, I can't have multiple data records for each customer, so I need to prepare the data as below, right?

Customer1: Response, Predictor variable 1, ..., Predictor variable N
Customer2: Response, Predictor variable 1, ..., Predictor variable N
...
CustomerN: Response, Predictor variable 1, ..., Predictor variable N

The biggest issue I am facing is that, since a customer can have multiple transactions, I don't know how to aggregate them to the customer level without losing behavioral information such as channel and product type.

I can easily aggregate the amount_spent by summing it, but what should I do with a categorical variable?

Thanks very much.

Esteemed Advisor
Posts: 5,521

Re: Preparing categorical variable for logistic regression

You could either

1) Generate your predictor matrix with PROC TRANSPOSE, which requires replacing missing values with zeros

or

2) Generate your predictor matrix with PROC LOGISTIC using the OUTDESIGNONLY and OUTDESIGN= options, which requires that you then sum the predictor values for each customer.
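
Option 1 could be sketched roughly as follows. This assumes a transaction table work.transactions with customer_id, product_type and price. The table names, the tot_spend variable and the TotSpend_ prefix are choices made for the example. The same pattern would apply to channel.

```sas
/* Assumed input: work.transactions(customer_id, product_type, price) */

/* Step 1: total spend per customer and product type.                 */
/* NWAY keeps only the customer_id*product_type combinations, and the */
/* output is ordered by the CLASS variables.                          */
proc summary data=work.transactions nway;
   class customer_id product_type;
   var price;
   output out=work.spend_long (drop=_type_ _freq_) sum=tot_spend;
run;

/* Step 2: one row per customer, one column per product type */
proc transpose data=work.spend_long
               out=work.spend_wide (drop=_name_)
               prefix=TotSpend_;
   by customer_id;
   id product_type;
   var tot_spend;
run;

/* Step 3: categories a customer never bought come out missing; set them to 0 */
data work.spend_wide;
   set work.spend_wide;
   array spend {*} TotSpend_:;
   do i = 1 to dim(spend);
      if missing(spend{i}) then spend{i} = 0;
   end;
   drop i;
run;
```

With the default VALIDVARNAME=V7 setting, product_type values containing spaces or other characters that are not valid in SAS names have those characters replaced with underscores when they become column names, so it is worth checking the resulting names.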

PG
SAS Employee
Posts: 384

Re: Preparing categorical variable for logistic regression

If you have repeated measurements on your customers, then you can use PROC GENMOD to fit the logistic model (use DIST=BIN option) and the REPEATED statement with your subject variable in the SUBJECT= option. That will fit a GEE model that adjusts for the correlation within subjects.

If you have 100+ levels of a categorical predictor in the CLASS statement, this will probably cause model fitting problems. If there is any logical grouping of these levels into a smaller set of levels, then that is more likely to work. You could either use the DATA step with IF-THEN/ELSE statements to create a new grouping variable based on the old one, or you could use PROC FORMAT to create a format that groups the levels as desired.
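
A rough sketch combining the two suggestions above: a PROC FORMAT grouping applied to the CLASS variable, and a GEE fit with PROC GENMOD. It assumes a transaction-level table work.txn_target that already carries each customer's response. The table name, the format name $prodgrp, the product levels and groups, and the exchangeable working correlation (TYPE=EXCH) are all assumptions made for illustration.

```sas
/* Hypothetical grouping of product_type levels into a few broader groups. */
/* The level names are illustrative only.                                  */
proc format;
   value $prodgrp
      'Apples', 'Bananas' = 'Produce'
      'Milk', 'Cheese'    = 'Dairy'
      other               = 'Other';
run;

/* Assumed input: work.txn_target, one row per transaction, with           */
/* customer_id, response ('Y'/'N'), product_type and price.                */
/* DESCENDING makes the model predict response = 'Y' rather than 'N'.      */
proc genmod data=work.txn_target descending;
   /* the subject variable must also appear in the CLASS statement */
   class customer_id product_type;
   /* CLASS levels are formed from formatted values, so the format groups them */
   format product_type $prodgrp.;
   model response = product_type price / dist=bin link=logit;
   /* GEE with an exchangeable working correlation within each customer */
   repeated subject=customer_id / type=exch;
run;
```

Because the format is applied only inside the procedure, the stored product_type values are unchanged and the grouping can be revised without rebuilding the data.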

☑ This topic is solved.
