Fae
Obsidian | Level 7

I am trying to build a logistic regression model for campaign scoring. This is for retail grocery.

I have two types of data:

Target event: customer_id, response (Y/N)

Customer transaction data: customer_id, transaction_Date, channel, product_type, price

Here channel and product_type are categorical variables.

Usually I convert categorical variables into dummy variables, but in this case there are 20+ channels and 100+ product types, so I am not sure what to do. Should I run a cluster analysis on the categorical variables before I aggregate them to the customer level, or is there a better way? Please help. Thanks.

 

1 ACCEPTED SOLUTION

Reeza
Super User

How big is your data set? One option would be to combine the data by category and then create one variable per category, e.g.

TotSpend_Sports, TotSpend_Food, TotSpend_Kids, TotSpend_Housewares, etc.
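A minimal sketch of how those per-category totals might be built, assuming the transactions sit in a data set called work.transactions and that a coarser grouping variable product_group has already been derived from product_type (both names are assumptions, as are the output data set names). The group values also need to be usable as part of a SAS variable name.

```sas
/* Assumed input: work.transactions, one row per purchase, with
   customer_id, product_group (hypothetical coarse grouping) and price. */

/* Step 1: total spend per customer and product group (long form).
   NWAY keeps only the customer_id*product_group rows, and the output
   comes back sorted by the CLASS variables. */
proc summary data=work.transactions nway;
    class customer_id product_group;
    var price;
    output out=work.spend_long (drop=_type_ _freq_) sum=tot_spend;
run;

/* Step 2: one row per customer, one TotSpend_<group> column per group. */
proc transpose data=work.spend_long
               out=work.cust_spend (drop=_name_)
               prefix=TotSpend_;
    by customer_id;
    id product_group;
    var tot_spend;
run;
```

Customers with no purchases in a group end up with a missing value in that column, so those would still need to be set to zero before modelling.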


5 REPLIES
PaigeMiller
Diamond | Level 26

There's no need to create dummy variables for PROC LOGISTIC. The CLASS statement in PROC LOGISTIC will handle that for you.

 

If you have 100+ product types, this could be a problem for the analysis. If there is some logical way to group some of these product types together, I would give that a try.
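As a rough illustration of the CLASS statement, assuming a customer-level data set work.customers with a character response coded Y/N, a grouped categorical predictor main_channel and a numeric predictor total_spend (all of these names are made up for the example):

```sas
/* Assumed input: work.customers, one row per customer.
   main_channel and total_spend are hypothetical predictors. */
proc logistic data=work.customers;
    /* CLASS builds the design columns; PARAM=REF requests
       reference-cell (dummy) coding instead of the default effect coding. */
    class main_channel / param=ref;
    /* EVENT='Y' models the probability of a response. */
    model response(event='Y') = main_channel total_spend;
run;
```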

--
Paige Miller
Fae
Obsidian | Level 7

Thanks very much for your help.

To make sure I am not making a mistake: for PROC LOGISTIC, I can't have multiple data records for each customer, so I need to prepare the data as below, right?

Customer1: Response, Predictor variable 1, ....., Predictor variable N

Customer2: Response, Predictor variable 1, ....., Predictor variable N

..

CustomerN: Response, Predictor variable 1, ....., Predictor variable N

The biggest issue I am facing is that, since a customer can have multiple transactions, I don't know how to aggregate them to the customer level without losing behavior information such as channel and product type.

I can easily aggregate amount_spent by summing it, but what should I do with a categorical variable?

Thanks very much.

PGStats
Opal | Level 21

You could either

1) Generate your predictor matrix with PROC TRANSPOSE, which requires replacing missing values with zeros,

or

2) Generate your predictor matrix with PROC LOGISTIC using the OUTDESIGNONLY and OUTDESIGN= options, which requires that you sum up the predictor values for each customer.
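For option 1, the zero-fill step could look something like the sketch below. It assumes the transposed data set is called work.cust_spend and that the transposed columns share the prefix TotSpend_ (both are assumptions borrowed from the naming used earlier in the thread).

```sas
/* Assumed input: work.cust_spend, one row per customer, with
   numeric columns whose names start with TotSpend_. */
data work.cust_spend_filled;
    set work.cust_spend;
    /* The name-prefix list picks up every TotSpend_ column. */
    array spend {*} TotSpend_: ;
    do i = 1 to dim(spend);
        /* No purchases in a category means zero spend, not unknown. */
        if missing(spend{i}) then spend{i} = 0;
    end;
    drop i;
run;
```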

PG
StatDave
SAS Super FREQ

If you have repeated measurements on your customers, then you can use PROC GENMOD to fit the logistic model (use DIST=BIN option) and the REPEATED statement with your subject variable in the SUBJECT= option. That will fit a GEE model that adjusts for the correlation within subjects. If you have 100+ levels of a categorical predictor in the CLASS statement, this will probably cause model fitting problems. If there is any logical grouping of these levels into a smaller set of levels, then that is more likely to work. You could either use the DATA step with IF THEN ELSE statements to create a new grouping variable based on the old one, or you could use PROC FORMAT to create a format that groups the levels as desired.
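A sketch of both ideas together, under several assumptions: the transactions have already been merged with the response into a data set called work.trans_resp, and the product_type values shown in the format are invented placeholders for whatever logical grouping makes sense.

```sas
/* Hypothetical grouping of product_type levels; the level names
   on the left are placeholders, not values from the real data. */
proc format;
    value $prodgrp
        'Milk', 'Cheese', 'Yogurt' = 'Dairy'
        'Apples', 'Lettuce'        = 'Produce'
        other                      = 'Other';
run;

/* Assumed input: work.trans_resp, one row per transaction, with
   customer_id, response, channel, product_type and price. */
proc genmod data=work.trans_resp descending;
    /* CLASS levels are taken from the formatted values, so the
       FORMAT statement collapses product_type to the groups above. */
    class customer_id channel product_type;
    format product_type $prodgrp.;
    /* DESCENDING models the higher-ordered response level ('Y'). */
    model response = channel product_type price / dist=bin link=logit;
    /* GEE with an exchangeable working correlation within customer. */
    repeated subject=customer_id / type=exch;
run;
```

The exchangeable structure (TYPE=EXCH) is only one possible choice of working correlation; it is used here for illustration.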
