sameer112217
Quartz | Level 8

Suppose we have a categorical variable, zip code, with so many levels that it is affecting the model. What is the most appropriate method to reduce the number of zip code levels in a logistic regression model?

 

Use greencase method?

Dummy variables?

Make it continuous by reducing levels?

Frequency method?

 

How would you deal with it? Is the greencase method appropriate for reducing the levels of zip code?

 

Sameer

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

1. Combine into larger spatial areas that make sense geographically

2. Combine based on similar measurements of other variables -> perhaps via a cluster mechanism. 

 

1 is easier and you maintain interpretability of your model. It can be revisited in a later revision of the model.


8 REPLIES 8
Reeza
Super User

What is the Greencase method? Google shows nothing....

 

Combine into spatially larger regions. Maybe counties?

ballardw
Super User

Depending on my project, I would likely try to identify contiguous Zip codes with similar characteristics pertinent to the dependent variable and recode them. For example, I would not combine Zips that are predominantly rural with low population density with urban or suburban ones. If practical, you might look to replace them with Metropolitan Statistical Areas or similar.

 

Note that Proc Logistic handles categorical variables by creating internal dummy variables for the levels of the variable (minus one).

 

In no way should Zip codes EVER be allowed to be treated as continuous.

sameer112217
Quartz | Level 8

Cluster by using Greenacre's method.
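
For readers unfamiliar with the idea, here is a minimal sketch of how level collapsing in the spirit of Greenacre's method is often approximated in SAS: compute the event proportion per zip code, then run Ward clustering on those proportions weighted by level frequency. Everything specific here is an assumption for illustration: the simulated `work.customers` data, the variable names `zip` and `response`, and the fixed choice of 5 clusters (in practice the number of clusters would be chosen from the data, not hard-coded).

```sas
/* Simulated stand-in data: 30 hypothetical zip codes, binary response */
data work.customers;
   call streaminit(2024);
   length zip $5;
   do i = 1 to 3000;
      k = ceil(30 * rand('uniform'));
      zip = put(27500 + k, z5.);
      response = rand('bernoulli', 0.1 + 0.02 * mod(k, 5));
      output;
   end;
   drop i k;
run;

/* One row per zip: event proportion (prop) and level count (_FREQ_) */
proc means data=work.customers noprint nway;
   class zip;
   var response;
   output out=work.levels mean=prop;
run;

/* Ward clustering of the proportions, weighted by level frequency */
proc cluster data=work.levels method=ward outtree=work.tree noprint;
   freq _freq_;
   var prop;
   id zip;
run;

/* Cut the tree at an assumed 5 clusters; work.zipclus maps zip -> CLUSTER */
proc tree data=work.tree nclusters=5 out=work.zipclus noprint;
   id zip;
run;
```

The resulting `work.zipclus` table could then be merged back onto the modeling data, or turned into a format, so that the cluster number replaces zip code in the CLASS statement.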

 

Also, I never said we can make it continuous, only that we could reduce it by giving numbers to the levels.

 

Thanks, everyone.

ballardw
Super User

@sameer112217 wrote:

Also, I never said we can make it continuous, only that we could reduce it by giving numbers to the levels.

 


From your original post:

"Make it continuous by reducing levels?"

Ksharp
Super User

Does every level have enough observations for the model?

If so, you could try PROC HPGENSELECT to pick out the most significant levels.
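
A rough sketch of what that could look like follows. The data set `work.customers` and the variables `zip`, `income`, and `response` are simulated placeholders, not anything from the original question. The `SPLIT` option on the CLASS statement is what lets individual level columns enter or leave the model separately; check that it is available for PROC HPGENSELECT in your SAS release before relying on it.

```sas
/* Simulated stand-in data: hypothetical zip, income, and binary response */
data work.customers;
   call streaminit(2024);
   length zip $5;
   do i = 1 to 3000;
      k = ceil(30 * rand('uniform'));
      zip = put(27500 + k, z5.);
      income = 50 + 10 * rand('normal');
      response = rand('bernoulli', 0.1 + 0.02 * mod(k, 5));
      output;
   end;
   drop i k;
run;

proc hpgenselect data=work.customers;
   /* SPLIT: each zip level's design column is selected on its own */
   class zip / split;
   model response(event='1') = zip income / dist=binary;
   /* Stepwise selection keeps only the level columns that earn their place */
   selection method=stepwise;
run;
```

Levels that are never selected are effectively pooled into the baseline, which is one way of reducing the number of distinct zip code effects.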

sameer112217
Quartz | Level 8

Yes. For example, suppose we have zip codes for a city like Mumbai, where the Bandra region starts from 400064. Mumbai is in the state of Maharashtra. We could code it as 1 for Mumbai city irrespective of region, something like that. What is the best way in the real corporate world to reduce the levels in a regression?

Reeza
Super User

1. Combine into larger spatial areas that make sense geographically

2. Combine based on similar measurements of other variables -> perhaps via a cluster mechanism. 

 

1 is easier and you maintain interpretability of your model. It can be revisited in a later revision of the model.

ballardw
Super User

One way would be to create custom formats to group your codes. That way you need not actually change data and you could have multiple formats as needed. Proc Logistic will honor the formatted values to create groups. Here is a simplistic example modified from SAS online documentation to illustrate:

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P  F  68   1  No   B  M  74  16  No  P  F  67  30  No
P  M  66  26  Yes  B  F  67  28  No  B  F  77  16  No
A  F  71  12  No   B  F  72  50  No  B  F  76   9  Yes
A  M  71  17  Yes  A  F  63  27  No  A  F  69  18  Yes
B  F  66  12  No   A  M  62  42  No  P  F  64   1  Yes
A  F  64  17  No   P  M  74   4  No  A  F  72  25  No
P  M  70   1  Yes  B  M  66  19  No  B  M  59  29  No
A  F  64  30  No   A  M  70  28  No  A  M  69   1  No
B  F  78   1  No   P  M  83   1  Yes B  F  69  42  No
B  M  75  30  Yes  P  M  77  29  Yes P  F  79  20  Yes
A  M  70  12  No   A  F  69  12  No  B  F  65  14  No
B  M  70   1  No   B  M  67  23  No  A  M  76  25  Yes
P  M  78  12  Yes  B  M  77   1  Yes B  F  69  24  No
P  M  66   4  Yes  P  F  65  29  No  P  M  60  26  Yes
A  M  78  15  Yes  B  M  75  21  Yes A  F  67  11  No
P  F  72  27  No   P  F  70  13  Yes A  M  75   6  Yes
B  F  65   7  No   P  F  68  27  Yes P  M  68  11  Yes
P  M  67  17  Yes  B  M  70  22  No  A  M  65  15  No
P  F  67   1  Yes  A  M  67  10  No  P  F  72  11  Yes
A  F  74   1  No   B  M  80  21  Yes A  F  69   3  No
;
run;

/* Group treatments P and A into one level; B has no mapping and keeps its own value */
proc format library=work;
   value $alttreat
      "P","A" = 'Alt'
   ;
run;
proc logistic data=Neuralgia;
   Title 'Original Treatment values';
   class Treatment Sex;
   model Pain= Treatment Sex Treatment*Sex Age Duration / expb;
run;

proc logistic data=Neuralgia;
   title "Formatted treatment values";
   class Treatment Sex;
   model Pain= Treatment Sex Treatment*Sex Age Duration / expb;
   format treatment $altTreat.;
run; title;




In the case of cities that may have multiple codes, it would likely be relatively easy to use a reference data set to create one format that maps codes to city and another that maps them to province/state or similar.

 

I do this for my data with a data source that only has a postal code, to get either city or county (a sub-region of states within the USA).
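
As an illustration of building such a format from a reference data set, here is a minimal sketch using the CNTLIN= option of Proc Format. The reference table `work.zipref`, its four sample rows, and the format name `$zipcty` are all made up for the example; a real lookup table would come from your own postal reference source.

```sas
/* Hypothetical reference table: one row per postal code with its county */
data work.zipref;
   input zip $ county :$20.;
   datalines;
27511 Wake
27513 Wake
27701 Durham
27703 Durham
;
run;

/* CNTLIN data set: FMTNAME, TYPE, START, LABEL, HLO are the names Proc Format expects */
data work.zipfmt;
   set work.zipref end=last;
   retain fmtname 'zipcty' type 'C';
   length start $5 label $20 hlo $1;
   start = zip;
   label = county;
   hlo = ' ';
   output;
   if last then do;
      /* Catch-all row: any code not in the reference table maps to 'Other' */
      start = ' ';
      label = 'Other';
      hlo = 'O';
      output;
   end;
   keep fmtname type start label hlo;
run;

proc format library=work cntlin=work.zipfmt;
run;

/* Quick check that the format groups codes as intended */
data _null_;
   length z $5 c $20;
   do z = '27511', '27703', '99999';
      c = put(z, $zipcty.);
      put z= c=;
   end;
run;
```

With the format in place, adding `format zip $zipcty.;` to the Proc Logistic step (with `zip` in the CLASS statement) would group the codes by county without changing the stored data.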
