Programming the statistical procedures from SAS

Removing levels in the logistic regression model

Contributor
Posts: 24

Suppose we have a categorical variable, zip code, with so many levels that it is hurting the model. What is the most appropriate method to reduce the number of zip code levels in a logistic regression model?

 

Use greencase method?

Dummy variables?

Make it continuous by reducing levels?

Frequency method?

 

How would you deal with it? Is the greencase method appropriate for reducing the levels of zip code?

 

Sameer


Accepted Solutions
Solution
‎12-16-2016 12:56 AM
Grand Advisor
Posts: 16,880

Re: Removing levels in the logistic regression model

1. Combine into larger spatial areas that make sense geographically

2. Combine based on similar measurements of other variables -> perhaps via a cluster mechanism. 

 

Option 1 is easier, and you maintain the interpretability of your model. It can be revisited in a later revision of the model.


All Replies
Grand Advisor
Posts: 16,880

Re: Removing levels in the logistic regression model

What is the Greencase method? Google shows nothing....

 

Combine into spatially larger regions. Maybe counties?

Grand Advisor
Posts: 10,043

Re: Removing levels in the logistic regression model

Depending on the project, I would likely try to identify zip codes that are contiguous and have similar characteristics pertinent to the dependent variables, and recode them together. I would not combine zips that are predominantly rural with low population density with urban or suburban ones, for example. If practical, you might replace zip codes with Metropolitan Statistical Areas or similar.

 

Note that PROC LOGISTIC handles categorical variables by creating internal dummy variables for the levels of the variable (the number of levels minus one).

 

In no way should Zip codes EVER be allowed to be treated as continuous.
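
To illustrate the two points above, here is a minimal sketch. It assumes a made-up data set work.customers with a character zip variable and a 0/1 response; none of these names come from the thread. Keeping zip as character and listing it in the CLASS statement means it can only enter the model as a set of design columns, never as a number.

```sas
/* Hypothetical data: 2,000 customers spread over 40 zip codes */
data work.customers;
   call streaminit(2016);
   do id = 1 to 2000;
      /* zip is stored as character so it cannot be used as a numeric predictor */
      zip = put(400001 + floor(40 * rand('uniform')), 6.);
      p = ifn(zip < '400021', 0.20, 0.35);   /* arbitrary response rates */
      response = rand('bernoulli', p);
      output;
   end;
   drop p;
run;

proc logistic data=work.customers;
   /* reference coding: one design column per level, minus the reference level */
   class zip / param=ref ref=first;
   model response(event='1') = zip;
run;
```

The Class Level Information table in the output shows how many design columns were created, which makes the cost of an unreduced zip variable easy to see.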

Contributor
Posts: 24

Re: Removing levels in the logistic regression model

Cluster by using Greenacre's method.

 

Also, I never said we can make it continuous, only that we could reduce it by giving numbers to the levels.

 

thanks everyone..
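
For readers who have not met it, the approach usually taught under the name Greenacre's method in SAS material can be sketched roughly as follows. I am assuming that is what the poster means: compute the response rate per level, then cluster the levels on that rate with Ward's method, weighted by level frequency. The data set, variable names, and the choice of 4 clusters are all hypothetical placeholders; in practice the number of clusters would be chosen by a criterion such as the loss in chi-square, which this sketch does not compute.

```sas
/* Hypothetical data: 2,000 customers spread over 40 zip codes */
data work.customers;
   call streaminit(2016);
   do id = 1 to 2000;
      zip = put(400001 + floor(40 * rand('uniform')), 6.);
      p = ifn(zip < '400021', 0.20, 0.35);   /* arbitrary response rates */
      response = rand('bernoulli', p);
      output;
   end;
   drop p;
run;

/* Step 1: response rate (mean of 0/1) and count (_FREQ_) per zip code */
proc means data=work.customers noprint nway;
   class zip;
   var response;
   output out=work.zip_rates mean=rate;
run;

/* Step 2: Ward clustering of the rates, weighted by level frequency */
proc cluster data=work.zip_rates method=ward outtree=work.zip_tree noprint;
   freq _freq_;
   var rate;
   id zip;
run;

/* Step 3: cut the tree; 4 clusters is an arbitrary placeholder */
proc tree data=work.zip_tree nclusters=4 out=work.zip_groups noprint;
   id zip;
run;
```

The work.zip_groups data set then holds one row per zip code with its CLUSTER number, which can be merged back onto the modeling data or turned into a format.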

Grand Advisor
Posts: 10,043

Re: Removing levels in the logistic regression model


sameer112217 wrote:

Also, I never said we can make it continuous, only that we could reduce it by giving numbers to the levels.

 


From your original post:

"Make it continuous by reducing levels?"

Grand Advisor
Posts: 9,451

Re: Removing levels in the logistic regression model

Does every level have enough observations for the model?

If so, you could try PROC HPGENSELECT to pick out the most significant levels.
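
A minimal sketch of that idea, under stated assumptions: the data set and variable names are hypothetical, and the per-variable SPLIT option is my reading of the CLASS statement documentation for the high-performance procedures, so check it against your release. With SPLIT, each level's design column can enter or leave the model on its own, so the levels that are not selected are effectively pooled together.

```sas
/* Hypothetical data: 2,000 customers spread over 40 zip codes */
data work.customers;
   call streaminit(2016);
   do id = 1 to 2000;
      zip = put(400001 + floor(40 * rand('uniform')), 6.);
      p = ifn(zip < '400021', 0.20, 0.35);   /* arbitrary response rates */
      response = rand('bernoulli', p);
      output;
   end;
   drop p;
run;

proc hpgenselect data=work.customers;
   /* SPLIT (assumed available): select zip levels individually */
   class zip(split);
   model response(event='1') = zip / distribution=binary;
   selection method=stepwise;
run;
```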

Contributor
Posts: 24

Re: Removing levels in the logistic regression model

Yes. For example, suppose we have a zip code for a city like Mumbai, in the Bandra region, which starts from 400064. Mumbai is in the state of Maharashtra. We could code it as 1 for Mumbai city irrespective of region, or something like that. What is the best way, in the real corporate world, to reduce the levels in a regression?

Grand Advisor
Posts: 10,043

Re: Removing levels in the logistic regression model

One way would be to create custom formats to group your codes. That way you need not actually change data and you could have multiple formats as needed. Proc Logistic will honor the formatted values to create groups. Here is a simplistic example modified from SAS online documentation to illustrate:

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P  F  68   1  No   B  M  74  16  No  P  F  67  30  No
P  M  66  26  Yes  B  F  67  28  No  B  F  77  16  No
A  F  71  12  No   B  F  72  50  No  B  F  76   9  Yes
A  M  71  17  Yes  A  F  63  27  No  A  F  69  18  Yes
B  F  66  12  No   A  M  62  42  No  P  F  64   1  Yes
A  F  64  17  No   P  M  74   4  No  A  F  72  25  No
P  M  70   1  Yes  B  M  66  19  No  B  M  59  29  No
A  F  64  30  No   A  M  70  28  No  A  M  69   1  No
B  F  78   1  No   P  M  83   1  Yes B  F  69  42  No
B  M  75  30  Yes  P  M  77  29  Yes P  F  79  20  Yes
A  M  70  12  No   A  F  69  12  No  B  F  65  14  No
B  M  70   1  No   B  M  67  23  No  A  M  76  25  Yes
P  M  78  12  Yes  B  M  77   1  Yes B  F  69  24  No
P  M  66   4  Yes  P  F  65  29  No  P  M  60  26  Yes
A  M  78  15  Yes  B  M  75  21  Yes A  F  67  11  No
P  F  72  27  No   P  F  70  13  Yes A  M  75   6  Yes
B  F  65   7  No   P  F  68  27  Yes P  M  68  11  Yes
P  M  67  17  Yes  B  M  70  22  No  A  M  65  15  No
P  F  67   1  Yes  A  M  67  10  No  P  F  72  11  Yes
A  F  74   1  No   B  M  80  21  Yes A  F  69   3  No
;
run;

/* Group treatments P and A into a single level 'Alt'; B is left unformatted */
proc format library=work;
value $alttreat
"P","A" = 'Alt'
;
run;
proc logistic data=Neuralgia;
   Title 'Original Treatment values';
   class Treatment Sex;
   model Pain= Treatment Sex Treatment*Sex Age Duration / expb;
run;

proc logistic data=Neuralgia;
   title "Formatted treatment values";
   class Treatment Sex;
   model Pain= Treatment Sex Treatment*Sex Age Duration / expb;
   format treatment $altTreat.;
run; title;




In the case of cities that may have multiple codes, it would likely be relatively easy to use a reference data set to create one format that maps codes to city, and another that maps them to province/state or similar.

 

I do this for my data with a data source that only has a postal code, to get either city or county (a sub-region of states within the USA).
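
One way to build such a format from a reference data set is the CNTLIN= option of PROC FORMAT. The sketch below is only an illustration: the reference table, the code-to-city pairs, and the format name $zipcity are made up and not verified against real postal data.

```sas
/* Hypothetical reference table: one row per postal code (pairs are illustrative) */
data work.zip_ref;
   length zip $6 city $20;
   input zip $ city $;
   datalines;
400050 Mumbai
400064 Mumbai
411001 Pune
411004 Pune
;
run;

/* Turn the reference table into a control data set for PROC FORMAT */
data work.zip_fmt;
   set work.zip_ref end=last;
   retain fmtname '$zipcity' type 'C';
   start = zip;
   label = city;
   output;
   if last then do;
      /* catch-all row: any code not in the reference table becomes 'Other' */
      hlo   = 'O';
      start = ' ';
      label = 'Other';
      output;
   end;
   keep fmtname type start label hlo;
run;

proc format cntlin=work.zip_fmt library=work;
run;
```

Adding format zip $zipcity.; to the PROC LOGISTIC step would then group the zip codes by city, in the same way the $altTreat. format groups treatments above. A second control data set built from a state or province column would give a coarser grouping without touching the data.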
