Solved: Removing levels in the logistic regression model

sameer112217 · Posted 12-14-2016 08:29 AM

Suppose we have a categorical variable, ZIP code, with so many levels that it is affecting the model. What is the most appropriate method to reduce the number of ZIP code levels in a logistic regression model?

Use greencase method?

Dummy variables?

Make it continuous by reducing levels?

Frequency method?

How would you deal with it? Is the greencase method appropriate for reducing the levels of ZIP code?

Sameer

Reeza · Posted 12-14-2016 08:50 AM

What is the Greencase method? Google shows nothing....

Combine into spatially larger regions. Maybe counties?

ballardw · Posted 12-14-2016 10:26 AM

Depending on my project, I would likely try to identify ZIP codes that are contiguous and have similar characteristics pertinent to the dependent variables, and recode them. I would not combine ZIPs that are predominantly rural with low population density with urban or suburban ones, for example. If practical, you might replace them with Metropolitan Statistical Areas or similar.

Note that PROC LOGISTIC handles categorical variables by creating internal dummy variables for the levels of the variable (minus one).
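To see that coding in practice, here is a minimal sketch. It assumes the SASHELP.HEART sample data set is available and uses it only as a stand-in (it is not the poster's data). The "Class Level Information" table in the output lists the design variables created for each CLASS variable.

```sas
/* Sketch only: SASHELP.HEART stands in for the poster's data.      */
proc logistic data=sashelp.heart;
   /* PARAM=REF requests reference-cell (0/1 dummy) coding; the     */
   /* default for CLASS in PROC LOGISTIC is effect coding. Either   */
   /* way a variable with k levels yields k-1 design columns.       */
   class Sex Smoking_Status / param=ref;
   model Status(event='Dead') = Sex Smoking_Status AgeAtStart;
run;
```

With thousands of ZIP codes in the CLASS statement, the same mechanism would create thousands of design columns, which is why grouping the codes first matters.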

In no way should Zip codes EVER be allowed to be treated as continuous.

sameer112217 · Posted 12-14-2016 12:00 PM

Cluster by using Greenacre's method.
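One common way to approximate this in SAS is to cluster the levels with Ward's method on the event proportion of each level, weighted by the level frequency. The sketch below is only an illustration of that idea under stated assumptions: WORK.CUSTOMERS, the 20 made-up codes, the 0/1 target RESPONDED, and the choice of 5 clusters are all hypothetical.

```sas
/* Hypothetical data: 20 made-up codes and a 0/1 response.          */
data work.customers;
   call streaminit(2016);
   length zip $6;
   do i = 1 to 5000;
      k = floor(20 * rand('uniform'));
      zip = put(400001 + k, 6.);
      responded = rand('bernoulli', 0.10 + 0.05 * mod(k, 5));
      output;
   end;
   drop i k;
run;

/* One row per code: event proportion (prop) and count (_FREQ_).    */
proc means data=work.customers noprint nway;
   class zip;
   var responded;
   output out=work.zip_levels mean=prop;
run;

/* Ward clustering of the codes on their event proportion,          */
/* weighted by how many observations each code has.                 */
proc cluster data=work.zip_levels method=ward
             outtree=work.zip_tree noprint;
   freq _freq_;
   var prop;
   id zip;
run;

/* Cut the tree at an assumed 5 groups; CLUSTER is the new level.   */
proc tree data=work.zip_tree nclusters=5 out=work.zip_groups noprint;
   id zip;
run;
```

WORK.ZIP_GROUPS can then be merged back by zip so that CLUSTER replaces the raw code in the CLASS statement. How many clusters to keep is a modelling decision that this sketch does not address.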

Also, I never said we can make it continuous, only reduce it by giving numbers to the levels.

Thanks, everyone.

ballardw · Posted 12-14-2016 12:41 PM

@sameer112217 wrote:

Also, I never said we can make it continuous, only reduce it by giving numbers to the levels.

From your original post:

"Make it continuous by reducing levels?"

Ksharp · Posted 12-15-2016 03:20 AM

Does every level have enough observations for the model?

If so, you could try PROC HPGENSELECT to pick out the most significant levels.
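A minimal sketch of what that could look like follows. The data set WORK.CUSTOMERS, the made-up codes, and the variable names are hypothetical, and the SPLIT option plus stepwise selection with CHOOSE=SBC is one possible setup, not necessarily the one intended above.

```sas
/* Hypothetical data: 20 made-up codes and a 0/1 response.          */
data work.customers;
   call streaminit(2016);
   length zip $6;
   do i = 1 to 5000;
      k = floor(20 * rand('uniform'));
      zip = put(400001 + k, 6.);
      responded = rand('bernoulli', 0.10 + 0.05 * mod(k, 5));
      output;
   end;
   drop i k;
run;

proc hpgenselect data=work.customers;
   /* SPLIT lets each level's design column enter or leave the      */
   /* model on its own instead of the whole variable at once.       */
   class zip(split);
   model responded(event='1') = zip / distribution=binary;
   selection method=stepwise(choose=sbc);
run;
```

Levels that are never selected are effectively pooled into the baseline, so the result should still be checked for whether that pooling makes sense geographically.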

sameer112217 · Posted 12-15-2016 08:56 AM

Yes. For example, suppose we have ZIP codes for a city like Mumbai, where the Bandra region starts from 400064. Mumbai is in the state of Maharashtra. We could code every ZIP as 1 for Mumbai city irrespective of region, or something like that. What is the best way in the real corporate world to reduce the levels in a regression?

Reeza · Posted 12-15-2016 08:59 AM

1. Combine into larger spatial areas that make sense geographically.

2. Combine based on similar measurements of other variables, perhaps via a cluster mechanism.

Option 1 is easier and you maintain the interpretability of your model. It can be revisited in a later revision of the model.

ballardw · Posted 12-15-2016 11:18 AM

One way would be to create custom formats to group your codes. That way you need not actually change data and you could have multiple formats as needed. Proc Logistic will honor the formatted values to create groups. Here is a simplistic example modified from SAS online documentation to illustrate:

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P  F  68   1  No   B  M  74  16  No  P  F  67  30  No
P  M  66  26  Yes  B  F  67  28  No  B  F  77  16  No
A  F  71  12  No   B  F  72  50  No  B  F  76   9  Yes
A  M  71  17  Yes  A  F  63  27  No  A  F  69  18  Yes
B  F  66  12  No   A  M  62  42  No  P  F  64   1  Yes
A  F  64  17  No   P  M  74   4  No  A  F  72  25  No
P  M  70   1  Yes  B  M  66  19  No  B  M  59  29  No
A  F  64  30  No   A  M  70  28  No  A  M  69   1  No
B  F  78   1  No   P  M  83   1  Yes B  F  69  42  No
B  M  75  30  Yes  P  M  77  29  Yes P  F  79  20  Yes
A  M  70  12  No   A  F  69  12  No  B  F  65  14  No
B  M  70   1  No   B  M  67  23  No  A  M  76  25  Yes
P  M  78  12  Yes  B  M  77   1  Yes B  F  69  24  No
P  M  66   4  Yes  P  F  65  29  No  P  M  60  26  Yes
A  M  78  15  Yes  B  M  75  21  Yes A  F  67  11  No
P  F  72  27  No   P  F  70  13  Yes A  M  75   6  Yes
B  F  65   7  No   P  F  68  27  Yes P  M  68  11  Yes
P  M  67  17  Yes  B  M  70  22  No  A  M  65  15  No
P  F  67   1  Yes  A  M  67  10  No  P  F  72  11  Yes
A  F  74   1  No   B  M  80  21  Yes A  F  69   3  No
;
run;

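/* Collapse treatments P and A into one level; B keeps its own value */
proc format library=work;
   value $alttreat
      "P","A" = 'Alt'
   ;
run;

/* Baseline: Treatment enters with its three original levels */
proc logistic data=Neuralgia;
   title 'Original Treatment values';
   class Treatment Sex;
   model Pain = Treatment Sex Treatment*Sex Age Duration / expb;
run;

/* Same model; the FORMAT statement makes CLASS see two levels (Alt, B) */
proc logistic data=Neuralgia;
   title 'Formatted Treatment values';
   class Treatment Sex;
   model Pain = Treatment Sex Treatment*Sex Age Duration / expb;
   format Treatment $alttreat.;
run;
title;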

In the case of cities that may have multiple codes, it would likely be relatively easy to use a reference data set to create one format that maps codes to city and another that maps them to province/state or similar.

I do this for my data when a data source has only a postal code, to get either city or county (a sub-region of states within the USA).
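As a rough sketch of building such a format from a reference data set, the example below assumes the US lookup table SASHELP.ZIPCODE is installed (with numeric ZIP plus COUNTYNM and STATECODE). The format name zip2county and the North Carolina filter are illustrative choices only; for non-US codes you would need your own reference table.

```sas
/* Build a CNTLIN control data set: one row per ZIP, label = county. */
data work.zip_cntlin;
   set sashelp.zipcode (keep=zip countynm statecode);
   retain fmtname 'zip2county' type 'N';
   start = zip;
   label = catx(', ', countynm, statecode);
   keep fmtname type start label;
run;

/* Create the numeric format ZIP2COUNTY. in the WORK catalog.        */
proc format cntlin=work.zip_cntlin;
run;

/* Check the grouping: PROC FREQ counts ZIPs by formatted value.     */
proc freq data=sashelp.zipcode (where=(statecode='NC'));
   tables zip;
   format zip zip2county.;
run;
```

Attaching the same FORMAT statement to a numeric ZIP variable in PROC LOGISTIC would group the CLASS levels by county without changing the stored data.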
