Suppose we have a categorical variable, zip code, that has too many levels, and this is hurting the model. What is the most appropriate method to reduce the number of zip code levels in a logistic regression model?
Use greencase method?
Dummy variables?
Make it continuous by reducing levels?
Frequency method?
How would anyone deal with it? Is the greencase method appropriate for reducing the levels of zip code?
Sameer
1. Combine into larger spatial areas that make sense geographically
2. Combine based on similar measurements of other variables -> perhaps via a cluster mechanism.
Option 1 is easier, and you maintain the interpretability of your model. It can be revisited in a later revision of the model.
What is the Greencase method? Google shows nothing....
Combine into spatially larger regions. Maybe counties?
Depending on my project, I would likely try to identify contiguous Zip codes with similar characteristics pertinent to the dependent variables and recode them. I would not combine Zips that were predominantly rural with low population density with urban or suburban ones, for example. If practical, you might look to replace them with Metropolitan Statistical Areas or similar.
Note that Proc Logistic handles categorical variables by creating internal dummy variables for the levels of the variable (minus one).
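A rough sketch of the geographic recoding, assuming US ZIP codes held in a numeric variable and that the SASHELP.ZIPCODE lookup table shipped with SAS is available at your site. The data set work.mydata and its three rows are made up purely so the example runs.

```sas
/* Hypothetical input: a few records keyed by numeric ZIP code. */
data work.mydata;
   input zip y;
   datalines;
27513 1
10001 0
90210 1
;
run;

/* Attach county, state and MSA from the SASHELP.ZIPCODE lookup. */
proc sql;
   create table work.mydata_geo as
   select a.*,
          z.countynm,
          z.statecode,
          z.msa
   from work.mydata as a
        left join sashelp.zipcode as z
        on a.zip = z.zip;
quit;

/* Compare the number of levels before and after recoding. */
proc sql;
   select count(distinct zip) as n_zip,
          count(distinct catx('-', statecode, countynm)) as n_county
   from work.mydata_geo;
quit;
```

The county or MSA variable can then go into the CLASS statement in place of the raw ZIP code.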
In no way should Zip codes EVER be allowed to be treated as continuous.
Cluster by using Greenacre's method.
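For anyone else searching: a minimal sketch of how Greenacre's method is commonly coded in SAS, assuming a binary 0/1 target. The data set work.train, the variables zip and y, and the choice of 5 clusters are all hypothetical; the simulated data is there only to make the example runnable.

```sas
/* Simulated data: 20 hypothetical zip levels with differing event rates. */
data work.train;
   call streaminit(2024);
   do zip = 10001 to 10020;
      p = 0.1 + 0.04 * mod(zip, 5);
      do i = 1 to 50;
         y = rand('bernoulli', p);
         output;
      end;
   end;
   keep zip y;
run;

/* 1. Event proportion and frequency for each level. */
proc means data=work.train noprint nway;
   class zip;
   var y;
   output out=work.levels mean=prop;
run;

/* 2. Ward clustering of the levels on the proportion, weighted by frequency. */
ods output clusterhistory=work.history;
proc cluster data=work.levels method=ward outtree=work.tree;
   freq _freq_;
   var prop;
   id zip;
run;

/* 3. Cut the tree at a chosen number of clusters (5 is arbitrary here). */
proc tree data=work.tree nclusters=5 out=work.zip_clusters noprint;
   id zip;
run;
```

The output data set work.zip_clusters maps each zip to a CLUSTER number, which can be merged back onto the modeling data and used as the class variable. The cluster history in work.history can help decide how many clusters to keep.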
Also, I never said we can make it continuous; I meant reducing it by assigning numbers to the levels.
thanks everyone..
@sameer112217 wrote:
Also, I never said we can make it continuous; I meant reducing it by assigning numbers to the levels.
From your original post:
"Make it continuous by reducing levels?"
Does every level have enough observations for the model?
If so, you could try PROC HPGENSELECT to pick out the most significant levels.
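Something along these lines, as a sketch only. It assumes the SPLIT class option is supported in your release, so that individual levels can enter or leave the model separately. The data set work.train and the variables zip and y are hypothetical and simulated here just so the step runs.

```sas
/* Simulated data: 20 hypothetical zip levels, a few with a higher event rate. */
data work.train;
   call streaminit(2024);
   do zip = 10001 to 10020;
      p = 0.15 + 0.35 * (mod(zip, 7) = 0);
      do i = 1 to 100;
         y = rand('bernoulli', p);
         output;
      end;
   end;
   keep zip y;
run;

/* Stepwise selection over the individual zip levels. */
proc hpgenselect data=work.train;
   class zip(split);
   model y(event='1') = zip / dist=binary;
   selection method=stepwise;
run;
```

Levels that are not selected are effectively pooled into the baseline, which is one way of collapsing the variable.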
Yes. For example, suppose we have zip codes for a city like Mumbai, where the Bandra region starts from 400064. Mumbai is in the state of Maharashtra. We could code it as 1 for Mumbai city irrespective of region, or something like that. What is the best way, in the real corporate world, to reduce the levels in a regression?
One way would be to create custom formats to group your codes. That way you need not actually change data and you could have multiple formats as needed. Proc Logistic will honor the formatted values to create groups. Here is a simplistic example modified from SAS online documentation to illustrate:
Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P  F  68   1  No   B  M  74  16  No   P  F  67  30  No   P  M  66  26  Yes
B  F  67  28  No   B  F  77  16  No   A  F  71  12  No   B  F  72  50  No
B  F  76   9  Yes  A  M  71  17  Yes  A  F  63  27  No   A  F  69  18  Yes
B  F  66  12  No   A  M  62  42  No   P  F  64   1  Yes  A  F  64  17  No
P  M  74   4  No   A  F  72  25  No   P  M  70   1  Yes  B  M  66  19  No
B  M  59  29  No   A  F  64  30  No   A  M  70  28  No   A  M  69   1  No
B  F  78   1  No   P  M  83   1  Yes  B  F  69  42  No   B  M  75  30  Yes
P  M  77  29  Yes  P  F  79  20  Yes  A  M  70  12  No   A  F  69  12  No
B  F  65  14  No   B  M  70   1  No   B  M  67  23  No   A  M  76  25  Yes
P  M  78  12  Yes  B  M  77   1  Yes  B  F  69  24  No   P  M  66   4  Yes
P  F  65  29  No   P  M  60  26  Yes  A  M  78  15  Yes  B  M  75  21  Yes
A  F  67  11  No   P  F  72  27  No   P  F  70  13  Yes  A  M  75   6  Yes
B  F  65   7  No   P  F  68  27  Yes  P  M  68  11  Yes  P  M  67  17  Yes
B  M  70  22  No   A  M  65  15  No   P  F  67   1  Yes  A  M  67  10  No
P  F  72  11  Yes  A  F  74   1  No   B  M  80  21  Yes  A  F  69   3  No
;
run;

proc format library=work;
   value $alttreat
      "P","A" = 'Alt'
   ;
run;

proc logistic data=Neuralgia;
   Title 'Original Treatment values';
   class Treatment Sex;
   model Pain= Treatment Sex Treatment*Sex Age Duration / expb;
run;

proc logistic data=Neuralgia;
   title "Formatted treatment values";
   class Treatment Sex;
   model Pain= Treatment Sex Treatment*Sex Age Duration / expb;
   format treatment $altTreat.;
run;
title;
In the case of cities that may have multiple codes, it would likely be relatively easy to use a reference data set to create one format that maps codes to city and another that maps them to province/state or similar.
I do this for my data, where a data source has only a postal code, to get either city or county (a sub-region of states within the USA).
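A minimal sketch of building such a format from a reference data set with the CNTLIN= option of Proc Format. The lookup table work.pin_lookup, its rows, and the format name pincity are made up for illustration; a real lookup would come from a postal reference file.

```sas
/* Hypothetical lookup table: one row per postal code with its city. */
data work.pin_lookup;
   input pincode city :$20.;
   datalines;
400050 Mumbai
400064 Mumbai
411001 Pune
411004 Pune
;
run;

/* Control data set: START is the code, LABEL is the group it maps to. */
data work.cntl;
   set work.pin_lookup(rename=(pincode=start city=label));
   retain fmtname 'pincity' type 'N';
run;

/* Create the numeric format PINCITY. from the control data set. */
proc format cntlin=work.cntl;
run;

/* The stored codes are unchanged; the format does the grouping. */
proc freq data=work.pin_lookup;
   tables pincode;
   format pincode pincity.;
run;
```

In Proc Logistic the same idea applies: list the postal code variable in the CLASS statement and add a FORMAT statement with the new format, as in the treatment example above.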