LadyIDO
Obsidian | Level 7

Hi,

Is there a workaround to use selection=score when there are categorical variables (>2 levels)? 

Thanks.

Anne

5 REPLIES
StatDave
SAS Super FREQ

That method is not available with CLASS variables. You could, of course, eliminate the CLASS statement, expand each CLASS variable into a set of separate dummy variables, and then use those in the MODEL statement. That can be done by using a separate PROC LOGISTIC step with the OUTDESIGN= and OUTDESIGNONLY options, and with the PARAM=REF option in the CLASS statement, similar to what is shown in this note. That has the complication that the selected models may contain only portions of some of the original categorical variables, but that could also be considered an advantage as a further simplification of the final model.

Alternatively, you could consider a more modern selection method such as the LASSO method, which is available by fitting the model in PROC HPGENSELECT. The LASSO method is based on adding a penalty to the likelihood function and shrinking the coefficients of unimportant variables to zero.
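As a rough sketch of the dummy-variable workaround described above (not code from the note that was linked): the data set Example, its variables y, x1-x3, and grp, and the output data set Design are all hypothetical names made up for illustration. The generated dummy columns are assumed to have names that begin with the original variable name (for example grp1, grp2, grp3), so check the PROC CONTENTS output before relying on the grp: prefix list.

```sas
/* Hypothetical example data: binary response y, continuous x1-x3,
   and a 4-level categorical variable grp */
data Example;
   call streaminit(2024);
   do i = 1 to 500;
      x1 = rand('uniform');
      x2 = rand('uniform');
      x3 = rand('uniform');
      grp = ceil(4*rand('uniform'));
      p = logistic(-1 + 2*x1 + 1.5*(grp=2));
      y = rand('bernoulli', p);
      output;
   end;
   drop i p;
run;

/* Step 1: write the design matrix only; no model is fit here */
proc logistic data=Example outdesign=Design outdesignonly;
   class grp / param=ref;
   model y = x1-x3 grp;
run;

/* Check the names given to the generated dummy columns */
proc contents data=Design;
run;

/* Step 2: no CLASS statement, so SELECTION=SCORE is allowed.
   grp: is assumed to match the dummy columns created in step 1. */
proc logistic data=Design;
   model y(event='1') = x1-x3 grp: / selection=score best=2;
run;

/* Alternative: LASSO selection, keeping grp as a CLASS variable */
proc hpgenselect data=Example;
   class grp;
   model y(event='1') = x1-x3 grp / dist=binary;
   selection method=lasso;
run;
```

BEST=2 limits the listing to the two highest-scoring models of each size. Because each dummy is a separate candidate in step 2, a selected model may keep only some levels of grp.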

LadyIDO
Obsidian | Level 7

Hi StatDave,

 

Could you elaborate on 

"That has the complication of the selected models containing only portions of some of the original categorical variables, but that could also be considered an advantage as a further simplification of the final model."

I'm also looking at using weight of evidence (WOE) coding (Lund, 7860-2016.pdf, sas.com).

PROC HPGENSELECT is something I meant to look at for some time. Thanks for the suggestion.

Anne

StatDave
SAS Super FREQ

Since the generated dummy variables will be treated as completely separate variables, the selection method could choose only a subset of them, so that only a portion of the original variable is selected. But this is a bit like reducing the number of categories in the original categorical variable in such a way as to include only the important contrasts with the reference level. That might be a desirable result when using a selection method to select a parsimonious model.

 

Another thought on an alternative to the SELECTION=SCORE method in PROC LOGISTIC: While it is not the same method and cannot be implemented in a single procedure step, something similar can be done by sequentially using the START= option in the MODEL statement and the MAXEFFECTS= option in the SELECTION statement in either PROC HPLOGISTIC or PROC HPGENSELECT. This method can be used with your CLASS variables. Note that MAXEFFECTS=1 allows only the intercept, so values greater than 1 select candidate predictors.

The following illustrates it using simulated data with candidate continuous variables x1-x5 and categorical variables c1 and c2. The known true model involves only the x1, x2, and c1 variables. Selection starts with MAXEFFECTS=2 to find the best model of size 1 (1 predictor). In the next step, the variable selected is specified in the START= option and MAXEFFECTS= is increased to 3 to find the best model of size 2 (2 predictors). And so on to find the best models of increasing size. This uses significance level for selection, but there are options in the SELECTION statement that let you select other criteria. Note that x: refers to all variables with names beginning with "x", and similarly for c: .

data Simdata;
   drop i j;
   array x{5} x1-x5;
   do i=1 to 1000;
      do j=1 to 5;
         x{j} = ranuni(1); /* Continuous predictors */
      end;
      c1 = int(1.5+ranuni(1)*7); /* Classification predictors */
      c2 = 1 + mod(i,3);
      yTrue = 2 + 5*x2 - 17*x1*x2 + 6*(c1=2) + 5*(c1=5);
      y = yTrue + 2*rannor(1);
      p = logistic(yTrue);
      b = rantbl(1,1-p,p)-1;
      output Simdata;
   end;
run;

/* Best model with 1 predictor (intercept + 1 effect) */
proc hplogistic data=Simdata;
   class c:;
   model b = x: c:;
   selection method=stepwise(maxeffects=2 sle=.3 sls=.5);
run;

/* Start from the effect selected above; allow 2 predictors */
proc hplogistic data=Simdata;
   class c:;
   model b = x: c: / start=(x1);
   selection method=stepwise(maxeffects=3 sle=.3 sls=.5);
run;

/* Allow 3 predictors */
proc hplogistic data=Simdata;
   class c:;
   model b = x: c: / start=(x1 c1);
   selection method=stepwise(maxeffects=4 sle=.3 sls=.5);
run;

/* Allow 4 predictors */
proc hplogistic data=Simdata;
   class c:;
   model b = x: c: / start=(x1 c1 x2);
   selection method=stepwise(maxeffects=5 sle=.3 sls=.5);
run;

/* Allow 5 predictors */
proc hplogistic data=Simdata;
   class c:;
   model b = x: c: / start=(x1 c1 x2 c2);
   selection method=stepwise(maxeffects=6 sle=.3 sls=.5);
run;
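
Since each step differs only in the START= list and the MAXEFFECTS= value, the repetition could be wrapped in a small macro. This is only a sketch: the macro name nextsize and its parameters start= and maxeff= are made up for illustration, it assumes the Simdata data set created above, and you still have to read the selected effects from each run's output and pass them to the next call yourself.

```sas
/* Hypothetical helper: one selection step of a given size.
   start=  effects already selected (blank for the first step)
   maxeff= MAXEFFECTS= value (number of predictors + 1 for the intercept) */
%macro nextsize(start=, maxeff=);
   %local startopt;
   /* Add the START= option only when effects are supplied */
   %if %length(&start) > 0 %then %let startopt = / start=(&start);

   proc hplogistic data=Simdata;
      class c:;
      model b = x: c: &startopt;
      selection method=stepwise(maxeffects=&maxeff sle=.3 sls=.5);
   run;
%mend nextsize;

/* First three steps of the sequence shown above */
%nextsize(maxeff=2);
%nextsize(start=x1, maxeff=3);
%nextsize(start=x1 c1, maxeff=4);
```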

LadyIDO
Obsidian | Level 7

Also, for some categorical variables, I got the 'Blank or duplicate name or invalid subscript' error when I used selection=score, but not with selection=stepwise. Any idea why this is?

LadyIDO
Obsidian | Level 7

Well, I can answer my own question. The values of that categorical variable are rather long. I renamed the levels to 1, 2, 3, ... and that solved the problem!
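
For anyone who hits the same error, here is one possible way to do that kind of recoding. It is only a sketch: the data sets Have, Levels, and Want and the variables region and region_id are hypothetical, and the codes are simply assigned in sorted order of the original values.

```sas
/* Hypothetical data with long category labels */
data Have;
   length region $40;
   input region $40.;
   datalines;
Northern coastal metropolitan district
Southern inland agricultural district
Northern coastal metropolitan district
Eastern highland mining district
;
run;

/* One row per distinct level, in sorted order */
proc sort data=Have(keep=region) out=Levels nodupkey;
   by region;
run;

/* Number the levels 1, 2, 3, ... */
data Levels;
   set Levels;
   region_id = _n_;
run;

/* Attach the short numeric code to every observation */
proc sql;
   create table Want as
   select h.*, l.region_id
   from Have as h left join Levels as l
   on h.region = l.region;
quit;
```

The short variable region_id can then be used in place of region, and the Levels data set serves as a lookup table for translating the codes back to the original labels.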
