DATA Step, Macro, Functions and more

all combination with conditions

Reply
Occasional Contributor
Posts: 15

all combination with conditions

Hi everyone,

 

I would like to ask for helping with the programming logic. I have a list of variables after clustering which will be used in the regression step for model building. Please see the table below:

 

1.PNG

 

 

I have used the code proc reg to create all combination before the regression step. Please see the code below:

 

/*Combination selection*/

proc sql noprint;
select Variable
into :varlist
separated by " "
from PD.cluster_&segment._select;
quit;

%macro select(n,out);

dm 'odsresults; clear';

proc reg data=PD.MEF_ALL_&segment.;
model logit_ODR=&varlist.
/ selection=adjrsq start=&n. stop=&n.;
ods output subsetselsummary=&out._&n.;
quit;

%mend;

%select(1,select);
%select(2,select);
%select(3,select);

However, the combination which have been create from the code above are contain the same variables in same cluster or the same sub_group. I would like the help for logic to avoid a combination with same cluster or same sub_group. The combination should contain only difference cluster or not same sub_group.

 

Please help for programming logic, I have been working on this for 2 days but I couldn't figure it out.

 

Thank a lot.

Respected Advisor
Posts: 2,661

Re: all combination with conditions

I can't help with the programming logic, but I can help with the analysis logic, which is poor. Running all these stepwise regressions is not a good idea. Stepwise by itself is not a good idea (just go to your favorite search engine, type in "problems with stepwise regression").

 

A better solution is to fit your model using PROC PLS (Partial Least Squares) which will perform much better with all these correlated variables than anything stepwise regression can produce.

--
Paige Miller
PROC Star
Posts: 8,117

Re: all combination with conditions

How to you want to determine which variables are included and excluded? I presume that you want to exclude independent variables based on what is already selected. That is, no action needed when n=1 (let's assume that the variable selected was durable_C_lg6, from Cluster_num 1 and the Sub_group durable). For n=2, the first variable will be the one from n=1 (i.e., durable_C_lg6), but the second variable has to be the next one that adds the most to the model but selected from a list that doesn't contain any variable from either Cluster_num=1 or the Sub_group durable (let's assume that will be PCI_C_lg1 (from Cluster_num=2 and Sub_group=PCI).

 

Then, for n=3, you want to select durable_C_lg6 , PCI_C_lg1 and the next variable that accounts for the most variance, but from a list that doesn't include any variables from Cluster_nums 1 or 2, or from Sub_groups durable or PCI.

 

Is that what you are trying to accomplish?

 

Art, CEO, AnalystFinder.com

 

 

Occasional Contributor
Posts: 15

Re: all combination with conditions

Yes, this is what I mean.

 

For n=3 variable will be I_CVM_C_lg8, which is from cluster 3 and not in durable sub group.

 

Any suggestion on this?

PROC Star
Posts: 8,117

Re: all combination with conditions

@LengYi: You didn't provide an example dataset, so my suggested code (below) starts by creating a test dataset and variable list.

I didn't bother to turn it into a macro, but that could easily be done if the process has to be run repeatedly.

 

Let me/us know if the following logic matches what you want to accomplish:

 

data have (drop=_:);
  set sashelp.class (drop=name rename=(sex=_sex));
  if _sex eq 'M' then sex=0;
  else sex=1;
  age2=age**2;
  ageplus=age+5;
  height2=height**2;
  heightplus=height+5;
  weight2=weight**2;
  weightplus=weight+5;
run;

data varlist;
  informat Variable $upcase10.;
  informat Sub_group $upcase10.;
  input Variable Cluster_num Sub_group $;
  cards;
age 1 age
age2 1 age
ageplus 1 ageplus
height 2 height
height2 2 height
heightplus 2 heightplus
weight 3 weight
weight2 3 weight
weightplus 3 weightplus
;

proc sort data=varlist;
  by Variable;
run;

proc sql noprint;
  select upcase(Variable)
    into :varlist separated by " "
      from /*PD.cluster_&segment._select*/ varlist
  ;
quit;

dm 'odsresults; clear';

title 'Model 1';
proc reg data=/*PD.MEF_ALL_&segment.*/ have;
  model /*logit_ODR*/sex=&varlist.
   / selection=adjrsq start=1/*&n.*/ stop=1/*&n.*/;
  ods output subsetselsummary=/*&out.*/select_1/*_&n.*/;
quit;

data included (keep=variable);
  set select_1 (obs=1);
  i=1;
  do while (scan(VarsInModel,i) ne '');
    variable=upcase(scan(VarsInModel,i));
    i+1;
    output;
  end;
run;

proc sort data=included;
  by variable;
run;

data included;
  merge included (in=ina) varlist;
  by variable;
  if ina;
run;

proc sql noprint;
  select Cluster_num, Sub_group
    into :clusters separated by ',',
         :subgroups separated by ','
      from included
  ;
quit;
%put &clusters.;
%put &subgroups.;

proc sql noprint;
  select a.Variable
    into :varlist separated by ' '
      from /*PD.cluster_&segment._select*/ varlist a
        left join included b
          on a.variable eq b.variable
            where a.variable eq b.variable
              or
                (a.Cluster_num not in (&clusters.) and
                 a.Sub_group not in ("&subgroups."))
  ;
quit;
%put &varlist.;

title 'Model 2';
proc reg data=/*PD.MEF_ALL_&segment.*/ have;
  model /*logit_ODR*/sex=&varlist.
   / selection=adjrsq start=2/*&n.*/ stop=2/*&n.*/;
  ods output subsetselsummary=/*&out.*/select_2;
quit;
data included (keep=variable);
  set select_2 (obs=1);
  i=1;
  do while (scan(VarsInModel,i) ne '');
    variable=upcase(scan(VarsInModel,i));
    i+1;
    output;
  end;
run;

proc sort data=included;
  by variable;
run;

data included;
  merge included (in=ina) varlist;
  by variable;
  if ina;
run;

proc sql noprint;
  select Cluster_num, Sub_group
    into :clusters separated by ',',
         :subgroups separated by ','
      from included
  ;
quit;
%put &clusters.;
%put &subgroups.;

proc sql noprint;
  select a.Variable
    into :varlist separated by ' '
      from /*PD.cluster_&segment._select*/ varlist a
        left join included b
          on a.variable eq b.variable
            where a.variable eq b.variable
              or
                (a.Cluster_num not in (&clusters.) and
                 a.Sub_group not in ("&subgroups."))
  ;
quit;
%put &varlist.;

title 'Model 3';
proc reg data=/*PD.MEF_ALL_&segment.*/ have;
  model /*logit_ODR*/sex=&varlist.
   / selection=adjrsq start=3/*&n.*/ stop=3/*&n.*/;
  ods output subsetselsummary=/*&out.*/select_3;
quit;

Art, CEO, AnalystFinder.com

 

Respected Advisor
Posts: 2,661

Re: all combination with conditions

Or you could just decide to not do any variable selection using Stepwise, and get a better model fit using PLS. Better results, less effort, it's a win!

--
Paige Miller
PROC Star
Posts: 8,117

Re: all combination with conditions

Posted in reply to PaigeMiller

I don't disagree with @PaigeMiller regarding the problems with stepwise regression but, in some cases (and possibly yours), it may be the best available method for selecting variables from a group. Since you have clusters, and are only selecting one variable per cluster, your process should help limit the possibility of multicolinearity.

 

The principal problem (I think) that most have raised with stepwise regression, is the use of the resulting statistics. However, if you use the model to select a model, and then run the resulting model on a non-stepwise regression, I'd think that would eliminate one of the concerns.

 

Similarly, if you separate your data into test and validation subsets, build the model on the test subset (say a randomly selected 30% of the data), then use that model in a non-stepwise regression and then validate it (and get your model statistics) from the validation subset, that should correct for another of the main concerns.

 

However, I'm not a statistician, thus would appreciate hearing from them regarding what I just said.

 

Art, CEO, AnalystFinder.com

 

Respected Advisor
Posts: 2,661

Re: all combination with conditions

@art297, there is a second problem with the method proposed by @LengYi

 

He begins with a clustering to select variables. Typically (although LengYi doesn't say this), this clustering is done without taking into account whether or not the variables are predictive of the Y-variable(s). To me this is a fatal flaw, and could fail to find variable that are predcitive, in which case the final stepwise regression model will also not have these predictive variables. PLS doesn't have this drawback.

--
Paige Miller
Ask a Question
Discussion stats
  • 7 replies
  • 130 views
  • 0 likes
  • 3 in conversation