BTAinRVA
Quartz | Level 8

Hello Everybody!

 

Does anyone have an example of using HPGENSELECT to partition a data set into train and test subsets, with a binary dependent variable and the LASSO selection method? I haven't found much by googling.

 

Thanks,

Brian

SteveDenham
Jade | Level 19

Maybe I am missing a key point, but I think just adding the statement:

 

PARTITION fraction(test=0.25 validate=0.25 seed=1);

 

ought to randomly assign records to the training, test and validate subsets, such that half the original data is in the training subset and then half of the remaining data would go to the test subset and the other half to the validate subset.  Other proportions are probably better suited for a binary response, but this should serve as a starting point.
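
For context, here is a minimal, untested sketch of where that statement would sit in a full call. The data set name work.mydata, the response y, and the predictors x1-x20 are placeholders rather than anything from your data, and the option names should be checked against the documentation for your release.

```sas
/* Hypothetical names: work.mydata, y and x1-x20 are placeholders */
proc hpgenselect data=work.mydata;
   /* Binary response, modeling the probability that y = 1 */
   model y(event='1') = x1-x20 / dist=binary;
   /* Random split: 25% test, 25% validate, remaining 50% train */
   partition fraction(test=0.25 validate=0.25 seed=1);
   /* LASSO path; pick the step that fits the validation subset best */
   selection method=lasso(choose=validate);
run;
```

Fixing the seed makes the split reproducible from run to run.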

 

SteveDenham

BTAinRVA
Quartz | Level 8

Steve,

Thanks for the reply! I think you are correct with the Partition statement. Have you used HPGENSELECT much? Does the code below seem reasonable and correct? Any guidance greatly appreciated!

 

proc surveyselect data=out.final method=srs seed=43543 outall
  samprate=0.8 out=subsets;
run;

proc hpgenselect data=subsets lassosteps=50;
  model &depvar (event='1') = &VARLIST / dist=binomial;
  partition ROLEVAR=Selected(TRAIN="1" VALIDATE="0");
  selection method=LASSO(SELECT=SBC choose=validate stop=none);
run;
SteveDenham
Jade | Level 19

@BTAinRVA 

 

To be honest, all I know about HPGENSELECT is what I have read in the documentation - I've never actually had high-dimensional datasets where I had to do variable selection.  Given that caveat, I want to ask about the PARTITION statement you propose.

 

It looks like there must be a variable called 'Selected' in the data set 'Subsets', which takes on values of 0, 1 and stuff that is not 0 and 1.  Those with a 0 go into the test set, those with a 1 go into the validate set, and those with anything else go into the training set, which is what is operated on for variable selection. If Selected happens to only have 0 and 1 values, I don't think the PROC will work, as there would be no observations in the training set.  This is how the PARTITION statement works in other regression PROCS.
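
A quick way to check which values Selected actually takes is a frequency table on the SURVEYSELECT output. As far as I know, OUTALL writes Selected=1 for sampled records and 0 for the rest, so I would expect only two levels, but it is worth confirming.

```sas
/* Count the distinct values of Selected in the SURVEYSELECT output */
proc freq data=subsets;
   tables selected / missing;   /* MISSING also lists missing values, if any */
run;
```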

 

Good luck on this.

 

SteveDenham

BTAinRVA
Quartz | Level 8

Steve,

Thanks again for your feedback. It is greatly appreciated!

I actually got the code below to run without errors and it looks like it put those with Selected=1 into the training data set and those with Selected=0 into the validation set. I got this in the lst file output:

[Screenshot: HPGenOutput.PNG]

* Create list of independent variables;
proc sql;
     select distinct
           field into: ind_vars separated by ' '
     from model.univariate_&dep_var
;quit;
 
* Randomly select 80% for training data;
proc surveyselect data=out.final
                       method=srs
                       seed=43543
                       outall
                       samprate=0.8
                       out=subsets;
run;
 
ods listing;
proc hpgenselect data=subsets lassosteps=50;
     model case (event='1') = &ind_vars / dist = binomial;
     partition rolevar = selected(train = "1" validate ="0");
     selection method = lasso(choose = sbc );
     performance details;
run;

Although there were no errors in the log, selection stopped after only the intercept had entered the model (the output shows Selected Effects: Intercept). Does anyone know why that happened?

SteveDenham
Jade | Level 19

Can you share the output?  My first thought is that no variables met the LASSO criterion for inclusion under SBC.
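
If that is the cause, one thing that might be worth trying is letting the validation partition pick the step instead of SBC. This is only a sketch that reuses the data set and macro variable from your post, and I am assuming CHOOSE=VALIDATE is accepted for LASSO in your release.

```sas
proc hpgenselect data=subsets lassosteps=50;
   model case(event='1') = &ind_vars / dist=binomial;
   partition rolevar=selected(train="1" validate="0");
   /* Assumption: choose the LASSO step by validation fit rather than SBC */
   selection method=lasso(choose=validate);
run;
```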

 

SteveDenham
