Hello Everybody!
Does anyone have any examples of using HPGENSELECT to partition a data set into train and test subsets, with a binary dependent variable and the LASSO selection method? I haven't found much by googling.
Thanks,
Brian
Maybe I am missing a key point, but I think just adding the statement:
PARTITION fraction(test=0.25 validate=0.25 seed=1);
ought to randomly assign records to the training, test and validate subsets: half of the original data goes to the training subset, and the remaining half is split evenly between the test and validate subsets (25% each). Other proportions are probably better suited for a binary response, but this should serve as a starting point.
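As a minimal, untested sketch of where that statement might sit in a complete call: the data set work.mydata, the response y, and the predictors x1 x2 x3 are hypothetical placeholders, and the choice of CHOOSE=VALIDATE is an assumption borrowed from the kind of validation-based selection discussed in this thread, not a recommendation.

```sas
/* Sketch only: work.mydata, y, x1, x2 and x3 are hypothetical placeholder names */
proc hpgenselect data=work.mydata;
   /* Binary response, modelling the probability that y = 1 */
   model y(event='1') = x1 x2 x3 / dist=binomial;
   /* Random split: 25% test, 25% validate, the remaining 50% train */
   partition fraction(test=0.25 validate=0.25 seed=1);
   /* Assumption: pick the LASSO step that fits the validate subset best */
   selection method=lasso(choose=validate);
run;
```

Fixing the seed should make the split reproducible from run to run; the fractions are the ones from the statement above and can be changed freely.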
SteveDenham
Steve,
Thanks for the reply! I think you are correct with the Partition statement. Have you used HPGENSELECT much? Does the code below seem reasonable and correct? Any guidance greatly appreciated!
proc surveyselect data=out.final method=srs seed=43543 outall samprate=0.8 out=subsets;
proc hpgenselect data=subsets lassosteps=50;
   model &depvar (event='1') = &VARLIST / dist=binomial;
   partition ROLEVAR=Selected(TRAIN="1" VALIDATE="0");
   selection method=LASSO(SELECT=SBC choose=validate stop=none);
run;
To be honest, all I know about HPGENSELECT is what I have read in the documentation - I've never actually had high-dimensional datasets where I had to do variable selection. Given that caveat, I want to ask about the PARTITION statement you propose.
It looks like there must be a variable called 'Selected' in the data set 'Subsets', which takes on values of 0, 1 and stuff that is not 0 and 1. Those with a 0 go into the test set, those with a 1 go into the validate set, and those with anything else go into the training set, which is what is operated on for variable selection. If Selected happens to only have 0 and 1 values, I don't think the PROC will work, as there would be no observations in the training set. This is how the PARTITION statement works in other regression PROCS.
Good luck on this.
SteveDenham
Steve,
Thanks again for your feedback. It is greatly appreciated!
I actually got the code below to run without errors and it looks like it put those with Selected=1 into the training data set and those with Selected=0 into the validation set. I got this in the lst file output:
* Create list of independent variables;
proc sql;
   select distinct field
      into :ind_vars separated by ' '
      from model.univariate_&dep_var;
quit;
* Randomly select 80% for training data;
proc surveyselect data=out.final
   method=srs
   seed=43543
   outall
   samprate=0.8
   out=subsets;
run;
ods listing;
proc hpgenselect data=subsets lassosteps=50;
   model case (event='1') = &ind_vars / dist=binomial;
   partition rolevar=selected(train="1" validate="0");
   selection method=lasso(choose=sbc);
   performance details;
run;
Although there were no errors in the log, the code stopped after Selected Effects: Intercept was entered into the model. Anyone know why that happened?
Can you share the output? My first thought is that no variables met the LASSO criterion for inclusion under SBC.
SteveDenham