Posted 06-11-2020 01:31 PM

Hello Everybody!

Does anyone have an example of using HPGENSELECT with the LASSO selection method on a binary dependent variable, where the data set is partitioned into train and test subsets? I haven't found much by googling.

Thanks,

Brian

5 REPLIES

Maybe I am missing a key point, but I think just adding the statement:

PARTITION fraction(test=0.25 validate=0.25 seed=1);

ought to randomly assign records to the training, test and validate subsets, such that half the original data is in the training subset and then half of the remaining data would go to the test subset and the other half to the validate subset. Other proportions are probably better suited for a binary response, but this should serve as a starting point.

SteveDenham
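A minimal sketch of how the suggested PARTITION statement might sit inside a complete LASSO run on a binary response. The data set `work.toy`, the variables `y` and `x1`-`x5`, and the simulated coefficients are all hypothetical and exist only to make the example self-contained; `CHOOSE=VALIDATE` is one possible choice of criterion, not something the thread prescribes.

```sas
/* Hypothetical toy data: binary response y, five candidate predictors. */
data work.toy;
   call streaminit(2020);
   array x[5] x1-x5;
   do id = 1 to 2000;
      do j = 1 to 5;
         x[j] = rand('normal');
      end;
      /* Only x1 and x2 carry signal in this simulation. */
      eta = -0.5 + 1.2*x1 - 0.8*x2;
      y = rand('bernoulli', logistic(eta));
      output;
   end;
   drop j eta;
run;

proc hpgenselect data=work.toy;
   model y(event='1') = x1-x5 / dist=binary;
   /* 25% validation, 25% test, the remaining 50% is used for training. */
   partition fraction(validate=0.25 test=0.25 seed=1);
   /* Pick the LASSO step that does best on the validation partition. */
   selection method=lasso(choose=validate);
run;
```

With a real data set, only the `data=` option and the MODEL statement should need to change.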

Steve,

Thanks for the reply! I think you are correct with the Partition statement. Have you used HPGENSELECT much? Does the code below seem reasonable and correct? Any guidance greatly appreciated!

proc surveyselect data=out.final method=srs seed=43543 outall samprate=0.8 out=subsets;
proc hpgenselect data=subsets lassosteps=50;
model &depvar (event='1') = &VARLIST / dist=binomial;
partition ROLEVAR=Selected(TRAIN="1" VALIDATE="0");
selection method=LASSO(SELECT=SBC choose=validate stop=none);
run;

To be honest, all I know about HPGENSELECT is what I have read in the documentation - I've never actually had high-dimensional datasets where I had to do variable selection. Given that caveat, I want to ask about the PARTITION statement you propose.

It looks like there must be a variable called 'Selected' in the data set 'Subsets', which takes on values of 0, 1 and stuff that is not 0 and 1. Those with a 0 go into the test set, those with a 1 go into the validate set, and those with anything else go into the training set, which is what is operated on for variable selection. If Selected happens to only have 0 and 1 values, I don't think the PROC will work, as there would be no observations in the training set. This is how the PARTITION statement works in other regression PROCS.

Good luck on this.

SteveDenham
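One way to settle which values the role variable actually takes is to tabulate it before fitting. The sketch below uses `sashelp.class` purely as a stand-in input (an assumption, chosen because it ships with SAS); with `OUTALL`, PROC SURVEYSELECT keeps every observation and adds a `Selected` flag that is 1 for sampled rows and 0 otherwise.

```sas
/* Stand-in data: sashelp.class replaces the poster's out.final here. */
proc surveyselect data=sashelp.class
                  method=srs
                  seed=43543
                  samprate=0.8
                  outall
                  out=work.subsets;
run;

/* List the distinct values of Selected and how many rows have each. */
proc freq data=work.subsets;
   tables Selected / missing;
run;
```

If the table shows only 0 and 1, then a statement such as `partition rolevar=Selected(train='1' validate='0');` names a role for every observation explicitly.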

Steve,

Thanks again for your feedback. It is greatly appreciated!

I actually got the code below to run without errors and it looks like it put those with Selected=1 into the training data set and those with Selected=0 into the validation set. I got this in the lst file output:

```
* Create list of independent variables;
proc sql;
   select distinct field
      into :ind_vars separated by ' '
      from model.univariate_&dep_var;
quit;

* Randomly select 80% for training data (the terminating semicolon is required;
  without it the comment would swallow the PROC SURVEYSELECT statement);
proc surveyselect data=out.final
   method=srs
   seed=43543
   outall
   samprate=0.8
   out=subsets;
run;

ods listing;

proc hpgenselect data=subsets lassosteps=50;
   model case (event='1') = &ind_vars / dist=binomial;
   partition rolevar=selected(train="1" validate="0");
   selection method=lasso(choose=sbc);
   performance details;
run;
```

Although there were no errors in the log, the selection stopped after only the intercept had entered the model (Selected Effects: Intercept). Does anyone know why that happened?

Can you share the output? My first thought is that no variables met the LASSO criterion for inclusion under SBC.

SteveDenham
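A hedged sketch of how one might look into an intercept-only result: print the step-by-step selection details and compare a validation-based choice against SBC. The toy data, the `lassorho=0.9` value, and the switch to `choose=validate` are illustrative assumptions, not a confirmed fix for the run described above.

```sas
/* Hypothetical toy data: binary response y, five candidate predictors. */
data work.toy;
   call streaminit(2020);
   array x[5] x1-x5;
   do id = 1 to 2000;
      do j = 1 to 5;
         x[j] = rand('normal');
      end;
      y = rand('bernoulli', logistic(-0.5 + 1.2*x1 - 0.8*x2));
      output;
   end;
   drop j;
run;

/* Flag 80% of the rows as Selected=1 (training), the rest as Selected=0. */
proc surveyselect data=work.toy method=srs seed=43543
                  samprate=0.8 outall out=work.subsets;
run;

/* LASSORHO closer to 1 shrinks the penalty more slowly from step to step
   (illustrative value); DETAILS=ALL prints what happened at each step. */
proc hpgenselect data=work.subsets lassosteps=50 lassorho=0.9;
   model y(event='1') = x1-x5 / dist=binary;
   partition rolevar=Selected(train='1' validate='0');
   selection method=lasso(choose=validate) details=all;
run;
```

Rerunning the same step with `choose=sbc` and comparing the step tables would show whether the criterion, rather than the data, is what keeps the model at the intercept.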
