The dataset has 72 unique observations (patients) with ~22 imaging measures of the hip per observation. The outcome is yes/no (n=31 and n=41).
In univariate analyses (one predictor at a time) prior to model entry, I quantified the c-statistic of each continuous imaging measure in its raw form (i.e., no adjustment or transformation). Still in the univariate phase, I then tested whether a non-linear form of each continuous variable improved the c-statistic compared with its linear (untransformed) form, using an EFFECT statement with a spline (KNOTMETHOD=EQUAL(4)). Many of the continuous predictors predicted the binary outcome better in this spline form.
Now I want to perform variable selection on all variables, including 3 categorical variables, several linear continuous variables, and several non-linear (i.e., spline) continuous variables. However, I can't figure out how to treat each spline variable as a single unit that is included or excluded as a whole, because each spline variable now has 4 columns. The first spline column is always 1, whereas the other 3 columns hold the continuous data.
How do I go about this? Below is the two-step code that first uses PROC LOGISTIC to create splines, then PROC HPGENSELECT for LASSO.
*** FIRST STEP - creating spline effects for some continuous variables ***;
proc logistic data=aga outdesign=spline_data;
effect spl_AnterAcetWallIndex=spline(XR_preop_AnterAcetWallIndex / knotmethod=equal(4) naturalcubic);
effect spl_XR_preop_ExtrusionIndex=spline(XR_preop_ExtrusionIndex / knotmethod=equal(4) naturalcubic);
effect spl_XR_preop_NeckShaftAngle=spline(XR_preop_NeckShaftAngle / knotmethod=equal(4) naturalcubic);
effect spl_XR_preop_ArtTrochDist=spline(XR_preop_ArtTrochDist / knotmethod=equal(4) naturalcubic);
effect spl_XR_preop_AcetDepth2Width=spline(XR_preop_AcetDepth2WidthRatio / knotmethod=equal(4) naturalcubic);
effect spl_CT_preop_AV_05cm_above_FHC=spline(CT_preop_AV_05cm_above_FHC / knotmethod=equal(4) naturalcubic);
effect spl_CT_preop_AlphaAngle=spline(CT_preop_AlphaAngle / knotmethod=equal(4) naturalcubic);
class XR_preop_IschSpSign XR_preop_PostWallSign Threed_preop_HetsClass;
model outcome_rev(event='0')=
XR_preop_IschSpSign
XR_preop_PostWallSign
XR_preop_COR
spl_AnterAcetWallIndex
XR_preop_PosterAcetWallIndex
spl_XR_preop_ExtrusionIndex
XR_preop_DistMed_FH2IlioischLine
spl_XR_preop_NeckShaftAngle
spl_XR_preop_ArtTrochDist
spl_XR_preop_AcetDepth2Width
Threed_preop_HetsClass
spl_CT_preop_AV_05cm_above_FHC
CT_preop_NeckShaftAngle
spl_CT_preop_AlphaAngle;
run;
*** SECOND STEP - performing LASSO-based variable selection ***;
proc hpgenselect data=spline_data namelen=60;
model outcome_rev=
XR_preop_IschSpSign0
XR_preop_PostWallSign0
XR_preop_COR
spl_AnterAcetWallIndex1
spl_AnterAcetWallIndex2
spl_AnterAcetWallIndex3
spl_AnterAcetWallIndex4
XR_preop_PosterAcetWallIndex
spl_XR_preop_ExtrusionIndex1
spl_XR_preop_ExtrusionIndex2
spl_XR_preop_ExtrusionIndex3
spl_XR_preop_ExtrusionIndex4
XR_preop_DistMed_FH2IlioischLine
spl_XR_preop_NeckShaftAngle1
spl_XR_preop_NeckShaftAngle2
spl_XR_preop_NeckShaftAngle3
spl_XR_preop_NeckShaftAngle4
spl_XR_preop_ArtTrochDist1
spl_XR_preop_ArtTrochDist2
spl_XR_preop_ArtTrochDist3
spl_XR_preop_ArtTrochDist4
spl_XR_preop_AcetDepth2Width1
spl_XR_preop_AcetDepth2Width2
spl_XR_preop_AcetDepth2Width3
spl_XR_preop_AcetDepth2Width4
Threed_preop_HetsClass1
Threed_preop_HetsClass2
spl_CT_preop_AV_05cm_above_FHC1
spl_CT_preop_AV_05cm_above_FHC2
spl_CT_preop_AV_05cm_above_FHC3
spl_CT_preop_AV_05cm_above_FHC4
CT_preop_NeckShaftAngle
spl_CT_preop_AlphaAngle1
spl_CT_preop_AlphaAngle2
spl_CT_preop_AlphaAngle3
spl_CT_preop_AlphaAngle4 / dist=binomial;
selection method=lasso(maxsteps=60) details=all;
run;
If you use PROC GENSELECT instead of PROC HPGENSELECT, you have an EFFECT statement (with regression splines) and you have a SELECTION statement (with LASSO).
PROC HPGENSELECT has no EFFECT statement.
The prefix "HP" in HPGENSELECT is for High-Performance and it's a multi-threaded procedure. PROC GENSELECT is not.
I guess - given the limited number of observations you have - PROC GENSELECT can do the job in a reasonable time.
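A rough, untested sketch of what that single-step call might look like, assuming a SAS Viya session in which the AGA table has been loaded into a caslib reachable through a libref called MYCAS (the libref name is hypothetical). Only two of the spline effects from the question are shown to keep it short. Whether the columns of each spline effect then enter or leave together should be checked in the selection output.

```sas
/* Sketch only: assumes SAS Viya, with AGA loaded into a caslib that is
   assigned to the (hypothetical) libref MYCAS.                         */
proc genselect data=mycas.aga;
  class XR_preop_IschSpSign XR_preop_PostWallSign Threed_preop_HetsClass;

  /* Spline effects are declared in the same step as the selection,
     so no OUTDESIGN= data set with separate spline columns is needed. */
  effect spl_AnterAcetWallIndex=spline(XR_preop_AnterAcetWallIndex / knotmethod=equal(4) naturalcubic);
  effect spl_CT_preop_AlphaAngle=spline(CT_preop_AlphaAngle / knotmethod=equal(4) naturalcubic);

  model outcome_rev(event='0')=
        XR_preop_IschSpSign
        XR_preop_PostWallSign
        Threed_preop_HetsClass
        XR_preop_COR
        XR_preop_PosterAcetWallIndex
        CT_preop_NeckShaftAngle
        spl_AnterAcetWallIndex
        spl_CT_preop_AlphaAngle / dist=binomial;

  selection method=lasso;
run;
```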
BR, Koen
sbxkoenk,
I searched the SAS documentation and could not find a PROC GENSELECT, only PROC GLMSELECT,
and that one is only for a continuous response variable Y, not for a binary one.
Rick's blog could give you a hint:
https://blogs.sas.com/content/iml/2018/08/01/variables-in-final-selected-model.html
I don't believe it is possible in SAS 9.4 to do Lasso selection for a logistic model and have the spline parameters enter or leave the model as a unit. However, if you have access to SAS Viya, you could use PROC LOGSELECT since it has both the EFFECT statement to define spline effects and the SELECTION statement for Lasso selection. In SAS 9.4, I think the best you can do is to use the selection methods available with the SELECTION= option in PROC LOGISTIC.
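For the SAS 9.4 fallback, a minimal sketch could look like the following. It reuses the EFFECT, CLASS, and MODEL statements from the question, drops the OUTDESIGN= step, and adds stepwise selection. Because the splines are declared with EFFECT statements in the same procedure, each spline is a single effect as far as the selection is concerned. The SLENTRY= and SLSTAY= values are illustrative assumptions, not recommendations.

```sas
/* Sketch for SAS 9.4: selection runs directly on the constructed effects. */
proc logistic data=aga;
  effect spl_AnterAcetWallIndex=spline(XR_preop_AnterAcetWallIndex / knotmethod=equal(4) naturalcubic);
  effect spl_XR_preop_ExtrusionIndex=spline(XR_preop_ExtrusionIndex / knotmethod=equal(4) naturalcubic);
  effect spl_XR_preop_NeckShaftAngle=spline(XR_preop_NeckShaftAngle / knotmethod=equal(4) naturalcubic);
  effect spl_XR_preop_ArtTrochDist=spline(XR_preop_ArtTrochDist / knotmethod=equal(4) naturalcubic);
  effect spl_XR_preop_AcetDepth2Width=spline(XR_preop_AcetDepth2WidthRatio / knotmethod=equal(4) naturalcubic);
  effect spl_CT_preop_AV_05cm_above_FHC=spline(CT_preop_AV_05cm_above_FHC / knotmethod=equal(4) naturalcubic);
  effect spl_CT_preop_AlphaAngle=spline(CT_preop_AlphaAngle / knotmethod=equal(4) naturalcubic);
  class XR_preop_IschSpSign XR_preop_PostWallSign Threed_preop_HetsClass;
  model outcome_rev(event='0')=
        XR_preop_IschSpSign XR_preop_PostWallSign XR_preop_COR
        spl_AnterAcetWallIndex XR_preop_PosterAcetWallIndex
        spl_XR_preop_ExtrusionIndex XR_preop_DistMed_FH2IlioischLine
        spl_XR_preop_NeckShaftAngle spl_XR_preop_ArtTrochDist
        spl_XR_preop_AcetDepth2Width Threed_preop_HetsClass
        spl_CT_preop_AV_05cm_above_FHC CT_preop_NeckShaftAngle
        spl_CT_preop_AlphaAngle
        / selection=stepwise
          slentry=0.10   /* illustrative entry threshold (assumption) */
          slstay=0.10    /* illustrative stay threshold (assumption)  */
          details;
run;
```

With DETAILS, the step-by-step output lists which effect entered or was removed at each step, so it is easy to confirm that a spline is handled as one effect rather than as separate columns.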