I have a dataset that I need help with.
data help;
input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc;
datalines;
0 v1 v2 v3 v4 v5 1 0.003
1 v1 v2 v3 '' v5 1 0.004
2 v1 v2 v3 '' '' 1 0.007
3 '' v2 v3 '' '' 1 0.01
4 '' '' v3 '' '' 1 0.02
0 v1 v2 '' v4 v5 2 0.005
1 v1 '' '' v4 v5 2 0.006
2 v1 '' '' '' v5 2 0.02
3 v1 '' '' '' '' 2 0.03
;
run;
This dataset represents an iterative process of backward selection of model variables where
I am required to identify (for each outcome) the model step at which a specified number of variables yields
an AUC difference of 0.01, and then specify the previous model step variables as my model of choice. I want to do that using a lag function to get the table outlined below.
Because every row represents a model step (that is a number of variables leading to the desired AUC difference,
I would like to specify the row just before the model leading to the AUC difference of 0.01 for each outcome.*/
Any help with that, please?
Thanks
data help;
input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc;
datalines;
0 v1 v2 v3 v4 v5 1 0.003
1 v1 v2 v3 . v5 1 0.004
2 v1 v2 v3 . . 1 0.007
3 . v2 v3 . . 1 0.01
4 . . v3 . . 1 0.02
0 v1 v2 . v4 v5 2 0.005
1 v1 . . v4 v5 2 0.006
2 v1 . . . v5 2 0.02
3 v1 . . . . 2 0.03
;
run;
data want;
set help;
lag_auc1=lag(auc);
lag_auc2=lag2(auc);
lag_auc3=lag3(auc);
lag_auc4=lag4(auc);
dif_auc1=dif(auc);
dif_auc2=dif2(auc);
dif_auc3=dif3(auc);
dif_auc4=dif4(auc);
if outcome ne lag(outcome) then call missing(lag_auc1,dif_auc1);
if outcome ne lag2(outcome) then call missing(lag_auc2,dif_auc2);
if outcome ne lag3(outcome) then call missing(lag_auc3,dif_auc3);
if outcome ne lag4(outcome) then call missing(lag_auc4,dif_auc4);
run;
Thank you so much for the feedback. I have one more question.
If I wanted to specify the model step before which a 'dif_auc' was equal to 0.01, how can I code for that?
I am guessing:
if dif_auc= 0.01 then modelstep= _N_-1
I am not sure.
You want this ?
data help; input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc; datalines; 0 v1 v2 v3 v4 v5 1 0.003 1 v1 v2 v3 . v5 1 0.004 2 v1 v2 v3 . . 1 0.007 3 . v2 v3 . . 1 0.01 4 . . v3 . . 1 0.02 0 v1 v2 . v4 v5 2 0.005 1 v1 . . v4 v5 2 0.006 2 v1 . . . v5 2 0.02 3 v1 . . . . 2 0.03 ; run; data want; set help; lag_auc1=lag(auc); lag_auc2=lag2(auc); lag_auc3=lag3(auc); lag_auc4=lag4(auc); dif_auc1=dif(auc); dif_auc2=dif2(auc); dif_auc3=dif3(auc); dif_auc4=dif4(auc); if outcome ne lag(outcome) then call missing(lag_auc1,dif_auc1); if outcome ne lag2(outcome) then call missing(lag_auc2,dif_auc2); if outcome ne lag3(outcome) then call missing(lag_auc3,dif_auc3); if outcome ne lag4(outcome) then call missing(lag_auc4,dif_auc4); lag_modelstep=lag(modelstep); if round(dif_auc1,1e-6) ne 0.01 or outcome ne lag(outcome) then call missing(lag_modelstep); run;
Combine DO UNTIL() loops with BY processing instead of awkward LAG or DIF functions :
data want;
do until(last.outcome);
set help; by outcome;
if auc < 0.01 then lastStep = modelstep;
end;
do until(last.outcome);
set help; by outcome;
if modelstep = lastStep then output;
end;
drop lastStep;
run;
proc print noobs data=want; run;
All auc < 0.01 and no auc < 0.01 cases will be handled appropriately, i.e. by returning the last model and no model, respectively.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.