I have a dataset that I need help with.
data help;
input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc;
datalines;
0 v1 v2 v3 v4 v5 1 0.003
1 v1 v2 v3 '' v5 1 0.004
2 v1 v2 v3 '' '' 1 0.007
3 '' v2 v3 '' '' 1 0.01
4 '' '' v3 '' '' 1 0.02
0 v1 v2 '' v4 v5 2 0.005
1 v1 '' '' v4 v5 2 0.006
2 v1 '' '' '' v5 2 0.02
3 v1 '' '' '' '' 2 0.03
;
run;
This dataset represents an iterative process of backward selection of model variables where
I am required to identify (for each outcome) the model step at which a specified number of variables yields
an AUC difference of 0.01, and then specify the previous model step variables as my model of choice. I want to do that using a lag function to get the table outlined below.
Because every row represents a model step (that is a number of variables leading to the desired AUC difference,
I would like to specify the row just before the model leading to the AUC difference of 0.01 for each outcome.*/
Any help with that, please?
Thanks
data help;
input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc;
datalines;
0 v1 v2 v3 v4 v5 1 0.003
1 v1 v2 v3 . v5 1 0.004
2 v1 v2 v3 . . 1 0.007
3 . v2 v3 . . 1 0.01
4 . . v3 . . 1 0.02
0 v1 v2 . v4 v5 2 0.005
1 v1 . . v4 v5 2 0.006
2 v1 . . . v5 2 0.02
3 v1 . . . . 2 0.03
;
run;
data want;
set help;
lag_auc1=lag(auc);
lag_auc2=lag2(auc);
lag_auc3=lag3(auc);
lag_auc4=lag4(auc);
dif_auc1=dif(auc);
dif_auc2=dif2(auc);
dif_auc3=dif3(auc);
dif_auc4=dif4(auc);
if outcome ne lag(outcome) then call missing(lag_auc1,dif_auc1);
if outcome ne lag2(outcome) then call missing(lag_auc2,dif_auc2);
if outcome ne lag3(outcome) then call missing(lag_auc3,dif_auc3);
if outcome ne lag4(outcome) then call missing(lag_auc4,dif_auc4);
run;
Thank you so much for the feedback. I have one more question.
If I wanted to specify the model step before which a 'dif_auc' was equal to 0.01, how can I code for that?
I am guessing:
if dif_auc= 0.01 then modelstep= _N_-1
I am not sure.
You want this ?
data help; input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc; datalines; 0 v1 v2 v3 v4 v5 1 0.003 1 v1 v2 v3 . v5 1 0.004 2 v1 v2 v3 . . 1 0.007 3 . v2 v3 . . 1 0.01 4 . . v3 . . 1 0.02 0 v1 v2 . v4 v5 2 0.005 1 v1 . . v4 v5 2 0.006 2 v1 . . . v5 2 0.02 3 v1 . . . . 2 0.03 ; run; data want; set help; lag_auc1=lag(auc); lag_auc2=lag2(auc); lag_auc3=lag3(auc); lag_auc4=lag4(auc); dif_auc1=dif(auc); dif_auc2=dif2(auc); dif_auc3=dif3(auc); dif_auc4=dif4(auc); if outcome ne lag(outcome) then call missing(lag_auc1,dif_auc1); if outcome ne lag2(outcome) then call missing(lag_auc2,dif_auc2); if outcome ne lag3(outcome) then call missing(lag_auc3,dif_auc3); if outcome ne lag4(outcome) then call missing(lag_auc4,dif_auc4); lag_modelstep=lag(modelstep); if round(dif_auc1,1e-6) ne 0.01 or outcome ne lag(outcome) then call missing(lag_modelstep); run;
Combine DO UNTIL() loops with BY processing instead of awkward LAG or DIF functions :
data want;
do until(last.outcome);
set help; by outcome;
if auc < 0.01 then lastStep = modelstep;
end;
do until(last.outcome);
set help; by outcome;
if modelstep = lastStep then output;
end;
drop lastStep;
run;
proc print noobs data=want; run;
All auc < 0.01 and no auc < 0.01 cases will be handled appropriately, i.e. by returning the last model and no model, respectively.
Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.
Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.