I have a dataset that I need help with.
data help;
input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc;
datalines;
0 v1 v2 v3 v4 v5 1 0.003
1 v1 v2 v3 '' v5 1 0.004
2 v1 v2 v3 '' '' 1 0.007
3 '' v2 v3 '' '' 1 0.01
4 '' '' v3 '' '' 1 0.02
0 v1 v2 '' v4 v5 2 0.005
1 v1 '' '' v4 v5 2 0.006
2 v1 '' '' '' v5 2 0.02
3 v1 '' '' '' '' 2 0.03
;
run;
This dataset represents an iterative process of backward selection of model variables where
I am required to identify (for each outcome) the model step at which a specified number of variables yields
an AUC difference of 0.01, and then specify the previous model step variables as my model of choice. I want to do that using a lag function to get the table outlined below.
Because every row represents a model step (that is a number of variables leading to the desired AUC difference,
I would like to specify the row just before the model leading to the AUC difference of 0.01 for each outcome.*/
Any help with that, please?
Thanks
data help;
input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc;
datalines;
0 v1 v2 v3 v4 v5 1 0.003
1 v1 v2 v3 . v5 1 0.004
2 v1 v2 v3 . . 1 0.007
3 . v2 v3 . . 1 0.01
4 . . v3 . . 1 0.02
0 v1 v2 . v4 v5 2 0.005
1 v1 . . v4 v5 2 0.006
2 v1 . . . v5 2 0.02
3 v1 . . . . 2 0.03
;
run;
data want;
set help;
lag_auc1=lag(auc);
lag_auc2=lag2(auc);
lag_auc3=lag3(auc);
lag_auc4=lag4(auc);
dif_auc1=dif(auc);
dif_auc2=dif2(auc);
dif_auc3=dif3(auc);
dif_auc4=dif4(auc);
if outcome ne lag(outcome) then call missing(lag_auc1,dif_auc1);
if outcome ne lag2(outcome) then call missing(lag_auc2,dif_auc2);
if outcome ne lag3(outcome) then call missing(lag_auc3,dif_auc3);
if outcome ne lag4(outcome) then call missing(lag_auc4,dif_auc4);
run;
Thank you so much for the feedback. I have one more question.
If I wanted to specify the model step before which a 'dif_auc' was equal to 0.01, how can I code for that?
I am guessing:
if dif_auc= 0.01 then modelstep= _N_-1
I am not sure.
You want this ?
data help; input modelstep var1 $ var2 $ var3 $ var4 $ var5 $ outcome auc; datalines; 0 v1 v2 v3 v4 v5 1 0.003 1 v1 v2 v3 . v5 1 0.004 2 v1 v2 v3 . . 1 0.007 3 . v2 v3 . . 1 0.01 4 . . v3 . . 1 0.02 0 v1 v2 . v4 v5 2 0.005 1 v1 . . v4 v5 2 0.006 2 v1 . . . v5 2 0.02 3 v1 . . . . 2 0.03 ; run; data want; set help; lag_auc1=lag(auc); lag_auc2=lag2(auc); lag_auc3=lag3(auc); lag_auc4=lag4(auc); dif_auc1=dif(auc); dif_auc2=dif2(auc); dif_auc3=dif3(auc); dif_auc4=dif4(auc); if outcome ne lag(outcome) then call missing(lag_auc1,dif_auc1); if outcome ne lag2(outcome) then call missing(lag_auc2,dif_auc2); if outcome ne lag3(outcome) then call missing(lag_auc3,dif_auc3); if outcome ne lag4(outcome) then call missing(lag_auc4,dif_auc4); lag_modelstep=lag(modelstep); if round(dif_auc1,1e-6) ne 0.01 or outcome ne lag(outcome) then call missing(lag_modelstep); run;
Combine DO UNTIL() loops with BY processing instead of awkward LAG or DIF functions :
data want;
do until(last.outcome);
set help; by outcome;
if auc < 0.01 then lastStep = modelstep;
end;
do until(last.outcome);
set help; by outcome;
if modelstep = lastStep then output;
end;
drop lastStep;
run;
proc print noobs data=want; run;
All auc < 0.01 and no auc < 0.01 cases will be handled appropriately, i.e. by returning the last model and no model, respectively.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.