Hi, how can I do, say, 10-fold cross-validation with resampling in PROC LOGISTIC? Apparently this is as good as the bootstrap (sampling with replacement): "...We also carried out cross-validation with replication. Here the cross-validation was replicated r times, with a different random split into k groups each time..." This is the article: http://m.aje.oxfordjournals.org/content/early/2014/06/24/aje.kwu140.full.pdf . Help most appreciated!
P.S. Sorry if I have the terminology jumbled up. Also, I have no access to the HP procs.
Thanks,
Saiful.
This looks fairly easy.
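%macro k_fold_cv(k=10);
  %local i;
  ods select none;  /* suppress printed output while the folds run */

  /* Randomly assign each observation to one of k groups (variable GroupID) */
  proc surveyselect data=sashelp.heart group=&k out=have;
  run;

  %do i=1 %to &k;
    /* The current fold is held out as the test set; the rest is the training set */
    data training;
      set have(where=(groupid ne &i));
    run;
    data test;
      set have(where=(groupid eq &i));
    run;

    /* native = c statistic on the training data; true = AUC on the held-out fold */
    ods output
      Association=native(keep=label2 nvalue2 rename=(nvalue2=native) where=(label2='c'))
      ScoreFitStat=true(keep=dataset freq auc rename=(auc=true));
    proc logistic data=training
                  outest=est(keep=_status_ _name_);
      class sex;
      model status(event='Alive')=sex height weight;
      score data=test fitstat;
    run;

    /* One row per fold: optimism = training AUC minus test AUC */
    data score&i;
      merge true native est;
      retain id &i;
      optimism=native-true;
    run;
  %end;

  /* Stack the per-fold results */
  data k_fold_cv;
    set score1-score&k;
  run;
  ods select all;
%mend;
%k_fold_cv(k=10)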
/*************************************/
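%macro k_fold_cv_rep(r=1,k=10);
  %local rep i;
  ods select none;  /* suppress printed output while the folds run */

  /* Loop variable is named rep so that it does not overwrite the parameter r */
  %do rep=1 %to &r;
    /* A fresh random split into k groups for every replication */
    proc surveyselect data=sashelp.heart group=&k out=have;
    run;

    %do i=1 %to &k;
      /* The current fold is held out as the test set; the rest is the training set */
      data training;
        set have(where=(groupid ne &i));
      run;
      data test;
        set have(where=(groupid eq &i));
      run;

      /* native = c statistic on the training data; true = AUC on the held-out fold */
      ods output
        Association=native(keep=label2 nvalue2 rename=(nvalue2=native) where=(label2='c'))
        ScoreFitStat=true(keep=dataset freq auc rename=(auc=true));
      proc logistic data=training
                    outest=est(keep=_status_ _name_);
        class sex;
        model status(event='Alive')=sex height weight;
        score data=test fitstat;
      run;

      /* One row per replication and fold */
      data score_r&rep._&i;
        merge true native est;
        retain rep &rep id &i;
        optimism=native-true;
      run;
    %end;
  %end;

  /* Stack every SCORE_R data set in WORK, including any left over from earlier runs */
  data k_fold_cv_rep;
    set score_r:;
  run;
  ods select all;
%mend;
%k_fold_cv_rep(r=20,k=10);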
/********************/
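/* Combine both result sets and record which data set each row came from */
data all;
  set k_fold_cv k_fold_cv_rep indsname=indsn;
  length indsname $ 32;
  indsname=indsn;
run;

/* Mean optimism with 95% confidence limits, separately for each method */
proc summary data=all nway;
  class indsname;
  var optimism;
  output out=want mean=mean lclm=lclm uclm=uclm;
run;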
How big is your data?
The methods in this paper are what you're looking for. Essentially, use PROC SURVEYSELECT to generate the random samples, run PROC LOGISTIC on all of the samples at once using a BY group, and then summarize the results with PROC SURVEYMEANS or PROC MEANS.
http://www2.sas.com/proceedings/forum2007/183-2007.pdf
Edit: Realized this is IML so feel free to disregard this message if it's irrelevant, but this would be a perfectly valid way to approach your problem. PS I would find a worked example and work through it to verify that you understand your calculations thoroughly. I once spent 3 days debugging a bootstrap because I didn't realize the denominator was n-1 vs n....
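For what it's worth, the SURVEYSELECT / BY-group / MEANS approach described above could look roughly like the sketch below. This is only an illustration under my own assumptions, not code from the linked paper: I borrowed the SASHELP.HEART model from the other reply in this thread, and the seed (12345), the number of replicates (200), and the data set names BOOT and ASSOC are arbitrary choices. It resamples with replacement and collects the c statistic from each replicate.

```sas
/* Sketch only: seed, REPS=, and the data set names BOOT and ASSOC are assumptions */

/* 1. Draw 200 samples with replacement, each the same size as the input data. */
/*    OUTHITS writes one row per selection; the samples are indexed by Replicate. */
proc surveyselect data=sashelp.heart out=boot seed=12345
                  method=urs samprate=1 outhits reps=200;
run;

/* 2. Fit the model once per replicate with a BY statement instead of a macro loop. */
ods exclude all;   /* suppress printed output; ODS OUTPUT still creates its data set */
ods output Association=assoc(keep=replicate label2 nvalue2 where=(label2='c'));
proc logistic data=boot;
  by replicate;
  class sex;
  model status(event='Alive')=sex height weight;
run;
ods exclude none;

/* 3. Summarize the c statistic across the replicates. */
proc means data=assoc n mean std p5 p95;
  var nvalue2;
run;
```

The BY statement replaces the macro loop, which is usually much faster than calling the procedure once per sample. Note that this sketch gives the resampling distribution of the apparent c statistic only; it does not score a held-out fold the way the macros in the other reply do.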