viollete
Calcite | Level 5

Hello,

 

I have this type of longitudinal data (I have over 130 hospitals in my data set):

 

Hospital_id  price ...
1   56
1   75
1   45
1   74
2   52
2   57
2   49
2   75
3   34
3   45
3   56
.........

 

I want to do leave-one-out cross-validation, holding out one hospital at a time. Something like this:

 

1. Split the data into train (hospitals 2 and 3) and test (hospital 1).

2. Run the analysis on train.

3. Then split the data again into train (hospitals 1 and 3) and test (hospital 2).

 and so on...

 

How can I do the data splitting automatically?

 

Thanks

Ksharp
Super User

@Rick_SAS wrote a blog post about it just a couple of days ago.

I would use proc surveyselect .........

 

 

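/* one row per distinct name (the unit to split on) */
proc freq data=sashelp.class noprint;
table name/out=key;
run;
/* assign each name at random: about 70% to train, the rest to test */
data _train _test;
 set key;
 if rand('bern',0.7) then output _train;
 else output _test;
run;
/* pull the full records for each group of names */
proc sql;
create table train as
 select * from sashelp.class where name in (select name from _train);
create table test as
 select * from sashelp.class where name in (select name from _test);
quit;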
Leave-one-out CV would look something like this (using the KEY table above and CALL EXECUTE to generate the following code for each name):
/* 'xxxxxxxx' stands for the name held out in the current fold */
proc sql;
create table train as
 select * from sashelp.class where name ne 'xxxxxxxx';
create table test as
 select * from sashelp.class where name = 'xxxxxxxx';
quit;
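A minimal sketch of how the KEY table and CALL EXECUTE could be combined, assuming the held-out name goes into test and everything else into train. The generated statements are queued and run after the DATA step ends, so any analysis for a fold would have to be generated in the same iteration, before the next fold overwrites train and test.

```sas
/* Sketch only: generate one PROC SQL step per distinct name in KEY. */
data _null_;
 set key;
 length q $ 64;
 /* quote() wraps the name in double quotes for use as an SQL literal */
 q = quote(strip(name));
 call execute('proc sql;');
 call execute('create table train as select * from sashelp.class where name ne ' !! strip(q) !! ';');
 call execute('create table test as select * from sashelp.class where name = ' !! strip(q) !! ';');
 call execute('quit;');
 /* the analysis code for this fold would be generated here */
run;
```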
Kurt_Bremser
Super User

Something like this?

/* create the base data */
data have;
input hosp_id price;
cards;
1 56
1 75
1 45
1 74
2 52
2 57
2 49
2 75
3 34
3 45
3 56
;
run;

/* extract distinct id's */
proc sort
  data=have (keep=hosp_id)
  out=exclusions
  nodupkey
;
by hosp_id;
run;

/* a macro to wrap all the analysis code in, and the split */
%macro analysis(hosp_id);

data
  train
  validate
;
set have;
if hosp_id = &hosp_id
then output validate;
else output train;
run;

/* training and check against validate goes here */

%mend;

/* call the macro repeatedly from the distinct id's */
data _null_;
set exclusions;
call execute('%nrstr(%analysis(' !! put(hosp_id,best.) !! '));');
run;

When you run the code, you can see in the log that three different sets of train/validate data are created.

Make sure that each individual call of the macro creates a separate set of result datasets, or you will only get the result of the last iteration.
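
One possible way to keep the results of every iteration is sketched below. The macro name analysis_keep, the datasets fold_result and all_results, and the variables train_mean_price and held_out are made up for illustration, and PROC MEANS is only a placeholder for the real analysis. Each call tags its result with the held-out hospital and appends it to one accumulating dataset.

```sas
/* Hypothetical variant of the macro that keeps one result row per fold */
%macro analysis_keep(hosp_id);

/* same split as before: the given hospital is held out */
data train validate;
set have;
if hosp_id = &hosp_id
then output validate;
else output train;
run;

/* placeholder analysis: mean price over the training hospitals */
proc means data=train noprint;
var price;
output out=fold_result (drop=_type_ _freq_) mean=train_mean_price;
run;

/* tag the result with the hospital that was left out */
data fold_result;
set fold_result;
held_out = &hosp_id;
run;

/* accumulate; PROC APPEND creates all_results on the first call */
proc append base=all_results data=fold_result;
run;

%mend;

/* remove results left over from an earlier run, if any */
proc datasets library=work nolist nowarn;
delete all_results;
quit;

/* one macro call per distinct hospital id */
data _null_;
set exclusions;
call execute('%nrstr(%analysis_keep(' !! strip(put(hosp_id,best.)) !! '));');
run;
```

After the run, all_results should hold one row per held-out hospital. An alternative would be to write separately named datasets per call (for example with the id as a suffix), which is easier to inspect but harder to summarize.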