10-28-2016 06:02 AM
I have a big data set for conditional logistic regression where I want to split it into two sets: train and test. Data format as follow:
ID Y X
1 1 10
1 0 12
1 0 13
2 0 20
2 1 5
10000 0 11
10000 0 8
10000 1 16
10000 0 14
What I want is randomly pick ID with a ratio say, 7:3 on 10000 ID for train:test, and obtaining all the rows with the same ID.
Menawhile, how can I compute the predicted probability after running proc logistic procedure with strata ID ?
Thank you for your kind assistant.
10-28-2016 06:28 AM - edited 10-28-2016 06:29 AM
First, build a table with distinct IDs
proc sort data=have (keep=id) out=id nodupkey; by id; run;
proc sql; create table id as select distinct id from have ; quit;
or, if have is already sorted
data id; set have (keep=id); by if; if first.id; run;
Separate that into two datasets:
data train test; set id; if rand('uniform') <= 0.3 then output test; else output train; run;
Then you can merge back into your original dataset.
Depending on the state of your original dataset, you could create the lookup datasets by combining steps 3 & 4.
10-28-2016 07:47 AM
While I wouldn't be surprised if PROC SURVEYSELECT can do this, you can certainly cut down the number of steps:
If not already sorted, start there:
proc sort data=want;
Then just a single step will split the data:
data train test;
if first.id then do;
if ranuni(12345) < 0.7 then destination = 'train';
else destination = 'test';
if destination = 'train' then output train;
else output test;
10-29-2016 01:45 AM
data have; do id=1 to 100; do x=1 to 10; output; end; end; run; data train test; set have; by id; retain idx; if first.id then idx=ceil(100*rand('uniform')); if idx le 30 then output test; else output train; drop idx; run;