Programming the statistical procedures from SAS

Randomly splitting data for training and data set for conditional logistic regression

Reply
New Contributor
Posts: 3

Randomly splitting data for training and data set for conditional logistic regression

Hi all,

 

I have a big data set for conditional logistic regression where I want to split it into two sets: train and test.  Data format as follow:

ID Y X

1 1 10

1 0 12

1 0 13

2 0 20

2 1 5

.

.

10000 0 11

10000 0 8

10000 1 16

10000 0 14

What I want is randomly pick ID with a ratio say, 7:3 on 10000 ID for train:test, and obtaining all the rows with the same ID.

 

Menawhile, how can I compute the predicted probability after running proc logistic procedure with strata ID ?

 

Thank you for your kind assistant.

 

Esteemed Advisor
Posts: 6,267

Re: Randomly splitting data for training and data set for conditional logistic regression

[ Edited ]

First, build a table with distinct IDs

proc sort data=have (keep=id) out=id nodupkey;
by id;
run;

or

proc sql;
create table id as
select distinct id
from have
;
quit;

or, if have is already sorted

data id;
set have (keep=id);
by if;
if first.id;
run;

Separate that into two datasets:

data train test;
set id;
if rand('uniform') <= 0.3
then output test;
else output train;
run;

Then you can merge back into your original dataset.

Depending on the state of your original dataset, you could create the lookup datasets by combining steps 3 & 4.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Respected Advisor
Posts: 4,756

Re: Randomly splitting data for training and data set for conditional logistic regression

While I wouldn't be surprised if PROC SURVEYSELECT can do this, you can certainly cut down the number of steps:

 

If not already sorted, start there:

 

proc sort data=want;

by id;

run;

 

Then just a single step will split the data:

 

data train test;

set have;

by id;

if first.id then do;

   if ranuni(12345) < 0.7 then destination = 'train';

   else destination = 'test';

   retain destination;

end;

if destination = 'train' then output train;

else output test;

drop destination;

run;

Grand Advisor
Posts: 9,447

Re: Randomly splitting data for training and data set for conditional logistic regression

data have;
 do id=1 to 100;
  do x=1 to 10;
   output;
  end;
 end;
run;

data train test;
 set have;
 by id;
 retain idx;
 if first.id then idx=ceil(100*rand('uniform'));
 if idx le 30 then output test;
  else output train;
drop idx;
run;

Ask a Question
Discussion stats
  • 3 replies
  • 376 views
  • 0 likes
  • 4 in conversation