Randomly splitting data for training and data set for conditional logistic regression

New Contributor
Posts: 3

Hi all,

 

I have a large data set for conditional logistic regression that I want to split into two sets: train and test. The data format is as follows:

ID Y X
1 1 10
1 0 12
1 0 13
2 0 20
2 1 5
.
.
10000 0 11
10000 0 8
10000 1 16
10000 0 14

What I want is to randomly pick IDs from the 10000 IDs at a ratio of, say, 7:3 for train:test, and to keep all the rows that share the same ID together.

 

Meanwhile, how can I compute the predicted probability after running PROC LOGISTIC with STRATA ID?

 

Thank you for your kind assistance.

 


Accepted Solutions
Solution
‎10-19-2017 01:02 PM
SAS Employee
Posts: 3

Re: Randomly splitting data for training and data set for conditional logistic regression

"What I want is randomly pick ID with a ratio say, 7:3 on 10000 ID for train:test, and obtaining all the rows with the same ID."

 

You can do this directly with PROC SURVEYSELECT now, using the SAMPLINGUNIT statement. For example:

 

proc surveyselect data=have out=want method=srs samprate=0.70
                  outall seed=12345 noprint;
  samplingunit id;
run;

 

The OUTALL option outputs both the selected and unselected units. The automatic output variable SELECTED equals 1 for the selected units and 0 for the unselected units. In this case, the units are the IDs: 70% of the ID values are randomly selected, and each sampled ID includes all the observations for that ID value.
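
As a minimal sketch of the follow-up step, the WANT data set produced above can be separated into two data sets using the SELECTED flag. The output names TRAIN and TEST are assumptions chosen to match the question, not something PROC SURVEYSELECT creates itself:

data train test;
  set want;
  /* SELECTED = 1 marks rows whose ID was sampled (about 70% of IDs) */
  if selected = 1 then output train;
  else output test;
  drop selected;
run;

Because sampling is done on ID, every row of a given ID ends up in the same data set.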



All Replies
Super User
Posts: 8,381

Re: Randomly splitting data for training and data set for conditional logistic regression

First, build a table with distinct IDs

proc sort data=have (keep=id) out=id nodupkey;
by id;
run;

or

proc sql;
create table id as
select distinct id
from have
;
quit;

or, if HAVE is already sorted by ID:

data id;
set have (keep=id);
by id;
if first.id;
run;

Separate that into two datasets:

data train test;
set id;
if rand('uniform') <= 0.3
then output test;
else output train;
run;

Then you can merge back into your original dataset.

Depending on the state of your original dataset, you could create the lookup datasets in a single pass by combining the last two steps (the FIRST.ID data step and the random split).
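
The merge-back step is not spelled out in this reply; one possible sketch, assuming TRAIN and TEST hold one row per ID as created above, is a BY-merge with the original data. The names TRAIN_FULL and TEST_FULL are hypothetical, chosen only for illustration:

/* Sort both inputs by ID so the MERGE ... BY step is valid */
proc sort data=have;  by id; run;
proc sort data=train; by id; run;

data train_full test_full;
  merge have (in=in_have) train (in=in_train);
  by id;
  if in_have;                          /* keep only rows from the original data */
  if in_train then output train_full;  /* ID was assigned to TRAIN */
  else output test_full;               /* every other ID was assigned to TEST */
run;

Since every ID lands in exactly one of the two lookup tables, checking membership in TRAIN is enough to route each row.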

Super User
Posts: 5,746

Re: Randomly splitting data for training and data set for conditional logistic regression

Posted in reply to KurtBremser

While I wouldn't be surprised if PROC SURVEYSELECT can do this, you can certainly cut down the number of steps:

 

If not already sorted, start there:

 

proc sort data=have;
by id;
run;

 

Then just a single step will split the data:

 

data train test;
set have;
by id;
if first.id then do;
   if ranuni(12345) < 0.7 then destination = 'train';
   else destination = 'test';
   retain destination;
end;
if destination = 'train' then output train;
else output test;
drop destination;
run;

Super User
Posts: 10,219

Re: Randomly splitting data for training and data set for conditional logistic regression

data have;
 do id=1 to 100;
  do x=1 to 10;
   output;
  end;
 end;
run;

data train test;
 set have;
 by id;
 retain idx;
 if first.id then idx=ceil(100*rand('uniform'));
 if idx le 30 then output test;
  else output train;
drop idx;
run;
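
Because each ID here is assigned by an independent random draw, the split is only approximately 70:30 rather than exact. A quick way to check the realized proportions, sketched with the TRAIN and TEST data sets created above (the column aliases are illustrative):

proc sql;
  /* number of distinct IDs that landed in each data set */
  select count(distinct id) as n_train_ids from train;
  select count(distinct id) as n_test_ids  from test;
quit;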
