
# Randomly splitting data for training and data set for conditional logistic regression

Hi all,

I have a large data set for conditional logistic regression that I want to split into two sets: train and test. The data look like this:

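| ID | Y | X |
|---|---|---|
| 1 | 1 | 10 |
| 1 | 0 | 12 |
| 1 | 0 | 13 |
| 2 | 0 | 20 |
| 2 | 1 | 5 |
| ... | ... | ... |
| 10000 | 0 | 11 |
| 10000 | 0 | 8 |
| 10000 | 1 | 16 |
| 10000 | 0 | 14 |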

What I want is to randomly assign the 10000 IDs to train and test at a ratio of, say, 7:3, and keep all the rows that share an ID together in the same set.

Also, how can I compute the predicted probabilities after running PROC LOGISTIC with STRATA ID?

Thank you for your kind assistance.

Accepted Solutions
Solution
‎10-19-2017 01:02 PM
SAS Employee
Posts: 3

## Re: Randomly splitting data for training and data set for conditional logistic regression

"What I want is randomly pick ID with a ratio say, 7:3 on 10000 ID for train:test, and obtaining all the rows with the same ID."

You can do this directly with PROC SURVEYSELECT now, using the SAMPLINGUNIT statement. For example:

    proc surveyselect data=have out=want method=srs samprate=0.70
                      outall seed=12345 noprint;
        samplingunit id;
    run;

The OUTALL option outputs both the selected and unselected units. The automatic output variable SELECTED equals 1 for the selected units and 0 for the unselected units. In this case, the units are the IDs: 70% of the ID values are randomly selected, and each sampled ID includes all the observations for that ID value.
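As a minimal sketch of the follow-up step, the OUTALL output can be split into two data sets on the SELECTED flag described above. The data set names `train` and `test` are assumptions chosen to match the question; `want` is the output data set from the PROC SURVEYSELECT call.

```sas
/* Split the OUTALL output by the SELECTED flag (1 = sampled ID, 0 = not sampled). */
/* The names train and test are illustrative, not part of the original answer.     */
data train test;
    set want;
    if selected = 1 then output train;   /* roughly 70% of IDs, all rows per ID */
    else output test;                    /* the remaining IDs                   */
    drop selected;
run;
```

Because selection happens at the ID level, every row of a given ID lands in the same data set.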

All Replies
Super User
Posts: 8,381

## Re: Randomly splitting data for training and data set for conditional logistic regression

First, build a table with distinct IDs:

```
proc sort data=have (keep=id) out=id nodupkey;
    by id;
run;
```

or

```
proc sql;
    create table id as
    select distinct id
    from have
    ;
quit;
```

or, if have is already sorted by id:

```
data id;
    set have (keep=id);
    by id;
    if first.id;   /* keep one row per ID */
run;
```

Separate that into two datasets:

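```
data train test;
    set id;
    /* Optional: fix the seed so the split is reproducible (12345 is arbitrary) */
    if _n_ = 1 then call streaminit(12345);
    /* Each ID goes to test with probability 0.3, otherwise to train */
    if rand('uniform') <= 0.3
    then output test;
    else output train;
run;
```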

Then you can merge back into your original dataset.

Depending on the state of your original dataset, you could create the lookup datasets by combining steps 3 & 4.
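The merge-back step is not spelled out above. One way it could look is an inner join of the original data with each ID list, sketched here with PROC SQL. The output names `train_rows` and `test_rows` are hypothetical; `have`, `train`, and `test` are the data sets from the earlier steps.

```sas
proc sql;
    /* All rows of HAVE whose ID was assigned to the training list */
    create table train_rows as
    select h.*
    from have as h
         inner join train as t
         on h.id = t.id;

    /* All rows of HAVE whose ID was assigned to the test list */
    create table test_rows as
    select h.*
    from have as h
         inner join test as s
         on h.id = s.id;
quit;
```

A DATA step MERGE with BY id would work as well, provided both inputs are sorted by id.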

Super User
Posts: 5,746

## Re: Randomly splitting data for training and data set for conditional logistic regression

While I wouldn't be surprised if PROC SURVEYSELECT can do this, you can certainly cut down the number of steps:

If not already sorted, start there:

    proc sort data=have;
        by id;
    run;

Then just a single step will split the data:

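    data train test;
        set have;
        by id;
        /* Decide once per ID, on its first row, which set it belongs to */
        if first.id then do;
            if ranuni(12345) < 0.7 then destination = 'train';
            else destination = 'test';
            retain destination;
        end;
        /* Every row of the ID follows the retained decision */
        if destination = 'train' then output train;
        else output test;
        drop destination;
    run;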

Super User
Posts: 10,219

## Re: Randomly splitting data for training and data set for conditional logistic regression

```data have;
do id=1 to 100;
do x=1 to 10;
output;
end;
end;
run;

data train test;
set have;
by id;
retain idx;
if first.id then idx=ceil(100*rand('uniform'));
if idx le 30 then output test;
else output train;
drop idx;
run;

```
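The random-number approaches in this thread assign each ID independently, so the resulting split is only approximately 70:30, whereas simple random sampling with a fixed rate in PROC SURVEYSELECT selects a fixed share of the IDs. A quick sanity check, sketched here under the assumption that `train` and `test` both still contain the `id` variable, counts the distinct IDs on each side and confirms that no ID appears in both:

```sas
proc sql;
    /* Number of distinct IDs in each split */
    select count(distinct id) as n_train_ids from train;
    select count(distinct id) as n_test_ids  from test;

    /* Should return 0: no ID may appear in both splits */
    select count(*) as n_overlap
    from (select distinct id from train) as a
         inner join
         (select distinct id from test) as b
         on a.id = b.id;
quit;
```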
