topic Re: Randomly splitting data for training and data set for conditional logistic regression in Statistical Procedures

Randomly splitting data for training and data set for conditional logistic regression

acma — Fri, 28 Oct 2016 10:02:32 GMT

Hi all,

I have a big data set for conditional logistic regression where I want to split it into two sets: train and test. Data format as follow:

ID Y X

1 1 10

1 0 12

1 0 13

2 0 20

2 1 5

10000 0 11

10000 0 8

10000 1 16

10000 0 14

What I want is randomly pick ID with a ratio say, 7:3 on 10000 ID for train:test, and obtaining all the rows with the same ID.

Menawhile, how can I compute the predicted probability after running proc logistic procedure with strata ID ?

Thank you for your kind assistant.

Re: Randomly splitting data for training and data set for conditional logistic regression

Kurt_Bremser — Fri, 28 Oct 2016 10:29:27 GMT

First, build a table with distinct IDs

proc sort data=have (keep=id) out=id nodupkey;
by id;
run;

proc sql;
create table id as
select distinct id
from have
;
quit;

or, if have is already sorted

data id;
set have (keep=id);
by if;
if first.id;
run;

Separate that into two datasets:

data train test;
set id;
if rand('uniform') <= 0.3
then output test;
else output train;
run;

Then you can merge back into your original dataset.

Depending on the state of your original dataset, you could create the lookup datasets by combining steps 3 & 4.

Re: Randomly splitting data for training and data set for conditional logistic regression

Astounding — Fri, 28 Oct 2016 11:47:12 GMT

While I wouldn't be surprised if PROC SURVEYSELECT can do this, you can certainly cut down the number of steps:

If not already sorted, start there:

proc sort data=want;

by id;

run;

Then just a single step will split the data:

data train test;

set have;

by id;

if first.id then do;

if ranuni(12345) < 0.7 then destination = 'train';

else destination = 'test';

retain destination;

end;

if destination = 'train' then output train;

else output test;

drop destination;

run;

Re: Randomly splitting data for training and data set for conditional logistic regression

Ksharp — Sat, 29 Oct 2016 05:45:56 GMT

data have;
 do id=1 to 100;
  do x=1 to 10;
   output;
  end;
 end;
run;

data train test;
 set have;
 by id;
 retain idx;
 if first.id then idx=ceil(100*rand('uniform'));
 if idx le 30 then output test;
  else output train;
drop idx;
run;

Re: Randomly splitting data for training and data set for conditional logistic regression

Zard — Wed, 11 Oct 2017 16:16:59 GMT

"What I want is randomly pick ID with a ratio say, 7:3 on 10000 ID for train:test, and obtaining all the rows with the same ID."

You can do this directly with PROC SURVEYSELECT now, using the SAMPLINGUNIT statement. For example:

proc surveyselect data=have out=want method=srs samprate=0.70

outall seed=12345 noprint;

samplingunit id;

run;

The OUTALL option outputs both the selected and unselected units. The automatic output variable SELECTED equals 1 for the selected units and 0 for the unselected units. In this case, the units are the ID's. 70% of the ID values are randomly selected, and each sample ID includes all the observations for that ID value.