Solved: Re: Randomly splitting data for training and data set for conditional ...

acma · Posted 10-28-2016 06:02 AM

Hi all,

I have a large data set for conditional logistic regression that I want to split into two sets, train and test. The data format is as follows:

ID     Y  X
1      1  10
1      0  12
1      0  13
2      0  20
2      1  5
.
10000  0  11
10000  0  8
10000  1  16
10000  0  14

What I want is to randomly pick IDs in a ratio of, say, 7:3 across the 10000 IDs for train:test, and to keep all the rows with the same ID together.

Meanwhile, how can I compute the predicted probabilities after running PROC LOGISTIC with STRATA ID?

Thank you for your kind assistance.

Zard · Posted 10-11-2017 12:15 PM

"What I want is randomly pick ID with a ratio say, 7:3 on 10000 ID for train:test, and obtaining all the rows with the same ID."

You can do this directly with PROC SURVEYSELECT now, using the SAMPLINGUNIT statement. For example:

proc surveyselect data=have out=want method=srs samprate=0.70
                  outall seed=12345 noprint;
  samplingunit id;
run;

The OUTALL option outputs both the selected and unselected units. The automatic output variable SELECTED equals 1 for the selected units and 0 for the unselected units. In this case, the units are the IDs: 70% of the ID values are randomly selected, and each sampled ID includes all the observations for that ID value.
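If two separate data sets are needed, one possible follow-up step is sketched below. It assumes the output data set is named want, as in the example above, and that the flag variable is SELECTED as described; train and test are just illustrative output names.

```sas
/* Split the SURVEYSELECT output on the Selected flag (sketch) */
data train test;
  set want;
  if selected = 1 then output train;  /* the sampled 70% of IDs */
  else output test;                   /* the remaining 30% of IDs */
  drop selected;
run;
```

Because selection is done per ID, every row of a given ID ends up in the same data set.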


Kurt_Bremser · Posted 10-28-2016 06:28 AM

First, build a table with distinct IDs

proc sort data=have (keep=id) out=id nodupkey;
by id;
run;

or

proc sql;
create table id as
select distinct id
from have
;
quit;

or, if have is already sorted by id

data id;
set have (keep=id);
by id;
if first.id; /* keep only the first row of each ID */
run;

Separate that into two datasets:

data train test;
set id;
if rand('uniform') <= 0.3
then output test;
else output train;
run;

Then you can merge back into your original dataset.

Depending on the state of your original dataset (for example, if it is already sorted by id), you could create the lookup datasets in a single step by combining the first.id data step with the random split.
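The merge-back step could look like the following sketch. It assumes the ID lists are named train and test as above; train_data and test_data are hypothetical names for the resulting full data sets. PROC SQL is used here so that have does not need to be sorted first.

```sas
proc sql;
  /* All rows of have whose id was assigned to the training list */
  create table train_data as
  select h.*
  from have as h
       inner join train as t
       on h.id = t.id;

  /* All rows of have whose id was assigned to the test list */
  create table test_data as
  select h.*
  from have as h
       inner join test as t
       on h.id = t.id;
quit;
```

Since each ID appears in exactly one of the two lists, every row of have lands in exactly one of the two output tables.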


Astounding · Posted 10-28-2016 07:47 AM

While I wouldn't be surprised if PROC SURVEYSELECT can do this, you can certainly cut down the number of steps:

If not already sorted, start there:

proc sort data=have;
  by id;
run;

Then just a single step will split the data:

data train test;
  set have;
  by id;
  /* Draw once per ID so that all rows of an ID go to the same data set */
  if first.id then do;
    if ranuni(12345) < 0.7 then destination = 'train';
    else destination = 'test';
    retain destination;
  end;
  if destination = 'train' then output train;
  else output test;
  drop destination;
run;
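Note that each ID is drawn independently here, so the split is only approximately 70:30. A quick way to check the result might be the sketch below, which assumes the train and test data sets created by the step above; part and n_ids are illustrative column names.

```sas
proc sql;
  /* Number of distinct IDs that ended up in each data set */
  select 'train' as part, count(distinct id) as n_ids from train
  union all
  select 'test' as part, count(distinct id) as n_ids from test;
quit;
```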

Ksharp · Posted 10-29-2016 01:45 AM

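/* Example data: 100 IDs with 10 rows each */
data have;
 do id=1 to 100;
  do x=1 to 10;
   output;
  end;
 end;
run;

/* One random integer from 1 to 100 per ID; IDs that draw 30 or less go to test */
data train test;
 set have;
 by id;
 retain idx;
 if first.id then idx=ceil(100*rand('uniform'));
 if idx le 30 then output test;
  else output train;
 drop idx;
run;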

