acma
Calcite | Level 5

Hi all,

 

I have a large data set for conditional logistic regression that I want to split into two sets, train and test. The data format is as follows:

ID     Y  X
1      1  10
1      0  12
1      0  13
2      0  20
2      1  5
...
10000  0  11
10000  0  8
10000  1  16
10000  0  14

What I want is to randomly pick IDs in a ratio of, say, 7:3 (train:test) across the 10000 IDs, keeping all the rows for each selected ID.

 

Meanwhile, how can I compute the predicted probability after running PROC LOGISTIC with a STRATA ID statement?

 

Thank you for your kind assistance.

 

1 ACCEPTED SOLUTION

Zard
SAS Employee

"What I want is to randomly pick IDs in a ratio of, say, 7:3 (train:test) across the 10000 IDs, keeping all the rows for each selected ID."

 

You can do this directly with PROC SURVEYSELECT now, using the SAMPLINGUNIT statement. For example:

 

proc surveyselect data=have out=want method=srs samprate=0.70
                  outall seed=12345 noprint;
  samplingunit id;
run;

 

The OUTALL option outputs both the selected and unselected units. The automatic output variable SELECTED equals 1 for the selected units and 0 for the unselected units. In this case, the units are the IDs: 70% of the ID values are randomly selected, and each sampled ID includes all the observations for that ID value.
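As a minimal sketch of the follow-up step, assuming the output data set is named want as in the example above, the Selected flag can be used to route rows into two data sets. The names train and test are illustrative, not something the procedure creates.

```sas
/* Sketch: split the OUTALL output by the Selected flag.          */
/* Assumes WANT was created by the PROC SURVEYSELECT step above.  */
data train test;
  set want;
  if selected = 1 then output train;  /* rows of the sampled IDs (about 70% of IDs) */
  else output test;                   /* rows of the remaining IDs                  */
  drop selected;
run;
```

Because sampling is done on ID rather than on rows, the two data sets hold roughly 70% and 30% of the IDs, and the row counts will only match that ratio if the IDs have similar numbers of observations.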


4 REPLIES 4
Kurt_Bremser
Super User

First, build a table with distinct IDs:

proc sort data=have (keep=id) out=id nodupkey;
by id;
run;

or

proc sql;
create table id as
select distinct id
from have
;
quit;

or, if have is already sorted by id:

data id;
set have (keep=id);
by id;
if first.id;
run;

Separate that into two datasets:

data train test;
set id;
if rand('uniform') <= 0.3
then output test;
else output train;
run;

Then you can merge back into your original dataset.

Depending on the state of your original dataset (for example, if it is already sorted by id), you could create the lookup datasets in a single pass by combining the first.id data step with the split step.
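The merge-back step mentioned above is not spelled out, so here is one way it might look. This is a sketch that assumes have, train and test are all sorted by id (the lookup tables will be if they were built from a sorted id table); train_full and test_full are hypothetical output names.

```sas
/* Sketch: attach the train/test assignment back onto every row of HAVE. */
/* Assumes HAVE, TRAIN and TEST are sorted by id; TRAIN and TEST hold    */
/* one row per id, as produced by the split step above.                  */
data train_full test_full;
  merge have  (in=in_have)
        train (in=in_train)
        test  (in=in_test);
  by id;
  if in_have;                               /* keep only rows that exist in HAVE    */
  if in_train then output train_full;       /* all rows of an ID assigned to train  */
  else if in_test then output test_full;    /* all rows of an ID assigned to test   */
run;
```

In a match-merge the IN= flags stay set for every row of the BY group, so each ID's rows all go to the same output data set.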

Astounding
PROC Star

While I wouldn't be surprised if PROC SURVEYSELECT can do this, you can certainly cut down the number of steps:

 

If the data set is not already sorted, start here:

proc sort data=have;
by id;
run;

 

Then just a single step will split the data:

 

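data train test;
set have;
by id;
* decide once per ID, on its first row, and keep the decision for the rest;
if first.id then do;
   if ranuni(12345) < 0.7 then destination = 'train';
   else destination = 'test';
   retain destination;
end;
* every row of the same ID goes to the same data set;
if destination = 'train' then output train;
else output test;
drop destination;
run;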

Ksharp
Super User
data have;
 do id=1 to 100;
  do x=1 to 10;
   output;
  end;
 end;
run;

data train test;
 set have;
 by id;
 retain idx;
 if first.id then idx=ceil(100*rand('uniform'));
 if idx le 30 then output test;
  else output train;
drop idx;
run;


