How do I randomly split a dataset with 1399 unique observations into 2 datasets with 1000 vs. 399 obs?
PROC SURVEYSELECT is typically used for sample selection.
In your case, because you have two groups that are mutually exclusive, you can use the OUTALL option to output all records.
So you select a sample of 1000, but all records are output with a variable called SELECTED that indicates whether each record is in the sample.
Illustrated here using SASHELP.STOCKS (699 observations), with a sample of 300 and the remaining 399 in the second group. No sorting is required; PROC CONTENTS and PROC FREQ are for illustrative purposes only.
You will want to set a SEED so that your sample is reproducible, i.e., if you run the exact same data through it again with the same seed, it will generate the same sample.
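/* check the number of observations in the input data */
proc contents data=sashelp.stocks;
run;

/* simple random sample of 300; OUTALL keeps every record and adds SELECTED (1 = sampled, 0 = not) */
proc surveyselect data=sashelp.stocks method=srs sampsize=300 out=sample_selected outall seed=50;
run;

/* confirm the 300 / 399 split */
proc freq data=sample_selected;
    table selected;
run;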
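Applied to the 1399-row case from the question, the same approach might look like the sketch below. The dataset names (have, flagged, want1000, want399) and the ID variable are assumptions made up for illustration, and the final DATA step that splits on SELECTED is one way to finish the job, not part of the example above.

```sas
/* hypothetical stand-in for the real data: 1399 unique observations */
data have;
    do id = 1 to 1399;
        output;
    end;
run;

/* flag a simple random sample of 1000; OUTALL keeps all 1399 rows */
proc surveyselect data=have method=srs sampsize=1000
                  out=flagged outall seed=50 noprint;
run;

/* route each row by the SELECTED flag (1 = in the sample, 0 = not) */
data want1000 want399;
    set flagged;
    drop selected;
    if selected = 1 then output want1000;
    else output want399;
run;
```

Because every row lands in exactly one of the two output datasets, the groups are mutually exclusive by construction, and rerunning with the same seed on the same data should reproduce the same split.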
@Denali wrote:
How do I randomly split a dataset with 1399 unique observations into 2 datasets with 1000 vs. 399 obs?
data have;
do ID = 1 to 1399;
output;
end;
run;
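/* attach a uniform random number to every observation */
data r1;
    set have;
    call streaminit(42);   /* fixed seed so the split is reproducible */
    r = rand('uniform');
run;

/* sorting by the random number puts the observations in random order */
proc sort data = r1;
    by r;
run;

/* the first 1000 observations go to want1000, the remaining 399 to want399 */
data want1000 want399;
    set r1;
    drop r;
    if _N_ <= 1000 then output want1000;
    else output want399;
run;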
One more, without sorting.
data have;
do ID = 1 to 1399;
output;
end;
run;
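data want1000 want399;
    call streaminit(42);   /* fixed seed so the split is reproducible */
    /* hash table holding the observation numbers chosen for want399 */
    declare hash H();
    H.defineKey("curobs");
    H.defineDone();
    /* draw until 399 distinct numbers are stored; replace() overwrites a repeated key, so repeats add nothing */
    do while(H.num_items<399);
        curobs = rand('integer', 1, 1399);   /* the 'integer' distribution needs SAS 9.4M5 or later */
        H.replace();
    end;
    /* one pass over the data: check() returns 0 when curobs is in the hash */
    do until(eof);
        set have end=eof curobs=curobs;
        if H.check() then output want1000;
        else output want399;
    end;
    stop;
run;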
/* test: the two datasets should share no rows, so this query should return nothing */
proc sql;
select * from want1000
intersect
select * from want399
;
quit;
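The intersect test only shows that the two datasets do not overlap. A complementary check, sketched here on the assumption that want1000 and want399 were created by one of the steps above, is to confirm the row counts as well:

```sas
/* count the rows in each output dataset; 1000 and 399 are the expected sizes */
proc sql;
    select 'want1000' as dataset, count(*) as n_obs from want1000
    union all
    select 'want399', count(*) from want399
    ;
quit;
```

Together, no shared rows plus counts that sum to 1399 indicate that every observation landed in exactly one group.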