How do I randomly split a dataset with 1399 unique observations into 2 datasets with 1000 vs. 399 obs?
PROC SURVEYSELECT is typically used for sample selection.
In your case, because you have two groups that are mutually exclusive, you can use the OUTALL option to output all records.
So you select a sample of 1000, and all records are output with a variable called SELECTED that indicates whether an observation is in the sample.
Illustrated here using SASHELP.STOCKS, with a sample of 300 and the remainder, 399, in the second group. No sorting is required; PROC CONTENTS and PROC FREQ are for illustrative purposes only.
You will want to set a SEED so that your sample is reproducible, i.e., if you run the exact same data through it again with the same seed, it will generate the same sample.
proc contents data=sashelp.stocks;
run;
proc surveyselect data=sashelp.stocks method=srs sampsize=300 out=sample_selected outall seed=50;
run;
proc freq data=sample_selected;
table selected;
run;
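To end up with two physical datasets rather than one flagged dataset, the OUTALL output can be split on the SELECTED variable in a follow-up DATA step. The following is a minimal sketch, assuming an input dataset named have with 1399 rows (built here the same way as elsewhere in this thread); the names flagged, want1000 and want399 are illustrative only.

```sas
/* Assumed input: 1399 unique observations (illustrative test data) */
data have;
  do ID = 1 to 1399;
    output;
  end;
run;

/* Flag a simple random sample of 1000; OUTALL keeps all 1399 rows */
proc surveyselect data=have method=srs sampsize=1000
                  out=flagged outall seed=50;
run;

/* Route each row by its Selected flag (1 = in sample, 0 = not) */
data want1000 want399;
  set flagged;
  if selected = 1 then output want1000;
  else output want399;
  drop selected;
run;
```

Dropping SELECTED is optional; keep it if you want a record of which group each row came from.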
data have;
do ID = 1 to 1399;
output;
end;
run;
data r1;
set have;
call streaminit(42);
r = rand('uniform');
run;
proc sort data = r1;
by r;
run;
data want1000 want399;
set r1;
drop r;
if _N_ <= 1000 then output want1000;
else output want399;
run;
One more, without sorting:
data have;
do ID = 1 to 1399;
output;
end;
run;
data want1000 want399;
call streaminit(42);
declare hash H();
H.defineKey("curobs");
H.defineDone();
do while(H.num_items<399);
curobs = rand('integer', 1, 1399);
H.replace();
end;
do until(eof);
set have end=eof curobs=curobs;
if H.check() then output want1000;
else output want399;
end;
stop;
run;
/* test */
proc sql;
select * from want1000
intersect
select * from want399
;
quit;
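The INTERSECT query above should return no rows if the two datasets are disjoint. As a complementary check, a quick count of each output confirms the sizes are 1000 and 399. This sketch assumes the want1000 and want399 datasets created above already exist; the column aliases are illustrative.

```sas
/* Count rows in each output dataset */
proc sql;
  select count(*) as n_want1000 from want1000;
  select count(*) as n_want399 from want399;
quit;
```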