Hi All,
I am trying to split my dataset in 4 splits like 10%, 20% , 30%, 40%
Please help
Thanks
DInesh
Proc surveyselect is what you want:
proc surveyselect data=Customers method=srs n=100 out=SampleSRS; run;
You can do various methods of selecting:
Thanks RW9!!
Will this support multiple values in sample size like
samprate = (0.10 0.20 0.30 0.40)
Apparently so:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyselec...
You will get one dataset with a variable for which group, and you can use that to by group processing on.
data a b c d;
set sashelp.air;
call streaminit(123456780);
n=rand('table',.1,.2,.3);
if n=1 then output a;
else if n=2 then output b;
else if n=3 then output c;
else output d;
drop n;
run;
I haven't compared this with proc surveyselect, but was intrigued with @Ksharp's suggestion of using rand's table option.
Unfortunately, I didn't like the results it produced, as compared with taking matters in one's own hand. I'd suggest comparing the results of the following, as well as those obtained with proc surveyselect.
data forsample;
set sashelp.class;
randnum=rand('uniform');
run;
proc sort data=forsample;
by randnum;
run;
data asample10 asample20 asample30 asample40;
set forsample nobs=n;
if _n_ le round(n*.1) then output asample10;
else if _n_ le round(n*.3) then output asample20;
else if _n_ le round(n*.6) then output asample30;
else output asample40;
run;
data bsample10 bsample20 bsample30 bsample40;
set sashelp.class;
n=rand('table',.1,.2,.3,.4);
if n=1 then output bsample10;
else if n=2 then output bsample20;
else if n=3 then output bsample30;
else output bsample40;
drop n;
run;
Art, CEO, AnalystFinder.com
What did you not like about the results of @Ksharp's suggestion?
@mkeintz: I ran it a couple of times. In one sample10 had selected two obs and in the other it selected 0 obs. In those same two runs, sample20 selected 1 obs each time. I expected sample10 to always have at least one obs, and sample20 to have at least 3 obs.
Art, CEO, AnalystFinder.com
I think @art297 might say RAND('table') is not suited for small table .
If one want to guarantee exact sample proportions, then just update the ratios as samples are built. Using your rand/table approach:
data a b c d;
set sashelp.air nobs=n_avail;
call streaminit(123456780);
if _n_=1 then do;
array need {1:4} _temporary_;
need{1} = round(.1*n_avail);
need{2} = round(.2*n_avail);
need{3} = round(.3*n_avail);
need{4} = n_avail-sum(of need{*});
end;
n=rand('table',need{1}/n_avail,need{2}/n_avail,need{3}/n_avail);
if n=1 then output a;
else if n=2 then output b;
else if n=3 then output c;
else output d;
need{n}=need{n}-1;
n_avail+(-1);
drop n;
run;
@mkeintz: That would work and definitely faster than first having to assign random numbers and then sort the file. However, I would have thought that rand's table option would already have such logic built in. Obviously, it doesn't!
Art, CEO, AnalystFinder.com
"However, I would have thought that rand's table option would already have such logic built in."
But the RAND function is just a random number generator. It doesn't know, and should not assume, that I need to update the probabilities as I progress through the dataset.
It's essentially sampling with replacement (@Ksharp's original post) vs sampling without replacement per my suggestion.
@Ksharp: I just ran a test with 3.8 million records. My brute force method selected 380,000, 760,000, 1,140,000 and 1,520,000 records for the four samples. The table method, in turn, selected 380,174, 760,300, 1,141,326 and 1,518,200 records for the four samples.
Art, CEO, AnalystFinder.com
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.