How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Reply
Occasional Contributor
Posts: 5

How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Hi All,

 

I am trying to split my dataset in 4 splits like 10%, 20% , 30%, 40%

 

Please help

 

Thanks

DInesh

Super User
Super User
Posts: 9,599

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Posted in reply to dinesh_ltjd2

Proc surveyselect is what you want:

proc surveyselect data=Customers
   method=srs n=100 out=SampleSRS;
run;

You can do various methods of selecting:

https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveyselec...

Occasional Contributor
Posts: 5

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Thanks RW9!!

Will this support multiple values in sample size like 

 

samprate = (0.10 0.20 0.30 0.40)
Super User
Super User
Posts: 9,599

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Posted in reply to dinesh_ltjd2

Apparently so:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyselec...

 

You will get one dataset with a variable for which group, and you can use that to by group processing on.

Super User
Posts: 10,766

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Posted in reply to dinesh_ltjd2
data a b c d;
 set sashelp.air;
 call streaminit(123456780);
 n=rand('table',.1,.2,.3);
 if n=1 then output a;
  else if n=2 then output b;
   else if n=3 then output c;
    else output d;
drop n;
run;
PROC Star
Posts: 8,163

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Posted in reply to dinesh_ltjd2

I haven't compared this with proc surveyselect, but was intrigued with @Ksharp's suggestion of using rand's table option.

 

Unfortunately, I didn't like the results it produced, as compared with taking matters in one's own hand. I'd suggest comparing the results of the following, as well as those obtained with proc surveyselect.

data forsample;
  set sashelp.class;
  randnum=rand('uniform');
run;

proc sort data=forsample;
  by randnum;
run;

data asample10 asample20 asample30 asample40;
  set forsample nobs=n;
  if _n_ le round(n*.1) then output asample10;
  else if _n_ le round(n*.3) then output asample20;
  else if _n_ le round(n*.6) then output asample30;
  else output asample40;
run;
  
data bsample10 bsample20 bsample30 bsample40;
  set sashelp.class;
  n=rand('table',.1,.2,.3,.4);
  if n=1 then output bsample10;
  else if n=2 then output bsample20;
  else if n=3 then output bsample30;
  else output bsample40;
  drop n;
run;

Art, CEO, AnalystFinder.com

 

Trusted Advisor
Posts: 1,337

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

@art297

 

What did you not like about the results of  @Ksharp's suggestion?

 

Occasional Contributor
Posts: 5

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

I am still working on my code based on suggestions by @Ksharp & @art297...

Thanks
DInesh
PROC Star
Posts: 8,163

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

Posted in reply to dinesh_ltjd2

@mkeintz: I ran it a couple of times. In one sample10 had selected two obs and in the other it selected 0 obs. In those same two runs, sample20 selected 1 obs each time. I expected sample10 to always have at least one obs, and sample20 to have at least 3 obs.

 

Art, CEO, AnalystFinder.com

 

Super User
Posts: 10,766

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

I think @art297 might say RAND('table') is not suited for small table .

Occasional Contributor
Posts: 5

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

My dataset is quite big with ~18MM observations
Trusted Advisor
Posts: 1,337

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

 

@Ksharp:

 

 

If one want to guarantee exact sample proportions, then just update the ratios as samples are built.  Using your rand/table approach:

 

data a b c d;
 set sashelp.air  nobs=n_avail;
 call streaminit(123456780);
 if _n_=1 then do;
   array need {1:4} _temporary_;
   need{1} = round(.1*n_avail);
   need{2} = round(.2*n_avail);
   need{3} = round(.3*n_avail);
   need{4} = n_avail-sum(of need{*});
 end;

 n=rand('table',need{1}/n_avail,need{2}/n_avail,need{3}/n_avail);
 if n=1 then  output a; 
  else if n=2 then  output b; 
   else if n=3 then   output c;
    else output d;
 need{n}=need{n}-1;
 n_avail+(-1);
drop n;
run;
PROC Star
Posts: 8,163

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

@mkeintz: That would work and definitely faster than first having to assign random numbers and then sort the file. However, I would have thought that rand's table option would already have such logic built in. Obviously, it doesn't!

 

Art, CEO, AnalystFinder.com

 

Trusted Advisor
Posts: 1,337

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

@art297

 

"However, I would have thought that rand's table option would already have such logic built in."

 

But the RAND function is just a random number generator.  It doesn't know, and should not assume, that I need to update the probabilities as I progress through the dataset. 

 

It's essentially sampling with replacement (@Ksharp's original post) vs sampling without replacement per my suggestion.

 

 

PROC Star
Posts: 8,163

Re: How do I randomly split my dataset ib 4 parts like 10%, 20% , 30%, 40%

@Ksharp: I just ran a test with 3.8 million records. My brute force method selected 380,000, 760,000, 1,140,000 and 1,520,000 records for the four samples. The table method, in turn, selected 380,174, 760,300, 1,141,326 and 1,518,200 records for the four samples.

 

Art, CEO, AnalystFinder.com

 

Ask a Question
Discussion stats
  • 14 replies
  • 172 views
  • 3 likes
  • 5 in conversation