dinesh_ltjd2
Calcite | Level 5

Hi All,

I am trying to split my dataset into 4 parts containing 10%, 20%, 30%, and 40% of the observations.

Please help.

Thanks,

Dinesh

14 REPLIES
RW9
Diamond | Level 26

Proc surveyselect is what you want:

proc surveyselect data=Customers
   method=srs n=100 out=SampleSRS;
run;

You can do various methods of selecting:

https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveyselec...

dinesh_ltjd2
Calcite | Level 5

Thanks RW9!!

Will this support multiple values for the sampling rate, like

samprate = (0.10 0.20 0.30 0.40)
RW9
Diamond | Level 26

Apparently so:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyselec...

 

You will get one dataset with a variable identifying the group, and you can use that variable for BY-group processing.
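
If the goal is to assign every row to one of several groups rather than to draw a single sample, one possible route is the GROUPS= option. This is only a sketch under assumptions: it presumes a SAS/STAT release in which PROC SURVEYSELECT supports GROUPS= (it is absent from older releases), and that the listed values are group sizes (counts) that sum to the number of rows. The seed and the output data set name are made up for illustration.

```sas
/* Sketch: assign the 19 rows of sashelp.class to 4 groups of roughly
   10/20/30/40 percent. The sizes are counts and must sum to 19. */
proc surveyselect data=sashelp.class
   groups=(2 4 6 7)      /* assumed: GROUPS= is available in your release */
   seed=12345            /* illustrative seed for reproducibility */
   out=class_groups;     /* hypothetical output data set name */
run;

/* Inspect the result; the group-identifier variable added by the
   procedure can then drive BY-group processing. */
proc print data=class_groups;
run;
```

For a large table the counts would have to be computed from the row count first, for example with ROUND as in the DATA step approaches in this thread.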

Ksharp
Super User
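data a b c d;
 set sashelp.air;
 call streaminit(123456780);   /* fixed seed so the split is reproducible */
 n=rand('table',.1,.2,.3);     /* returns 1, 2 or 3 with these probabilities, else 4 (remaining .4) */
 if n=1 then output a;
 else if n=2 then output b;
 else if n=3 then output c;
 else output d;
 drop n;
run;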
art297
Opal | Level 21

I haven't compared this with proc surveyselect, but was intrigued with @Ksharp's suggestion of using rand's table option.

 

Unfortunately, I didn't like the results it produced, as compared with taking matters into one's own hands. I'd suggest comparing the results of the following, as well as those obtained with proc surveyselect.

data forsample;
  set sashelp.class;
  randnum=rand('uniform');
run;

proc sort data=forsample;
  by randnum;
run;

data asample10 asample20 asample30 asample40;
  set forsample nobs=n;
  /* cutoffs are cumulative: first 10%, then up to 30%, then up to 60%, rest is the 40% split */
  if _n_ le round(n*.1) then output asample10;
  else if _n_ le round(n*.3) then output asample20;
  else if _n_ le round(n*.6) then output asample30;
  else output asample40;
run;
  
data bsample10 bsample20 bsample30 bsample40;
  set sashelp.class;
  n=rand('table',.1,.2,.3,.4);
  if n=1 then output bsample10;
  else if n=2 then output bsample20;
  else if n=3 then output bsample30;
  else output bsample40;
  drop n;
run;

Art, CEO, AnalystFinder.com

 

mkeintz
PROC Star

@art297

 

What did you not like about the results of @Ksharp's suggestion?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
dinesh_ltjd2
Calcite | Level 5
I am still working on my code based on suggestions by @Ksharp & @art297...

Thanks
Dinesh
art297
Opal | Level 21

@mkeintz: I ran it a couple of times. In one run, sample10 selected two obs, and in the other it selected 0 obs. In those same two runs, sample20 selected 1 obs each time. I expected sample10 to always have at least one obs, and sample20 to have at least 3 obs.

 

Art, CEO, AnalystFinder.com

 

Ksharp
Super User

I think @art297 might say RAND('table') is not suited to small tables.
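
A rough way to see why small tables behave this way, assuming each row is assigned independently: the number of rows landing in the 10% split is binomial. For the 19 rows of sashelp.class, the sketch below computes the chance that this split ends up empty (the variable names are illustrative).

```sas
data _null_;
  n_obs = 19;                                /* rows in sashelp.class */
  p_empty = pdf('binomial', 0, 0.1, n_obs);  /* P(no rows land in the 10% split) */
  put p_empty=;                              /* written to the log */
run;
```

That probability is 0.9**19, about 0.135, so an empty 10% split should turn up in roughly one run in seven.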

dinesh_ltjd2
Calcite | Level 5
My dataset is quite big, with ~18 million observations.
mkeintz
PROC Star

 

@Ksharp:

If one wants to guarantee exact sample proportions, then just update the ratios as the samples are built. Using your rand/table approach:


data a b c d;
 set sashelp.air  nobs=n_avail;
 call streaminit(123456780);
 if _n_=1 then do;
   /* target size of each split; the last one takes the remainder */
   array need {1:4} _temporary_;
   need{1} = round(.1*n_avail);
   need{2} = round(.2*n_avail);
   need{3} = round(.3*n_avail);
   need{4} = n_avail-sum(of need{*});
 end;

 /* probability of each split = rows it still needs / rows still unassigned */
 n=rand('table',need{1}/n_avail,need{2}/n_avail,need{3}/n_avail);
 if n=1 then output a;
 else if n=2 then output b;
 else if n=3 then output c;
 else output d;
 need{n}=need{n}-1;   /* the chosen split needs one row fewer */
 n_avail+(-1);        /* one row fewer left to assign */
 drop n;
run;
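
One way to check the resulting sizes, as a sketch that assumes the step above has just run and written A, B, C and D to the WORK library. For the 144 rows of sashelp.air, the targets computed in the step are 14, 29, 43, and 58.

```sas
/* List the row count of each split from the SAS dictionary tables */
proc sql;
  select memname, nobs
    from dictionary.tables
    where libname = 'WORK' and memname in ('A', 'B', 'C', 'D');
quit;
```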
art297
Opal | Level 21

@mkeintz: That would work, and it would definitely be faster than first having to assign random numbers and then sort the file. However, I would have thought that rand's table option would already have such logic built in. Obviously, it doesn't!

 

Art, CEO, AnalystFinder.com

 

mkeintz
PROC Star

@art297

 

"However, I would have thought that rand's table option would already have such logic built in."

 

But the RAND function is just a random number generator.  It doesn't know, and should not assume, that I need to update the probabilities as I progress through the dataset. 

 

It's essentially sampling with replacement (@Ksharp's original post) vs sampling without replacement per my suggestion.

 

 

art297
Opal | Level 21

@Ksharp: I just ran a test with 3.8 million records. My brute force method selected 380,000, 760,000, 1,140,000 and 1,520,000 records for the four samples. The table method, in turn, selected 380,174, 760,300, 1,141,326 and 1,518,200 records for the four samples.
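
For scale, if each record is assigned independently, the count in each split is binomial with standard deviation sqrt(N*p*(1-p)). A small sketch of that arithmetic for N = 3.8 million (the variable names are illustrative):

```sas
data _null_;
  n_obs = 3800000;                          /* records in the test */
  do prob = 0.1, 0.2, 0.3, 0.4;             /* target share of each split */
    expected = n_obs * prob;                /* expected count */
    sd = sqrt(n_obs * prob * (1 - prob));   /* binomial standard deviation */
    put prob= expected= sd=;                /* written to the log */
  end;
run;
```

The standard deviations come out at roughly 585, 780, 893, and 955 records, so the table method's deviations from the exact targets here (174, 300, 1,326 and 1,800) all fall within about two standard deviations.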

 

Art, CEO, AnalystFinder.com

 
