BookmarkSubscribeRSS Feed
csetzkorn
Lapis Lazuli | Level 10

I would like to split a given dataset into n stratified equal sized-ish folds by amending it with an additional column containing n. What is a common/simple way to achieve this? Thanks. 

11 REPLIES 11
Reeza
Super User

Generally, it's not a good idea. 

 

That being said, here's two write ups on it. 

 

1. http://www.sascommunity.org/wiki/Split_Data_into_Subsets

2. https://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

csetzkorn
Lapis Lazuli | Level 10
I came across these references when I googled. They do not seem to split the original dataset randomly but rather based of column values. I would like to split randomly never mind stratification.
Reeza
Super User

PROC SURVEYSELECT then? Choose N samples of X data?

csetzkorn
Lapis Lazuli | Level 10
Thanks. I thought about using PROC SURVEYSELECT. Can you please make an example for 3 folds? source dataset: Have, output datasets: wants1, wants2, wants3? Thanks
Reeza
Super User

It won't create multiple data sets but will do the random selection. Then you can use the methods above to split.

Or the manual way of adding a random numbers, sort by random number and use any of the methods in the link above.

csetzkorn
Lapis Lazuli | Level 10
That's fair enough but can you please show code that creates computed column with 1, 2 and 3 in it to indicate fold (see original question).
Reeza
Super User

You should have enough information and samples here to write the sample code yourself or at minimum provide sample data

ballardw
Super User

No guarantee about randomish result but

 

data want;

    set have;

    split = mod(_n_, 9);

run;

will add a variable that will split the data set into in 9 parts and the size difference will be plus/minus 1 between any groups. Replace 9 with your desired number.

If randomization critical then add a variable the result of a random number function, sort by that variable and then use the method above.

 

csetzkorn
Lapis Lazuli | Level 10
Thanks randomness is important
Ksharp
Super User
data have;
 set sashelp.heart;
 call streaminit(12345678);
 random=rand('uniform');
run;

proc rank data=have out=temp groups=3;
var random;
ranks group;
run;
data want1 want2 want3;
 set temp;
 if group=0 then output want1;
  else  if group=1 then output want2;
   else output want3;
run;
mkeintz
PROC Star

Using call rantbl, with regular updating of table probabilities will allow a single-step solution, creating the new variable subgroup.

 

data want (drop=_:);
  set have nobs=nrecs;

  array needed{10} _temporary_;
  array needprob{10} _temporary_;

  if _n_=1 then do;
    do _I=1 to dim(needed); 
      needed{_I}=floor(nrecs/dim(needed));
    end;
    do _I=1 to dim(needed) while (sum(of needed{*})<nrecs);
      needed{_I}=needed{_I}+1;
    end;
  end;

  _nleft = nrecs-(_n_-1);
  do _I=1 to dim(needed);
    needprob{_I}=needed{_I}/_nleft;
  end;

  seed=1250666;
  call rantbl(_seed,of needprob{*},subgroup);

  needed{subgroup}=needed{subgroup}-1;
run;

 

Notes:

  1. Changing the dimension of arrays NEEDED and NEEDPROB is all that's required to change the number of randomly populated subgroups.
  2. NEEDED tracks, for each subgroup, the number of observations yet to be added.  It's dynamically updated with every incoming observations.   The minimum and maximum starting values for NEEDED will differ by no more than one, and will start out summing to NRECS.
  3. NEEDPROB array is required by the CALL RANTBL routine.  It uses elements of NEEDED divided by the number of observations remaining to be assigned.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 2793 views
  • 0 likes
  • 5 in conversation