Split dataset into n folds

12-12-2017 10:58 AM

I would like to split a given dataset into n stratified, roughly equal-sized folds by adding a column that contains the fold number (1 to n). What is a common/simple way to achieve this? Thanks.

Posted in reply to csetzkorn

12-12-2017 11:09 AM

Generally, it's not a good idea.

That being said, here are two write-ups on it.

1. http://www.sascommunity.org/wiki/Split_Data_into_Subsets

2. https://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

Posted in reply to Reeza

12-12-2017 11:28 AM

I came across these references when I googled. They do not seem to split the original dataset randomly, but rather based on column values. I would like to split randomly; never mind the stratification.

Posted in reply to csetzkorn

12-12-2017 11:30 AM

PROC SURVEYSELECT then? Choose N samples of X data?

Posted in reply to Reeza

12-12-2017 11:50 AM

Thanks. I thought about using PROC SURVEYSELECT. Could you please give an example for 3 folds, with source dataset Have and output datasets wants1, wants2, wants3? Thanks.

Posted in reply to csetzkorn

12-12-2017 11:57 AM

It won't create multiple data sets but will do the random selection. Then you can use the methods above to split.

Or the manual way: add a random number, sort by that random number, and then use any of the methods in the links above.
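As a rough sketch of the PROC SURVEYSELECT route: recent SAS/STAT releases provide a GROUPS= option that randomly assigns every observation to one of n near-equal groups and, as far as I know, records the assignment in a variable named GroupID (numbered from 1). Whether your installation supports GROUPS= is an assumption to check; the seed value and the intermediate data set name `assigned` are placeholders.

```sas
/* Randomly assign each observation of HAVE to one of 3 groups.      */
/* Assumes a SAS/STAT release that supports the GROUPS= option;      */
/* the output data set is expected to carry the assignment in GroupID. */
proc surveyselect data=have groups=3 seed=12345 out=assigned;
run;

/* Optional: split on GroupID if separate data sets are really needed */
data wants1 wants2 wants3;
  set assigned;
  if GroupID = 1 then output wants1;
  else if GroupID = 2 then output wants2;
  else output wants3;
run;
```

If only a fold column is wanted, the first step alone should be enough, with GroupID serving as that column. Adding a STRATA statement (on data sorted by the strata variable) should make the assignment happen within each stratum, which is closer to the stratified folds asked for originally.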

Posted in reply to Reeza

12-12-2017 12:32 PM

That's fair enough, but can you please show code that creates a computed column with 1, 2 and 3 in it to indicate the fold (see original question)?

Posted in reply to csetzkorn

12-12-2017 12:32 PM

You should have enough information and samples here to write the sample code yourself or, at minimum, provide sample data.

Posted in reply to csetzkorn

12-12-2017 12:24 PM

No guarantee that the result is random-ish, but

```
data want;
  set have;
  split = mod(_n_, 9);
run;
```

will add a variable (with values 0 through 8) that splits the data set into 9 parts, and the sizes of any two groups will differ by at most 1. Replace 9 with your desired number.

If randomization is critical, then add a variable holding the result of a random number function, sort by that variable, and then use the method above.
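The randomized variant described above could look roughly like this. It is a minimal sketch rather than tested production code: the seed, the intermediate data set `shuffled`, the helper variable `_rand`, the macro variable `nfolds` and the output variable `fold` are all names chosen for illustration.

```sas
/* Step 1: attach a uniform random number to each observation */
data shuffled;
  set have;
  if _n_ = 1 then call streaminit(2017);  /* arbitrary seed, for reproducibility */
  _rand = rand('uniform');
run;

/* Step 2: put the observations in random order */
proc sort data=shuffled;
  by _rand;
run;

/* Step 3: deal the shuffled rows into folds 1..nfolds in rotation */
%let nfolds = 3;
data want (drop=_rand);
  set shuffled;
  fold = mod(_n_ - 1, &nfolds) + 1;
run;
```

Using `mod(_n_ - 1, n) + 1` instead of `mod(_n_, n)` numbers the folds from 1 rather than 0, which matches the 1, 2, 3 column requested earlier in the thread. Note that the output is left in shuffled order.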

Posted in reply to ballardw

12-12-2017 12:32 PM

Thanks. Randomness is important.

Posted in reply to csetzkorn

12-13-2017 08:17 AM

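```
/* Attach a uniform random number to every observation */
data have;
  set sashelp.heart;
  if _n_ = 1 then call streaminit(12345678);  /* seed once for reproducibility */
  random = rand('uniform');
run;

/* GROUPS=3 bins the random values into three near-equal groups numbered 0, 1, 2 */
proc rank data=have out=temp groups=3;
  var random;
  ranks group;
run;

/* Route each group to its own data set */
data want1 want2 want3;
  set temp;
  if group = 0 then output want1;
  else if group = 1 then output want2;
  else output want3;
run;
```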

Posted in reply to csetzkorn

12-13-2017 02:25 PM

Using CALL RANTBL and updating the table probabilities after every observation allows a single-step solution that creates the new variable subgroup.

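```
data want (drop=_:);
  set have nobs=nrecs;
  array needed{10} _temporary_;    /* observations still needed per subgroup */
  array needprob{10} _temporary_;  /* current selection probability per subgroup */
  retain _seed 1250666;            /* set once: CALL RANTBL updates the seed on every call */
  if _n_=1 then do;
    /* start every subgroup at floor(nrecs/n), then hand out the remainder one by one */
    do _I=1 to dim(needed);
      needed{_I}=floor(nrecs/dim(needed));
    end;
    do _I=1 to dim(needed) while (sum(of needed{*})<nrecs);
      needed{_I}=needed{_I}+1;
    end;
  end;
  _nleft = nrecs-(_n_-1);          /* observations not yet assigned, including this one */
  do _I=1 to dim(needed);
    needprob{_I}=needed{_I}/_nleft;
  end;
  call rantbl(_seed,of needprob{*},subgroup);
  needed{subgroup}=needed{subgroup}-1;
run;
```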

Notes:

- Changing the dimension of arrays NEEDED and NEEDPROB is all that's required to change the number of randomly populated subgroups.
- NEEDED tracks, for each subgroup, the number of observations yet to be added. It's dynamically updated with every incoming observation. The minimum and maximum starting values for NEEDED will differ by no more than one, and will start out summing to NRECS.
- NEEDPROB array is required by the CALL RANTBL routine. It uses elements of NEEDED divided by the number of observations remaining to be assigned.
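One quick way to check the claim about subgroup sizes is to tabulate the new variable. This is only a suggested check, assuming the step above ran and produced WANT with the variable subgroup:

```sas
/* Count observations per subgroup; the counts should differ by at most one */
proc freq data=want;
  tables subgroup / nocum nopercent;
run;
```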