## Splitting a dataset into equal groups, but with a by statement

Solved
Occasional Contributor
Posts: 7

# Splitting a dataset into equal groups, but with a by statement

Hallo,

i came accross the macro "split2" which splits a big dataset into subgroups. The macro is also included below the message, see also http://www2.sas.com/proceedings/sugi28/075-28.pdf.

It works fine to split my dataset ANALYSIS into subgroups named SAMPLE_I, with the number of I's (num) predetermined.
However, i try to carry out simulation runs, with each simulation run containing its own ANALYSIS dataset. So i need to run the same macro for each simulation run, separately.

i know i need to use a BY statement, but i was not able to include the BY RUN correctly.

Anyone knows how to adjust the macro below by including the BY RUN statement?

Any help is really appreciated.

%macro split2(num);
data _null_;
if 0 then set ANALYSIS nobs=count;
call symput('numobs',put(count,8.));
run;
%let n=%sysevalf(&numobs/&num,ceil);
data %do J=1 %to &num ; SAMPLE_&J %end; ;
set ANALYSIS;
%do I=1 %to &num;
if %eval(&n*(&i-1)) <_n_ <= %eval(&n*&I)
then output SAMPLE_&I;
%end;
run;
%mend split2;

%split2(&GROUPS);

Accepted Solutions
Solution
‎11-06-2017 01:42 PM
Occasional Contributor
Posts: 7

## Re: Splitting a dataset into equal groups, but with a by statement

[ Edited ]

For those who are interested: i managed to find some kind of solution by adjusting the macro "SPLIT2", see below. This results in an extra index for simulation run, beside the number of the split dataset of course. You later combine all using ARRAYME.

Thanks everybody for the effort.

%DO SIM = 1 %TO &SIMS;

DATA ANALYSIS_&SIM;
SET ANALYSIS;
WHERE SIM = &SIM;
RUN;

%END;

%macro split2(num,SIMS);
%DO SIM = 1 %TO &SIMS;
data _null_;
if 0 then set ANALYSIS_&SIM nobs=count;
call symput('numobs',put(count,8.));
run;
%let n=%sysevalf(&numobs/&num,ceil);
data %do J=1 %to &num ;
SAMPLE_&SIM&J %end; ;
set ANALYSIS_&SIM;
%do I=1 %to &num;
if %eval(&n*(&i-1)) <_n_ <= %eval(&n*&I)
then output SAMPLE_&SIM&I;
%end;
run;
%END;

%mend split2;

All Replies
Super User
Posts: 9,855

## Re: Splitting a dataset into equal groups, but with a by statement

Why do you need to split your data out?  Can you not just assign various flags within the data you have now, and then use the flags to where clause the data out as needed.  In that way you can also add group variables so you can give by groups as well.  Its rarely a good idea to split out your data.

```data want;
set have;
group=ifn(_n_<100,1,2);
if a=1 and b=2 then flag1=1;
if a=2 and b=2 then flag2=1;
...
run;```
SAS Super FREQ
Posts: 4,277

## Re: Splitting a dataset into equal groups, but with a by statement

I don't think you want to split your data to run a simulation. See the article "Simulation in SAS: The slow way or the BY way" for tips about how to use BY processing to speed up your simulation study.

Occasional Contributor
Posts: 7

## Re: Splitting a dataset into equal groups, but with a by statement

Dear Rick,

i do not want to split to run the simulation. The dataset ANALYSIS is already a dataset simulated say 1000 times, with an index column for simulation (I=1,2,...,1000).

What i need to do is to split the simulated dataset ANALYSIS in say 10 sub-datasets, but that needs to be done for each simulation run separately, i.e. the splitting into 10 sub-datasets needs to be done 1000 times resulting in 10000 sub-datasets in total,

The macro "split2" works fine in case of one simulation run. i was wondering whether it was possible to include a BY statement inside the macro "split2, that is all.

Super User
Posts: 24,026

## Re: Splitting a dataset into equal groups, but with a by statement

@Osama_Almalik in general, splitting data sets is not recommended for processing. Unless you have a really really large data set you're not likely to find any efficiency in this approach.

http://www2.sas.com/proceedings/forum2007/183-2007.pdf

1000 subsets sounds likes a recipe for mistakes...

Occasional Contributor
Posts: 7

## Re: Splitting a dataset into equal groups, but with a by statement

Hi Reeza,

Actually i am dealing with a huge dataset.

Super User
Posts: 24,026

## Re: Splitting a dataset into equal groups, but with a by statement

Well, we've told you this approach isn't the most efficient. Here's how you split data sets...

I don't think the %split macro can easily be modified, but there are many solutions out there to split BY groups.

See the examples here or search on here for further examples.

https://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

Note that once you go this route you'll need to write macros to process everything so you'll need to brush up on that anyways. So writing this first macro will be a good practice exercise.

Good Luck.

SAS Super FREQ
Posts: 4,277

## Re: Splitting a dataset into equal groups, but with a by statement

When I hear "split into 10 subsamples," I tend to think about 10-fold cross-validation. But cross-validation would be done within each (simulated) sample, not across samples. Can you explain what problem you are trying to solve? What is the statistical question you are trying to answer?

Occasional Contributor
Posts: 7

## Re: Splitting a dataset into equal groups, but with a by statement

[ Edited ]

Dear Rick,

i attached a simplified macro of what i am doing.

If you run the macro, you will get a dataset named "example_rick" with three columns: sim (simulation run, only two runs for simplification), factor: 50 levels per simulation run, and response per level of factor.

What i am trying is basicly:

-Per simulation run separately: create 5 sub-datasets each containing 10 factors, simply based on order. So the first subdatset will contain the factors, 1 to 10 the second subdataset factors 11 to 20, etc.

As mentioned earlier, the macro "split2" works fine in case you have only one simulation run. i would love to include a by statement in that macro "Split2", but i have not been able to do that.

Thanks a lot for your time.

SAS Super FREQ
Posts: 4,277

## Re: Splitting a dataset into equal groups, but with a by statement

Let me ask again. Can you explain what problem you are trying to solve? What is the statistical question you are trying to answer?  For example, "I am simulating an ANOVA model and I want to approximate the sampling distribution of the test statistics for the XYZ test under the assumption that....."

Occasional Contributor
Posts: 7

## Re: Splitting a dataset into equal groups, but with a by statement

Dear Rick,

Actually i am not at liberty to disclose that, and to be honest i do not see how that is relevant for this specific question.

But let's say this: For each splitted data set, i want to test whether there is a significant difference in response between the 10 factors, assuming the dataset can be expanded by adding repetitions per factor (e.g. 3 measurements per factor instead of 1).

SAS Super FREQ
Posts: 4,277

## Re: Splitting a dataset into equal groups, but with a by statement

The relevance of my question is that sometimes people ask "How do I do [a complicated or inefficient thing] with the software?" when a better question might be "I want to solve the following problem: .... What is the best way to accomplish that with the software?" Sometimes we (myself included) try to solve a complex programming task because we are not aware that SAS provides a more elegant solution. I am always grateful when someone can steer me in a more productive direction.

I understand if you can't share details. Good luck in your analysis. If you are planning to do a lot of simulations in SAS, you might be interested in the book Simulating Data with SAS.

Occasional Contributor
Posts: 7

## Re: Splitting a dataset into equal groups, but with a by statement

i can not share the details of my reseach. Hopefully someone else might have a solution for this problem.

Solution
‎11-06-2017 01:42 PM
Occasional Contributor
Posts: 7

## Re: Splitting a dataset into equal groups, but with a by statement

[ Edited ]

For those who are interested: i managed to find some kind of solution by adjusting the macro "SPLIT2", see below. This results in an extra index for simulation run, beside the number of the split dataset of course. You later combine all using ARRAYME.

Thanks everybody for the effort.

%DO SIM = 1 %TO &SIMS;

DATA ANALYSIS_&SIM;
SET ANALYSIS;
WHERE SIM = &SIM;
RUN;

%END;

%macro split2(num,SIMS);
%DO SIM = 1 %TO &SIMS;
data _null_;
if 0 then set ANALYSIS_&SIM nobs=count;
call symput('numobs',put(count,8.));
run;
%let n=%sysevalf(&numobs/&num,ceil);
data %do J=1 %to &num ;
SAMPLE_&SIM&J %end; ;
set ANALYSIS_&SIM;
%do I=1 %to &num;
if %eval(&n*(&i-1)) <_n_ <= %eval(&n*&I)
then output SAMPLE_&SIM&I;
%end;
run;
%END;

%mend split2;

☑ This topic is solved.