BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Osama_Almalik
Fluorite | Level 6

Hallo,

i came accross the macro "split2" which splits a big dataset into subgroups. The macro is also included below the message, see also http://www2.sas.com/proceedings/sugi28/075-28.pdf. 

It works fine to split my dataset ANALYSIS into subgroups named SAMPLE_I, with the number of I's (num) predetermined.
However, i try to carry out simulation runs, with each simulation run containing its own ANALYSIS dataset. So i need to run the same macro for each simulation run, separately.

i know i need to use a BY statement, but i was not able to include the BY RUN correctly.

Anyone knows how to adjust the macro below by including the BY RUN statement?

Any help is really appreciated.

 

%macro split2(num);
data _null_;
if 0 then set ANALYSIS nobs=count;
call symput('numobs',put(count,8.));
run;
%let n=%sysevalf(&numobs/&num,ceil);
data %do J=1 %to &num ; SAMPLE_&J %end; ;
set ANALYSIS;
%do I=1 %to #
if %eval(&n*(&i-1)) <_n_ <= %eval(&n*&I)
then output SAMPLE_&I;
%end;
run;
%mend split2;

%split2(&GROUPS);

1 ACCEPTED SOLUTION

Accepted Solutions
Osama_Almalik
Fluorite | Level 6

For those who are interested: i managed to find some kind of solution by adjusting the macro "SPLIT2", see below. This results in an extra index for simulation run, beside the number of the split dataset of course. You later combine all using ARRAYME.

Thanks everybody for the effort.

 

%DO SIM = 1 %TO &SIMS;

DATA ANALYSIS_&SIM;
SET ANALYSIS;
WHERE SIM = &SIM;
RUN;

%END;

 

%macro split2(num,SIMS);
%DO SIM = 1 %TO &SIMS;
data _null_;
if 0 then set ANALYSIS_&SIM nobs=count;
call symput('numobs',put(count,8.));
run;
%let n=%sysevalf(&numobs/&num,ceil);
data %do J=1 %to &num ;
SAMPLE_&SIM&J %end; ;
set ANALYSIS_&SIM;
%do I=1 %to &num;
if %eval(&n*(&i-1)) <_n_ <= %eval(&n*&I)
then output SAMPLE_&SIM&I;
%end;
run;
%END;

%mend split2;

View solution in original post

13 REPLIES 13
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Why do you need to split your data out?  Can you not just assign various flags within the data you have now, and then use the flags to where clause the data out as needed.  In that way you can also add group variables so you can give by groups as well.  Its rarely a good idea to split out your data.

data want;
  set have;
  group=ifn(_n_<100,1,2);
  if a=1 and b=2 then flag1=1;
  if a=2 and b=2 then flag2=1;
...
run;
Rick_SAS
SAS Super FREQ

I don't think you want to split your data to run a simulation. See the article "Simulation in SAS: The slow way or the BY way" for tips about how to use BY processing to speed up your simulation study.

Osama_Almalik
Fluorite | Level 6

Dear Rick,

Thanks for your answer.

i do not want to split to run the simulation. The dataset ANALYSIS is already a dataset simulated say 1000 times, with an index column for simulation (I=1,2,...,1000).

What i need to do is to split the simulated dataset ANALYSIS in say 10 sub-datasets, but that needs to be done for each simulation run separately, i.e. the splitting into 10 sub-datasets needs to be done 1000 times resulting in 10000 sub-datasets in total,

The macro "split2" works fine in case of one simulation run. i was wondering whether it was possible to include a BY statement inside the macro "split2, that is all.

Thanks in advance for any additional ideas.

Reeza
Super User

@Osama_Almalik in general, splitting data sets is not recommended for processing. Unless you have a really really large data set you're not likely to find any efficiency in this approach. 

 

http://www2.sas.com/proceedings/forum2007/183-2007.pdf

 

1000 subsets sounds likes a recipe for mistakes...

Osama_Almalik
Fluorite | Level 6

Hi Reeza,

Actually i am dealing with a huge dataset.

Reeza
Super User

Well, we've told you this approach isn't the most efficient. Here's how you split data sets...

 

 

I don't think the %split macro can easily be modified, but there are many solutions out there to split BY groups. 

See the examples here or search on here for further examples.

 

https://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

 

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

 

Note that once you go this route you'll need to write macros to process everything so you'll need to brush up on that anyways. So writing this first macro will be a good practice exercise. 


Good Luck. 

Rick_SAS
SAS Super FREQ

When I hear "split into 10 subsamples," I tend to think about 10-fold cross-validation. But cross-validation would be done within each (simulated) sample, not across samples. Can you explain what problem you are trying to solve? What is the statistical question you are trying to answer?

Osama_Almalik
Fluorite | Level 6

Dear Rick,

i attached a simplified macro of what i am doing.

If you run the macro, you will get a dataset named "example_rick" with three columns: sim (simulation run, only two runs for simplification), factor: 50 levels per simulation run, and response per level of factor.

What i am trying is basicly:

-Per simulation run separately: create 5 sub-datasets each containing 10 factors, simply based on order. So the first subdatset will contain the factors, 1 to 10 the second subdataset factors 11 to 20, etc.

As mentioned earlier, the macro "split2" works fine in case you have only one simulation run. i would love to include a by statement in that macro "Split2", but i have not been able to do that.

Thanks a lot for your time.

Rick_SAS
SAS Super FREQ

Let me ask again. Can you explain what problem you are trying to solve? What is the statistical question you are trying to answer?  For example, "I am simulating an ANOVA model and I want to approximate the sampling distribution of the test statistics for the XYZ test under the assumption that....." 

Osama_Almalik
Fluorite | Level 6

Dear Rick,

Actually i am not at liberty to disclose that, and to be honest i do not see how that is relevant for this specific question.

But let's say this: For each splitted data set, i want to test whether there is a significant difference in response between the 10 factors, assuming the dataset can be expanded by adding repetitions per factor (e.g. 3 measurements per factor instead of 1).

Again, thanks for your time.

Rick_SAS
SAS Super FREQ

The relevance of my question is that sometimes people ask "How do I do [a complicated or inefficient thing] with the software?" when a better question might be "I want to solve the following problem: .... What is the best way to accomplish that with the software?" Sometimes we (myself included) try to solve a complex programming task because we are not aware that SAS provides a more elegant solution. I am always grateful when someone can steer me in a more productive direction.

 

I understand if you can't share details. Good luck in your analysis. If you are planning to do a lot of simulations in SAS, you might be interested in the book Simulating Data with SAS.

Osama_Almalik
Fluorite | Level 6

i can not share the details of my reseach. Hopefully someone else might have a solution for this problem.

Osama_Almalik
Fluorite | Level 6

For those who are interested: i managed to find some kind of solution by adjusting the macro "SPLIT2", see below. This results in an extra index for simulation run, beside the number of the split dataset of course. You later combine all using ARRAYME.

Thanks everybody for the effort.

 

%DO SIM = 1 %TO &SIMS;

DATA ANALYSIS_&SIM;
SET ANALYSIS;
WHERE SIM = &SIM;
RUN;

%END;

 

%macro split2(num,SIMS);
%DO SIM = 1 %TO &SIMS;
data _null_;
if 0 then set ANALYSIS_&SIM nobs=count;
call symput('numobs',put(count,8.));
run;
%let n=%sysevalf(&numobs/&num,ceil);
data %do J=1 %to &num ;
SAMPLE_&SIM&J %end; ;
set ANALYSIS_&SIM;
%do I=1 %to &num;
if %eval(&n*(&i-1)) <_n_ <= %eval(&n*&I)
then output SAMPLE_&SIM&I;
%end;
run;
%END;

%mend split2;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 3043 views
  • 0 likes
  • 4 in conversation