BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
_el_doredo
Quartz | Level 8

Hello Experts,

 

I need to split my larger data set. I know how many datasets i want from the larger data set.

 

I want to split my larger data set into 3 data set with equal count(if total observation is completely divisible by 3) or else higher observation count in first data set

 

Using this code i am achieving the result.

 

%macro temp(dataset=,noofsplits=);

data %do i=1 %to &noofsplits.;
split&i. %end;;
retain x;
set &dataset. nobs=nobs;

if _n_ eq 1 then do;
if mod(nobs,&noofsplits.) eq 0 then
x=int(nobs/&noofsplits.);
else x=int(nobs/&noofsplits.)+1;
end;

%do i=1 %to &noofsplits;
%if &i. > 1 %then %do;
else %end;

if _n_ le (&i.*x) then output split&i.;
%end;
run;
%mend temp;
%temp(dataset=sashelp.cars , noofsplits=3);

 

SASHELP.CARS has 428 observations. So i am getting 143 in first data set and 143 in second data set and 142 in third data set. But, this part of code i am not understanding clearly %if &i. > 1 %then %do; else %end;

If we remove these lines from my code means totally it collapse(143 in first data set and 286 in second data set and 428 in third data set). It's clear that it used to start the observation from the next obs of completed data set. I am not getting how it is done.

 

It will be very helpful if some one clarify this for me and if there is any easier way to achieve the same without hard coding total data set needed means most welcome

 

Thanks in Advance

1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

This bit of code:

%if &i. > 1 %then %do; else %end;

Is generating the SAS code: else.

 

If you remove that bit code, instead of the macro generating:

 

if _n_ le (1*x) then output split1; /* (when &i=1) */
else if _n_ le (2*x) then output split2; /* (when &i=2) */
else if _n_ le (3*x) then output split3; /* (when &i=3) */

it will generate:

if _n_ le (1*x) then output split1; /* (when &i=1) */
if _n_ le (2*x) then output split2; /* (when &i=2) */
if _n_ le (3*x) then output split3; /* (when &i=3) */

That will give you the wrong result. 

 

The easiest way to see this is to run the code with MPRINT turned on, so you can see the generated code in the log.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

View solution in original post

11 REPLIES 11
PaigeMiller
Diamond | Level 26

You can just write out the code generated by this line, or have SAS do it by turning on macro debugging commands. Run this line of code, then run the macro again, and you will see in the log exactly the code generated.

 

options mprint;

 

It turns out that this will be the code generated:

 

if _n_ le (1*x) then output split1; /* (when &i=1) */
else if _n_ le (2*x) then output split2; /* (when &i=2) */
else if _n_ le (3*x) then output split3; /* (when &i=3) */

 

I usually question why this is necessary, in fact I usually discourage people from doing this as in most cases there is little or no benefit (although there are exceptions, of course). Why are you doing this? Why can't you have just one data set with the numbers 1 through 3 stored in a new variable?

--
Paige Miller
_el_doredo
Quartz | Level 8

Hello,

I have tried that. I understood that. when &i value equals 2 how it know to start from observation 144. That's what i am not understanding

If &i value is 1, this statement fails %if &i. > 1 %then %do; else %end; But, still Split1 displays with 143 observation
If &i value is 2 means how it knows to start from observation 144( as 143 observations sent into split1)

This was my concern and can you please send me the way which you are suggesting

PaigeMiller
Diamond | Level 26

This part of the macro always executes, it is not conditional. Since the loop goes from 1 to 3, it is executed three times.

 

if _n_ le (&i.*x) then output split&i.;

 

This part of the macro code only executes when &i is greater than 1

 

else 

 

The loop goes 1 to 3, so it only executed for &i=2 and &i=3. So you get the resulting code I showed.

 

You also asked for simpler approaches. I usually question why this splitting a data set in this fashion is necessary, in fact I usually discourage people from doing this as in most cases there is little or no benefit (although there are exceptions, of course). Why are you doing this? Why can't you have just one data set with the numbers 1 through 3 stored in a new variable? Wouldn't that be simpler?

 

--
Paige Miller
_el_doredo
Quartz | Level 8
If i remove this code(%if &i. > 1 %then %do; else %end;) means then why i am not getting the correct result.

The method which you suggest is we have to hard code it. i am creating a variable called temp to store 1 to 3. then my code will be like this
if temp=1 then output split1
else if temp=2 then output split2
else if temp=3 then output split3

But,Suppose in future if i want to create 10 datasets means i need to write more lines. So, i dont want to hard code the value here.
PaigeMiller
Diamond | Level 26

@_el_doredo wrote:
If i remove this code(%if &i. > 1 %then %do; else %end;) means then why i am not getting the correct result.

Because you have modified the logic in the code, you get different results. You should not get the same results if you change the logic. There is no ELSE now in the code, and so different things happen. Take the three lines of code I wrote in my first reply, and look at them in the ELSE is now gone. What do you think will happen when that code executes with no ELSE in the code?

 

The method which you suggest is we have to hard code it. i am creating a variable called temp to store 1 to 3. then my code will be like this
if temp=1 then output split1
else if temp=2 then output split2
else if temp=3 then output split3

But,Suppose in future if i want to create 10 datasets means i need to write more lines. So, i dont want to hard code the value here.


Absolutely not. There is no hard-coding of anything (other than the number 3 which is a macro argument) anywhere in what I am suggesting. But none of this explains why you need separate data sets. What can you do with separate data sets that you can't do with a single data set that has the split number in a variable?

--
Paige Miller
_el_doredo
Quartz | Level 8
As you know most of our daily works we don't need this. But, For interview purpose they are asking these kinds of questions so i want to be sure on that. I don't want to be in a position where i don't know some thing at this level.

Thank you so much for explaining. I will try the method which you suggest
PaigeMiller
Diamond | Level 26

@_el_doredo wrote:
As you know most of our daily works we don't need this. But, For interview purpose they are asking these kinds of questions so i want to be sure on that. I don't want to be in a position where i don't know some thing at this level.

If an interviewer asks a question about splitting data sets into many smaller data sets, then the first answer should be that this is rarely necessary in SAS, which has much better tools for dealing with "groups" of data points, much better than splitting the data set into many smaller parts. Those tools are the BY statement and the WHERE statement. And so, in answering the interviewer's question, you should say that you would see if those tools would be a better solution for the specific problem.

--
Paige Miller
Quentin
Super User

This bit of code:

%if &i. > 1 %then %do; else %end;

Is generating the SAS code: else.

 

If you remove that bit code, instead of the macro generating:

 

if _n_ le (1*x) then output split1; /* (when &i=1) */
else if _n_ le (2*x) then output split2; /* (when &i=2) */
else if _n_ le (3*x) then output split3; /* (when &i=3) */

it will generate:

if _n_ le (1*x) then output split1; /* (when &i=1) */
if _n_ le (2*x) then output split2; /* (when &i=2) */
if _n_ le (3*x) then output split3; /* (when &i=3) */

That will give you the wrong result. 

 

The easiest way to see this is to run the code with MPRINT turned on, so you can see the generated code in the log.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
_el_doredo
Quartz | Level 8
Thanks a lot
Astounding
PROC Star

Just to muddy the waters a little bit ...

 

The macro you use takes the first 142 observations and puts them in the first split.  Would there be any advantage to a slightly different approach?  For example, take observations 1, 4, 7, 10, etc. for the first split, observations 2, 5, 8, 11, etc. for the second split?  Or even a random assignment?

Kurt_Bremser
Super User

What about this?

%macro temp(dataset=,noofsplits=);
data
%do i = 1 %to &noofsplits.;
  split&i.
%end;
;
set &dataset.;
select (mod(_n_,&noofsplits.));
%do i = 1 %to &noofsplits.;
  when (%eval(&i. - 1)) output split&i.;
%end;
end;
run;
%mend temp;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 969 views
  • 1 like
  • 5 in conversation