Hi,
sometimes I see a code like this:
data want;
set have;
by some_variable
...
Is the "by" statement in the data step similar to the "group by" statement in proc sql? I.E, does it tell the data step to perform all the subsequent opertaions for all obervations having the same value of "some_variable" as if it were a subdata by itself, independently of other observations having other values of "some_variable"?
Thank you
/* You can get close to the SQL group by clause behavior with a
datastep using the do until(last.some_variable) statement */
data want;
mySum = 0;
do until(last.some_variable);
set have;
by some_variable;
mySum = mySum + x;
end;
output;
keep some_variable mySum;
run;
/* Note that you can refer to last.some_variable even before the
by statement that creates it */
/* The previous code is equivalent to: */
proc sql;
create table want as
select
some_variable,
sum(x) as mySum
from have
group by some_variable;
quit;
/* The datastep requires the dataset have to be sorted by some_variable */
No, not really.
In a data step, the "by x;" creates the "first.x" and "last.x" temporary variables, which you can incorporate into your conditional logic to get results by whatever x is.
So if I have the following data "have":
value | group_A | group_B |
1 | 1 | a |
2 | 1 | a |
3 | 1 | a |
4 | 1 | b |
5 | 1 | b |
6 | 1 | b |
7 | 2 | a |
8 | 2 | a |
9 | 2 | a |
10 | 2 | b |
11 | 2 | b |
12 | 2 | b |
If I do:
data want;
set have;
by group_A group_B
...
Then first.group_A will refer to the 1st and 7th observation, and last.group_B will refer to the 3rd, 6th, 9th and 12th observations?
That looks right. You can always test by creating permanent variables from the temporary to see what's going on:
first_x = first.x;
etc.
How you make use of them depends on what you want to do, but they're very useful.
/* You can get close to the SQL group by clause behavior with a
datastep using the do until(last.some_variable) statement */
data want;
mySum = 0;
do until(last.some_variable);
set have;
by some_variable;
mySum = mySum + x;
end;
output;
keep some_variable mySum;
run;
/* Note that you can refer to last.some_variable even before the
by statement that creates it */
/* The previous code is equivalent to: */
proc sql;
create table want as
select
some_variable,
sum(x) as mySum
from have
group by some_variable;
quit;
/* The datastep requires the dataset have to be sorted by some_variable */
Thanks PGStats for the explanation!
I think that with this post you also answered a question that I posted as a reply to Astounding's solution for the topic Keep Record Based on Value of a Prior or Proceeding Observation. Please let me know if that case is in fact a data step version of SQL By Group
Be aware that by group processing in a data step requires sorting of the input dataset. SQL does that on the fly on its own.
Using the notsorted option in the by statement might produce several separate summaries for each by group, while SQL will automatically summarize all instances of a by group into one.
eg
data have;
input byval $ value;
cards;
a 5
a 2
b 3
b 4
a 3
;
run;
proc sql;
create table want1 as
select byval, sum(value) as value
from have
group by byval
;
quit;
data want2;
set have (rename=(value=oldval));
by byval notsorted;
retain value;
if first.byval then value = 0;
value + oldval;
if last.byval then output;
drop oldval;
run;
proc print data=want1 noobs;
run;
proc print data=want2 noobs;
run;
ends with these results:
byval value a 10 b 7 byval value a 7 b 7 a 3
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.