BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ilikesas
Barite | Level 11

Hi,

 

sometimes I see a code like this:

 

data want;

set have;

by some_variable

...

 

Is the "by" statement in the data step similar to the "group by" statement in proc sql? I.E, does it tell the data step to perform all the subsequent opertaions for all obervations having the same value of "some_variable" as if it were a subdata by itself, independently of other observations having other values of "some_variable"?

 

Thank you 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21
/* You can get close to the SQL group by clause behavior with a 
 datastep using the do until(last.some_variable) statement */
 
data want;
mySum = 0;
do until(last.some_variable); 
	set have; 
	by some_variable;
	mySum = mySum + x;
	end;
output;
keep some_variable mySum;
run;

/* Note that you can refer to last.some_variable even before the 
 by statement that creates it */
 
 /* The previous code is equivalent to: */
 proc sql;
 create table want as
 select 
	some_variable, 
	sum(x) as mySum
from have
group by some_variable;
quit;

/* The datastep requires the dataset have to be sorted by some_variable */
PG

View solution in original post

7 REPLIES 7
collinelliot
Barite | Level 11

No, not really.

 

In a data step, the "by x;" creates the "first.x" and "last.x" temporary variables, which you can incorporate into your conditional logic to get results by whatever x is.

ilikesas
Barite | Level 11

So if I have the following data "have":

value group_A group_B
1 1 a
2 1 a
3 1 a
4 1 b
5 1 b
6 1 b
7 2 a
8 2 a
9 2 a
10 2 b
11 2 b
12 2 b

 

If I do:

 

data want;

set have;

by group_A group_B

...

 

Then first.group_A will refer to the 1st and 7th observation, and last.group_B will refer to the 3rd, 6th, 9th and 12th observations?

collinelliot
Barite | Level 11

That looks right. You can always test by creating permanent variables from the temporary to see what's going on:

 

first_x = first.x;

 

etc.

 

How you make use of them depends on what you want to do, but they're very useful.

lakshmi_74
Quartz | Level 8
if you have to retrieve some other calculations depend on group by observations we have to use by statement in datastep. In datastep we can use first.variables and last.variables for further calculations.
PGStats
Opal | Level 21
/* You can get close to the SQL group by clause behavior with a 
 datastep using the do until(last.some_variable) statement */
 
data want;
mySum = 0;
do until(last.some_variable); 
	set have; 
	by some_variable;
	mySum = mySum + x;
	end;
output;
keep some_variable mySum;
run;

/* Note that you can refer to last.some_variable even before the 
 by statement that creates it */
 
 /* The previous code is equivalent to: */
 proc sql;
 create table want as
 select 
	some_variable, 
	sum(x) as mySum
from have
group by some_variable;
quit;

/* The datastep requires the dataset have to be sorted by some_variable */
PG
ilikesas
Barite | Level 11

Thanks PGStats for the explanation!

 

I think that with this post you also answered a question that I posted as a reply to Astounding's solution for the topic Keep Record Based on Value of a Prior or Proceeding Observation. Please let me know if that case is in fact a data step version of SQL By GroupSmiley Surprised

Kurt_Bremser
Super User

Be aware that by group processing in a data step requires sorting of the input dataset. SQL does that on the fly on its own.

 

Using the notsorted option in the by statement might produce several separate summaries for each by group, while SQL will automatically summarize all instances of a by group into one.

 

eg

data have;
input byval $ value;
cards;
a 5
a 2
b 3
b 4
a 3
;
run;

proc sql;
create table want1 as
select byval, sum(value) as value
from have
group by byval
;
quit;

data want2;
set have (rename=(value=oldval));
by byval notsorted;
retain value;
if first.byval then value = 0;
value + oldval;
if last.byval then output;
drop oldval;
run;

proc print data=want1 noobs;
run;

proc print data=want2 noobs;
run;

ends with these results:

byval    value

  a        10 
  b         7 
              

byval    value

  a        7  
  b        7  
  a        3  

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1450 views
  • 1 like
  • 5 in conversation