DATA Step, Macro, Functions and more

data step set by

Accepted Solution Solved
Reply
Super Contributor
Posts: 413
Accepted Solution

data step set by

Hi,

 

sometimes I see a code like this:

 

data want;

set have;

by some_variable

...

 

Is the "by" statement in the data step similar to the "group by" statement in proc sql? I.E, does it tell the data step to perform all the subsequent opertaions for all obervations having the same value of "some_variable" as if it were a subdata by itself, independently of other observations having other values of "some_variable"?

 

Thank you 


Accepted Solutions
Solution
‎03-28-2017 12:33 PM
Respected Advisor
Posts: 4,644

Re: data step set by

/* You can get close to the SQL group by clause behavior with a 
 datastep using the do until(last.some_variable) statement */
 
data want;
mySum = 0;
do until(last.some_variable); 
	set have; 
	by some_variable;
	mySum = mySum + x;
	end;
output;
keep some_variable mySum;
run;

/* Note that you can refer to last.some_variable even before the 
 by statement that creates it */
 
 /* The previous code is equivalent to: */
 proc sql;
 create table want as
 select 
	some_variable, 
	sum(x) as mySum
from have
group by some_variable;
quit;

/* The datastep requires the dataset have to be sorted by some_variable */
PG

View solution in original post


All Replies
PROC Star
Posts: 288

Re: data step set by

No, not really.

 

In a data step, the "by x;" creates the "first.x" and "last.x" temporary variables, which you can incorporate into your conditional logic to get results by whatever x is.

Super Contributor
Posts: 413

Re: data step set by

So if I have the following data "have":

value group_A group_B
1 1 a
2 1 a
3 1 a
4 1 b
5 1 b
6 1 b
7 2 a
8 2 a
9 2 a
10 2 b
11 2 b
12 2 b

 

If I do:

 

data want;

set have;

by group_A group_B

...

 

Then first.group_A will refer to the 1st and 7th observation, and last.group_B will refer to the 3rd, 6th, 9th and 12th observations?

PROC Star
Posts: 288

Re: data step set by

That looks right. You can always test by creating permanent variables from the temporary to see what's going on:

 

first_x = first.x;

 

etc.

 

How you make use of them depends on what you want to do, but they're very useful.

Contributor
Posts: 56

Re: data step set by

if you have to retrieve some other calculations depend on group by observations we have to use by statement in datastep. In datastep we can use first.variables and last.variables for further calculations.
Solution
‎03-28-2017 12:33 PM
Respected Advisor
Posts: 4,644

Re: data step set by

/* You can get close to the SQL group by clause behavior with a 
 datastep using the do until(last.some_variable) statement */
 
data want;
mySum = 0;
do until(last.some_variable); 
	set have; 
	by some_variable;
	mySum = mySum + x;
	end;
output;
keep some_variable mySum;
run;

/* Note that you can refer to last.some_variable even before the 
 by statement that creates it */
 
 /* The previous code is equivalent to: */
 proc sql;
 create table want as
 select 
	some_variable, 
	sum(x) as mySum
from have
group by some_variable;
quit;

/* The datastep requires the dataset have to be sorted by some_variable */
PG
Super Contributor
Posts: 413

Re: data step set by

Thanks PGStats for the explanation!

 

I think that with this post you also answered a question that I posted as a reply to Astounding's solution for the topic Keep Record Based on Value of a Prior or Proceeding Observation. Please let me know if that case is in fact a data step version of SQL By GroupSmiley Surprised

Super User
Posts: 6,936

Re: data step set by

Be aware that by group processing in a data step requires sorting of the input dataset. SQL does that on the fly on its own.

 

Using the notsorted option in the by statement might produce several separate summaries for each by group, while SQL will automatically summarize all instances of a by group into one.

 

eg

data have;
input byval $ value;
cards;
a 5
a 2
b 3
b 4
a 3
;
run;

proc sql;
create table want1 as
select byval, sum(value) as value
from have
group by byval
;
quit;

data want2;
set have (rename=(value=oldval));
by byval notsorted;
retain value;
if first.byval then value = 0;
value + oldval;
if last.byval then output;
drop oldval;
run;

proc print data=want1 noobs;
run;

proc print data=want2 noobs;
run;

ends with these results:

byval    value

  a        10 
  b         7 
              

byval    value

  a        7  
  b        7  
  a        3  
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 198 views
  • 1 like
  • 5 in conversation