Solved: data step set by

ilikesas · Posted 03-27-2017 07:03 PM

Hi,

sometimes I see a code like this:

data want;

set have;

by some_variable

...

Is the "by" statement in the data step similar to the "group by" statement in proc sql? I.E, does it tell the data step to perform all the subsequent opertaions for all obervations having the same value of "some_variable" as if it were a subdata by itself, independently of other observations having other values of "some_variable"?

Thank you

PGStats · Posted 03-28-2017 12:31 AM

/* You can get close to the SQL group by clause behavior with a 
 datastep using the do until(last.some_variable) statement */
 
data want;
mySum = 0;
do until(last.some_variable); 
	set have; 
	by some_variable;
	mySum = mySum + x;
	end;
output;
keep some_variable mySum;
run;

/* Note that you can refer to last.some_variable even before the 
 by statement that creates it */
 
 /* The previous code is equivalent to: */
 proc sql;
 create table want as
 select 
	some_variable, 
	sum(x) as mySum
from have
group by some_variable;
quit;

/* The datastep requires the dataset have to be sorted by some_variable */

PG

View solution in original post

collinelliot · Posted 03-27-2017 07:07 PM

No, not really.

In a data step, the "by x;" creates the "first.x" and "last.x" temporary variables, which you can incorporate into your conditional logic to get results by whatever x is.

ilikesas · Posted 03-27-2017 07:33 PM

So if I have the following data "have":

value	group_A	group_B
1	1	a
2	1	a
3	1	a
4	1	b
5	1	b
6	1	b
7	2	a
8	2	a
9	2	a
10	2	b
11	2	b
12	2	b

If I do:

data want;

set have;

by group_A group_B

...

Then first.group_A will refer to the 1st and 7th observation, and last.group_B will refer to the 3rd, 6th, 9th and 12th observations?

collinelliot · Posted 03-27-2017 07:36 PM

That looks right. You can always test by creating permanent variables from the temporary to see what's going on:

first_x = first.x;

etc.

How you make use of them depends on what you want to do, but they're very useful.

lakshmi_74 · Posted 03-27-2017 07:38 PM

if you have to retrieve some other calculations depend on group by observations we have to use by statement in datastep. In datastep we can use first.variables and last.variables for further calculations.

PGStats · Posted 03-28-2017 12:31 AM

/* You can get close to the SQL group by clause behavior with a 
 datastep using the do until(last.some_variable) statement */
 
data want;
mySum = 0;
do until(last.some_variable); 
	set have; 
	by some_variable;
	mySum = mySum + x;
	end;
output;
keep some_variable mySum;
run;

/* Note that you can refer to last.some_variable even before the 
 by statement that creates it */
 
 /* The previous code is equivalent to: */
 proc sql;
 create table want as
 select 
	some_variable, 
	sum(x) as mySum
from have
group by some_variable;
quit;

/* The datastep requires the dataset have to be sorted by some_variable */

PG

ilikesas · Posted 03-28-2017 12:44 AM

Thanks PGStats for the explanation!

I think that with this post you also answered a question that I posted as a reply to Astounding's solution for the topic Keep Record Based on Value of a Prior or Proceeding Observation. Please let me know if that case is in fact a data step version of SQL By Group

Kurt_Bremser · Posted 03-28-2017 02:46 AM

Be aware that by group processing in a data step requires sorting of the input dataset. SQL does that on the fly on its own.

Using the notsorted option in the by statement might produce several separate summaries for each by group, while SQL will automatically summarize all instances of a by group into one.

eg

data have;
input byval $ value;
cards;
a 5
a 2
b 3
b 4
a 3
;
run;

proc sql;
create table want1 as
select byval, sum(value) as value
from have
group by byval
;
quit;

data want2;
set have (rename=(value=oldval));
by byval notsorted;
retain value;
if first.byval then value = 0;
value + oldval;
if last.byval then output;
drop oldval;
run;

proc print data=want1 noobs;
run;

proc print data=want2 noobs;
run;

ends with these results:

byval    value

  a        10 
  b         7 
              

byval    value

  a        7  
  b        7  
  a        3

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

data step set by

Re: data step set by

Re: data step set by

Re: data step set by

Re: data step set by

Re: data step set by

Re: data step set by

Re: data step set by

Re: data step set by

Registration is open

SAS Training: Just a Click Away