## What happened when by statement is used in data step

Solved
Occasional Contributor
Posts: 9

```sas
data aaa;
    aa = 1; b = 3; output;
run;

data ccc;
    aa = 2; output;
    aa = 3; output;
    aa = 3; output;
    aa = 3; output;
    aa = 4; output;
    aa = 5; output;
run;

data aab;
    put _all_;
    set aaa ccc;
    /* by aa; */
    if aa = 3 then do;
        b = 1;
        b1 = 2;
    end;
    put _all_;
run;
```

As shown above, when I comment out the BY statement, b is retained as 1 when aa = 4 and 5. However, when I uncomment the BY statement, the value of b becomes missing. What happens when a BY statement is used together with a SET statement?

By the way, if the SET statement is replaced by a MERGE statement, then whether or not the BY statement is commented out, the value of b never becomes 1 when aa = 4 or 5.

Accepted Solutions
Solution
‎10-12-2016 02:33 PM
Super User
Posts: 6,751

## Re: What happened when by statement is used in data step

Consider an abbreviated version of your example:

```sas
data combined;
    set aaa ccc;
    by aa;
run;
```

AAA contains both AA and B. CCC contains AA only.

As the DATA step processes the observations, it interleaves them, reading from AAA or CCC according to the value of AA. As part of that process, whenever it switches from one data set to the other, it reinitializes B to missing. After that, if the next observation comes from AAA, the value of B in that observation replaces the missing value. If the next observation comes from CCC, nothing replaces B, so it stays missing.
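One way to watch this happen is to log where each observation comes from. The following is a minimal sketch, not a definitive diagnosis: it recreates AAA and CCC as in the question, and the names `from_aaa` and `from_ccc` are illustrative IN= flags that are not part of the original program. SAS documentation on interleaving, as far as I recall, also describes a reset to missing whenever the BY group changes; the `first.aa` flag in the log marks those points.

```sas
/* Recreate the data sets from the question so the sketch runs on its own */
data aaa;
    aa = 1; b = 3; output;
run;

data ccc;
    do aa = 2, 3, 3, 3, 4, 5;
        output;
    end;
run;

/* from_aaa and from_ccc are hypothetical flag names chosen for this sketch */
data combined;
    set aaa(in=from_aaa) ccc(in=from_ccc);
    by aa;
    /* Log the source data set, the BY-group boundary, and B as read */
    put aa= b= from_aaa= from_ccc= first.aa=;
run;
```

Because CCC has no variable B and this step never assigns one, the log should show B = 3 only for the observation read from AAA and a missing B for every observation read from CCC.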

All Replies
Super User
Posts: 6,751

## Re: What happened when by statement is used in data step

You're looking at the effects of a few features.

Variables that come from a SAS data set are automatically retained. That includes B, since it comes from AAA. Without a BY statement, once you set B to 1 nothing replaces it for the rest of the DATA step, so it remains 1 from that point forward. The software still has to decide when to set variables read from a SAS data set to missing, and it does so whenever it switches from one data set to another.

You might be interested to compare that with what happens if you make a slight change to your program, replacing the `aa = 3` condition with:

```sas
if aa=4 then do;
```

With a BY statement, the software has an additional decision to make: should it ever reset retained variables to missing? The answer depends on whether you use SET or MERGE. With SET + BY, the software resets retained variables to missing when it begins reading observations from a new data set. With MERGE + BY, the software resets retained variables to missing when it begins a new value of a BY variable.
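To compare the three cases side by side, here is a minimal sketch that runs the same assignment under SET alone, SET + BY, and MERGE + BY and writes B to the log on every iteration. It recreates AAA and CCC as in the question; the output data set names (`set_only`, `set_by`, `merge_by`) and the log labels are made up for illustration. Note that SAS documentation on interleaving, as I understand it, also describes a reset when the BY group changes under SET + BY, which would account for B going missing at aa = 4 and 5 even though those observations all come from CCC.

```sas
/* Recreate the data sets from the question so the sketch runs on its own */
data aaa;
    aa = 1; b = 3; output;
run;

data ccc;
    do aa = 2, 3, 3, 3, 4, 5;
        output;
    end;
run;

/* SET without BY: plain concatenation */
data set_only;
    set aaa ccc;
    if aa = 3 then b = 1;
    put 'SET only:   ' aa= b=;
run;

/* SET with BY: interleaving */
data set_by;
    set aaa ccc;
    by aa;
    if aa = 3 then b = 1;
    put 'SET + BY:   ' aa= b=;
run;

/* MERGE with BY: match-merge */
data merge_by;
    merge aaa ccc;
    by aa;
    if aa = 3 then b = 1;
    put 'MERGE + BY: ' aa= b=;
run;
```

Going by what is reported in this thread, the first step should keep B = 1 at aa = 4 and 5, while the second and third should not. Comparing the three sets of log lines is a quick way to confirm that on your own installation.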

Occasional Contributor
Posts: 9

## Re: What happened when by statement is used in data step

"With SET + BY, the software resets retained variables to missing when it begins reading observations from a new data set."

Could you explain more about what "a new data set" means here? I would be very grateful.
