data aaa;
  aa = 1; b = 3; output;
run;

data ccc;
  aa = 2; output;
  aa = 3; output;
  aa = 3; output;
  aa = 3; output;
  aa = 4; output;
  aa = 5; output;
run;

data aab;
  put _all_;
  set aaa ccc;
  /* by aa;*/
  if aa = 3 then do;
    b = 1;
    b1 = 2;
  end;
  put _all_;
run;
As shown above, when I comment out the BY statement, b is retained as 1 when aa = 4 and 5. However, when I uncomment the BY statement, the value of b becomes missing. What happens when a BY statement is used with a SET statement?
By the way, if the SET statement is replaced with a MERGE statement, the value of b never becomes 1 when aa = 4 or 5, whether or not the BY statement is commented out.
You're looking at the effects of a few features.
Variables that come from a SAS data set are automatically retained. That includes B, since it comes from AAA. Without a BY statement, you set B to 1 and nothing replaces B for the rest of the DATA step. So it remains 1 from that point forward. The software has to decide when to set variables to missing when they are brought in from a SAS data set, and does so whenever it switches from one data set to another.
You might be interested to compare that to what happens if you make a slight change to your program:
if aa=4 then do;
With a BY statement, the software has an additional decision to make: should it ever re-set retained variables to a missing value? The answer depends on whether you use SET or MERGE. With SET + BY, the software re-sets retained variables to missing when it begins reading observations from a new data set. With MERGE + BY, the software re-sets retained variables to missing when it begins a new value of a BY variable. Note that the SAS documentation on interleaving also describes a reset when the BY group changes, which is consistent with B becoming missing at aa = 4 and 5 in your log.
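One way to see the difference for yourself is to run both forms against the same inputs and compare the logs. The following is only a sketch: the output data set names SET_BY and MERGE_BY, and the DO loop used to rebuild CCC, are my own choices for illustration and are not part of the original program.

```sas
/* Rebuild the two inputs from the question (both are already in AA order). */
data aaa;
  aa = 1; b = 3; output;
run;

data ccc;
  do aa = 2, 3, 3, 3, 4, 5;
    output;
  end;
run;

/* Interleave: SET + BY */
data set_by;
  set aaa ccc;
  by aa;
  if aa = 3 then b = 1;
  put 'SET + BY:   ' aa= b=;   /* write AA and B to the log on every iteration */
run;

/* Match-merge: MERGE + BY */
data merge_by;
  merge aaa ccc;
  by aa;
  if aa = 3 then b = 1;
  put 'MERGE + BY: ' aa= b=;
run;
```

Comparing the B= values in the two logs at aa = 4 and 5 shows how each form treats the value that was assigned at aa = 3.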
"With SET + BY, the software re-sets retained variables to missing when it begins reading observations from a new data set."
Could you explain more about what "a new data set" means here? I would be very grateful.
Consider an abbreviated version of your example:
data combined;
  set aaa ccc;
  by aa;
run;
AAA contains both AA and B. CCC contains AA only.
As the DATA step processes the observations, it alternates reading observations from AAA and CCC. As part of that process, whenever it switches from one data set to the other, it reinitializes B to missing. After that, if the next observation comes from AAA, it replaces B. If the next observation comes from CCC, it does not replace B.
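If you want to watch the switching happen, one option is to add IN= data set options, which create temporary 0/1 flags showing which input contributed the current observation. This is a minimal sketch, not the only way to trace it. The flag names from_aaa and from_ccc are hypothetical, and the DO loop is just a compact way to rebuild CCC from the question.

```sas
/* Inputs from the question, already sorted by AA. */
data aaa;
  aa = 1; b = 3; output;
run;

data ccc;
  do aa = 2, 3, 3, 3, 4, 5;
    output;
  end;
run;

data combined;
  set aaa (in=from_aaa)
      ccc (in=from_ccc);
  by aa;
  /* from_aaa, from_ccc and FIRST.aa are temporary variables:       */
  /* they are available during the step but not written to COMBINED */
  put aa= b= from_aaa= from_ccc= first.aa=;
run;
```

Each log line then shows the source of the observation next to the current value of B, so you can line up every change in B with a change of data set or the start of a new BY group.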
You answered my question perfectly! Thanks, Astounding.