We have encountered a strange behaviour in the data step using a SET statement with multiple datasets.
The situation is as follows:
data ds1;
x = 1;
y = 1;
output;
x = 2;
y = .;
output;
x = 3;
y = 3;
output;
run;
data ds2;
x = 4;
output;
x = 5;
output;
x = 6;
output;
run;
ds1 has two columns x and y whereas ds only has the x column.
In the next step, I want to concatenate the two datasets into one and replace missing values in column y with values from x.
data ds_combined;
set ds1 ds2;
if missing(y) then
y = x;
run;
The result looks like this
x y 1 1 2 2 3 3 4 4 5 4 6 4
which is not what I expected.
I narrowed the issue down to the fact that the y column is in ds1 but not ds2. Everything works as expected
My understanding was that I can use multiple datasets in one SET statement and process the rows as if they come from just one dataset. Apperently this is not the case. Can anyone explain why this happens? Thanks in advance!
You don't clearly state what you WANT the output to look like. But if the goal is to UPDATE missing values in one dataset with the values from another, you may want to look at the UPDATE statement. And for that, your data sets really should have an ID so you can tell WHICH row needs updating. Try something like this, maybe?
data ds1;
ID+1;
x = 1;
y = 1;
output;
ID+1;
x = 2;
y = .;
output;
ID+1;
x = 3;
y = 3;
output;
run;
/* Create an update dataset for those rows with missing values */
data ds2;
set ds1;
where y is missing;
y=x;
run;
/* Update the missing valuse */
data ds_combined;
update ds1 ds2;
by ID;
run;
I wouldn't say it is a severe pitfall. If one understands the behaviour it's natural way to behave. 🙂
Bart
Hi,
if you add some diagnostic to the data step:
data ds_combined;
put "1) " _ALL_;
set ds1 ds2 indsname=i;
put "2) " _ALL_;
if missing(y) then
y = x;
put "3) " _ALL_;
run;
you will see in the log:
1) i= x=. y=. _ERROR_=0 _N_=1 2) i=WORK.DS1 x=1 y=1 _ERROR_=0 _N_=1 3) i=WORK.DS1 x=1 y=1 _ERROR_=0 _N_=1 1) i=WORK.DS1 x=1 y=1 _ERROR_=0 _N_=2 2) i=WORK.DS1 x=2 y=. _ERROR_=0 _N_=2 3) i=WORK.DS1 x=2 y=2 _ERROR_=0 _N_=2 1) i=WORK.DS1 x=2 y=2 _ERROR_=0 _N_=3 2) i=WORK.DS1 x=3 y=3 _ERROR_=0 _N_=3 3) i=WORK.DS1 x=3 y=3 _ERROR_=0 _N_=3 1) i=WORK.DS1 x=3 y=3 _ERROR_=0 _N_=4 2) i=WORK.DS2 x=4 y=. _ERROR_=0 _N_=4 3) i=WORK.DS2 x=4 y=4 _ERROR_=0 _N_=4 1) i=WORK.DS2 x=4 y=4 _ERROR_=0 _N_=5 2) i=WORK.DS2 x=5 y=4 _ERROR_=0 _N_=5 3) i=WORK.DS2 x=5 y=4 _ERROR_=0 _N_=5 1) i=WORK.DS2 x=5 y=4 _ERROR_=0 _N_=6 2) i=WORK.DS2 x=6 y=4 _ERROR_=0 _N_=6 3) i=WORK.DS2 x=6 y=4 _ERROR_=0 _N_=6 1) i=WORK.DS2 x=6 y=4 _ERROR_=0 _N_=7
Up to _N_ = 3 everything is ok.
In the "2)" for _N_ = 4 so the first place where values form DS2 were used y was set to missing. Since it was missing the IF condition was true so value was replaced by 4. Since Y came from a SAS dataset its value is automatically retained (see "1)" for _N_=5). Also since DS2 doesn't have y as a variable walues of y aren't overwritten.
Reversing:
set ds2 ds1;
will show you similar effect.
All the best
Bart
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.