Solved: SET Statement with multiple datasets and missing columns

msauer · Posted 02-04-2022 07:52 AM

We have encountered a strange behaviour in the data step using a SET statement with multiple datasets.

The situation is as follows:

data ds1;
  x = 1;
  y = 1;
  output;
  x = 2;
  y = .;
  output;
  x = 3;
  y = 3;
  output;
run;

data ds2;
  x = 4;
  output;
  x = 5;
  output;
  x = 6;
  output;
run;

ds1 has two columns x and y whereas ds only has the x column.

In the next step, I want to concatenate the two datasets into one and replace missing values in column y with values from x.

data ds_combined;
  set ds1 ds2;
  if missing(y) then
    y = x;
run;

The result looks like this

which is not what I expected.

I narrowed the issue down to the fact that the y column is in ds1 but not ds2. Everything works as expected

if y is set to missing explicitly in ds2 or
if concatenation and replacement of missing values happen in two different data steps.

My understanding was that I can use multiple datasets in one SET statement and process the rows as if they come from just one dataset. Apperently this is not the case. Can anyone explain why this happens? Thanks in advance!

Oligolas · Posted 02-04-2022 08:27 AM

Hi,

yes it's one of the severe pitfalls of SAS.

SAS describes it as 'automatic retain' here

You'll find an explanation there

________________________
- Cheers -

View solution in original post

SASJedi · Posted 02-04-2022 08:20 AM

You don't clearly state what you WANT the output to look like. But if the goal is to UPDATE missing values in one dataset with the values from another, you may want to look at the UPDATE statement. And for that, your data sets really should have an ID so you can tell WHICH row needs updating. Try something like this, maybe?

data ds1;
  ID+1;
  x = 1;
  y = 1;
  output;
  ID+1;
  x = 2;
  y = .;
  output;
  ID+1;
  x = 3;
  y = 3;
  output;
run;

/* Create an update dataset for those rows with missing values */
data ds2;
   set ds1;
   where y is missing;
   y=x;
run;

/* Update the missing valuse */
data ds_combined;
   update ds1 ds2;
   by ID;
run;

Check out my Jedi SAS Tricks for SAS Users

Oligolas · Posted 02-04-2022 08:27 AM

Hi,

yes it's one of the severe pitfalls of SAS.

SAS describes it as 'automatic retain' here

You'll find an explanation there

________________________
- Cheers -

yabwon · Posted 02-04-2022 08:45 AM

I wouldn't say it is a severe pitfall. If one understands the behaviour it's natural way to behave. 🙂

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

yabwon · Posted 02-04-2022 08:42 AM

Hi,

if you add some diagnostic to the data step:

data ds_combined;
  put "1) " _ALL_;
  set ds1 ds2 indsname=i;

  put "2) " _ALL_;

  if missing(y) then
    y = x;

  put "3) " _ALL_;
run;

you will see in the log:

1) i=  x=. y=. _ERROR_=0 _N_=1
2) i=WORK.DS1 x=1 y=1 _ERROR_=0 _N_=1
3) i=WORK.DS1 x=1 y=1 _ERROR_=0 _N_=1
1) i=WORK.DS1 x=1 y=1 _ERROR_=0 _N_=2
2) i=WORK.DS1 x=2 y=. _ERROR_=0 _N_=2
3) i=WORK.DS1 x=2 y=2 _ERROR_=0 _N_=2
1) i=WORK.DS1 x=2 y=2 _ERROR_=0 _N_=3
2) i=WORK.DS1 x=3 y=3 _ERROR_=0 _N_=3
3) i=WORK.DS1 x=3 y=3 _ERROR_=0 _N_=3
1) i=WORK.DS1 x=3 y=3 _ERROR_=0 _N_=4
2) i=WORK.DS2 x=4 y=. _ERROR_=0 _N_=4
3) i=WORK.DS2 x=4 y=4 _ERROR_=0 _N_=4
1) i=WORK.DS2 x=4 y=4 _ERROR_=0 _N_=5
2) i=WORK.DS2 x=5 y=4 _ERROR_=0 _N_=5
3) i=WORK.DS2 x=5 y=4 _ERROR_=0 _N_=5
1) i=WORK.DS2 x=5 y=4 _ERROR_=0 _N_=6
2) i=WORK.DS2 x=6 y=4 _ERROR_=0 _N_=6
3) i=WORK.DS2 x=6 y=4 _ERROR_=0 _N_=6
1) i=WORK.DS2 x=6 y=4 _ERROR_=0 _N_=7

Up to _N_ = 3 everything is ok.

In the "2)" for _N_ = 4 so the first place where values form DS2 were used y was set to missing. Since it was missing the IF condition was true so value was replaced by 4. Since Y came from a SAS dataset its value is automatically retained (see "1)" for _N_=5). Also since DS2 doesn't have y as a variable walues of y aren't overwritten.

Reversing:

set ds2 ds1;

will show you similar effect.

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Re: SET Statement with multiple datasets and missing columns

Registration is open

SAS Training: Just a Click Away