BookmarkSubscribeRSS Feed
bobbeli
Fluorite | Level 6

Hi,

 

I encountered a really scary behavior when setting two dataset together and the data step is including if statement which contains variables not present in the other dataset.

 

Here is an example case with dummy data:

 

data one;
var1=1;
var2=.;
run;

 

data two;
var1=1;
output;
var1=2;
output;
run;

 

data three;
set one two;
if var1 ne . and var2 = . then var2=var1;
run;

 

As you can see the var2 will get the value retained in rows coming from data two in the final data three.

 

I am not a PDV expert, but my assumption here is that SAS reads the data in to PDV and populates the values. Then it realizes that ok we need to populate missing variable for this dataset since it is used in later if statement. It does it by creating an 'internal' retain statement for var2.

So the PDV would look like this for the dataset two: var1=1 var2 (retain) =.

but then the if statement is processed and the condition will change the value of var2=1.

Then the PDV would look like var1=1 var2(retain)=1.

Then in the next step the condition will never be triggered since the var2 is not missing anymore given the retain statement.

This then leads to a situation where all following var2 values are populated as 1.

 

I tried this with real data and by setting the 'data two' first row to have var1=. to test whether my hypothesis is true. It indeed seems so since then the first row will get missing var2 and second row will get the var2 populated with var1 and then that value is retained till the end.

 

For me this is seems like a extremely critical bug, which is also not giving any note to the logs. So saying that this is the way PDV processing is set to operate is just not enough.

 

I am running SAS EG 7.13 with SAS 9.4.5 model version 16.01

 

 

6 REPLIES 6
PeterClemmensen
Tourmaline | Level 20

I think the answer is in the bottom of When Variable Values Are Automatically Set to Missing by SAS.

 

When variables are read with a SET, MERGE, or UPDATE statement, SAS sets the values to missing only before the first iteration of the DATA step. (If you use a BY statement, the variable values are also set to missing when the  changes.) The variables retain their values until new values become available (for example, through an assignment statement or through the next execution of the SET, MERGE, or UPDATE statement). Variables created with options in the SET, MERGE, and UPDATE statements also retain their values from one iteration to the next.

bobbeli
Fluorite | Level 6

Thanks for your reply. Indeed seems so, but as said in my original post it is not 'enough'. This seems just awfully terrible design to begin with, and would need fixing. This behavior would not be trivial to understand even if reading the docs section you pointed out, the text describing it is gibberish to at least beginner programmers. SET statement documentation should also have mention about this in every possible section with bolded text.

Quentin
Super User

Hi,

 

Agree, it's not a bug, but it is a REALLY scary result of the way the DATA step works.  After 50 years of behaving that way (I assume), they're not going to be able to change it.  But I remember the day I discovered this issue, and it terrified me.  It was one of those days where my heart stopped and I thought "I'm a fraud, and every SAS program I've ever written might be wrong."  Luckily I hadn't written that many programs at the time.  : )

 

Even though I understand the behavior, I still have a 'rule' of generally not recoding the value of any pre-existing variables in a step that reads in multiple data sets.  It's a bit of an overkill rule, but it's a thought.  When I intentionally break the rule, I pay attention.

 

My example was similar to yours:

 

data a;
  input id y ;
cards;
1 1
2 1
;
run;

data b;
 input id flag;
cards;
3 0
4 1
5 0
;
run;

data c;
  set a b;
  *because Y is retained, y=9 for id=5 even though flag=0 ;
  if flag=1 then y=9 ;
  put (id flag y)(=) ;
run;
19   data c;
20     set a b;
21     *because Y is retained, y=9 for id=5 even though flag=0 ;
22     if flag=1 then y=9 ;
23     put (id flag y)(=) ;
24   run;

id=1 flag=. y=1
id=2 flag=. y=1
id=3 flag=0 y=.  <---- Good, confirms PDV was re-initialized when step switched to reading from work.b
id=4 flag=1 y=9
id=5 flag=0 y=9  <---- oops!!!!

Note that SAS does initialize the PDV when it starts reading values from work.b, which is nice.  So it is smart enough to know I don't want to retain values from work.a and have them spill down into work.b. 

 

In order to avoid this problem, the DATA step would have to initialize the PDV at the start of every iteration.  Which is just not how it was designed to work.  I suppose if SAS wanted to implement this, it would have to be a new alternative to the SET statement with did not set the RETAIN flag in the PDV,  say SETNORETAIN, which might be an interesting thought exercise as to whether that would induce more mistakes than it avoids.

 

I had a boss and mentor who heavily relied on PROC SQL, because with SQL it's safe to think about logical sets, you don't have to think about the iterative nature of the DATA step and PDV and implied retains.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Astounding
PROC Star

Also note ... the documentation you are referring to does show the underlying principles involved.  But it doesn't even mention the example that you are encountering in your programming.  It talks about a BY statement, but your program doesn't have a BY statement.  So when data sets are being concatenated, the re-initialize-to-missing step takes place when beginning to read a new data set.

Kurt_Bremser
Super User

This is well documented andintended behavior, and also well known by those who dig into the mechanics of the data step. It is surely not a bug.

Your special case can be (and is used) to set default values while appending data.

Do not expect any change to this behavior, as it might break existing codes, and SAS is quite good at avoiding this.

 

bobbeli
Fluorite | Level 6

Ok fair enough so the majority seems to think this as a desired property. Could someone then please give an example where this would be handy instead of buggy?

Would also argue that the documentation of this should be way better, with an example case in SET statement documentation (and also in other places).

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 620 views
  • 7 likes
  • 5 in conversation