Solved: Set statement with tacit retain?

Kastchei · Posted 02-16-2017 12:21 PM

Hi all,

I don't think I'm understanding the set statement correctly when dealing with multiple set datasets. The issue seems to appear when a column that only exists in one of the datasets is reassigned a value. Here is what seems like a simple example.

data a;
	infile datalines truncOver;
	input v;
	datalines;
1
2
 
 
 
3
 
4
5
;
run;

data b;
	input w;
	wCopy = w;
	datalines;
1
2
3
4
5
;
run;

data AB;
 set a b;
 w2 = w;

 w = v;
run;

Variable v comes from dataset A only, and varaibles w and wCopy are identical and come from dataset B. In the resultant dataset AB, column w seems as expected, because we assigned w = v. Thus, w and v are identical now with the same values that came from v. The variable wCopy shows what w used to look like and is just included for reference.

It's variable w2 that seems quite odd. My expectation is that w2 would also be identical to wCopy, because we are setting w2 to the value of w before we are changing the value of w to equal v. However, this is only the case for the records coming from dataset B, where w was already defined. For the records in dataset A, where w did not exist, SAS seems to be retaining values of w from one record to the next. For the first record, w does not exist, so w2 is missing. Then w is set to v, so w for the first record now has a value of 1. However, for the second record, I would also expect w not to exist, but it appears that w actually now exists, the value 1 retained from the previous record. Since w = 1 (retained), now w2 is set to 1 and w is changed to v (2). For the third record, the retained w value is now 2, etc. This appears like a lag() function, but it seems to actually be retain, as seen in the second example.

Here is example two.

data a;
	infile datalines truncOver;
	input v;
	datalines;
1
2
 
 
 
3
 
4
5
;
run;

data b;
	format wCopy;
	input w;
	x     = w;
	wCopy = w;
	datalines;
1
2
3
4
5
;
run;

data ab2;
	set a b;

	if w = .
		then w = v;

	if v ^= .
		then x = v;

	if v ^= .
		then y = v;
		else y = w;

	if v ^= .
		then z = v;
		else z = wCopy;
run;

Variable v is untouched, and wCopy is the original values of w and x, which were identical in dataset B. v and wCopy appear just as they would if the data step only had the statement set A B. I can only make sense of the resultant dataset AB2 if there is somehow a tacit retain statement on w. Column z is the only one that acts like I would intuitively expect.

Variable w itself gets assigned to v only when w is missing. I would expect the results to be column z, essentially "patching" variable w with variable v; i.e. when w is missing, set it to v. Instead, it seems to work correctly only for the first record. w is missing, so it gets assigned the value of v (1). However, it appears that this value of 1 is then retained for all other records coming from dataset A. Thus for every other record from A, the condition w = . is false, because w has already been set to 1.

Variable x I would expect to just be replacing variable x with variable v, but only when v was not missing (essentially patching v with x). Again, this phantom retain statement seems to strike. For the records where v exists, all is as expected. But the now assigned value for x is getting retained to subsequent records. So even though I would expect a row with v missing to leave x untouched (essentially missing), x is getting the value retained from above.

Variable y is similar to x, except the else statement explicitly says to keep the value of w, but since w had been retained, those values are now patched into y.

Variable z is the only intuitive result. It equals v only when that if statement is met. Otherwise, it equals the original wCopy variable, which includes missing values.

I think I might understand, but could someone verify that this is correct? Say a SET statement has more than one dataset. One of these datasets, Q, is missing a column that the others have. SAS does not simply set this column to missing/null for the records coming from Q. Instead, SAS determines the value of this column for the first record from Q, and then retains that value all the way until the last record of Q. Usually it seems like all missings, because the first record is determined to be missing and then that is retained down the column.

Sorry for being so long-winded, but this really threw my whole SAS group for a loop this morning, and the behavior was very not expected.

Thanks!

Michael

mkeintz · Posted 02-16-2017 01:35 PM

Yes, some variables are always tacitly retained (but even that statement is incomplete).

The program data vector (PDV) can be subdivided into two kinds of variables:

New vars - i.e. vars newly defined within the data step. In your case W2 is such a variable. New vars (in the absence of a retain statement) are reset to missing at the beginning of each iteration of the data step.
"Old" vars - i.e. var read in via a SET (or MERGE or UPDATE) statement. These vars are automatically retained with each new iteration. I.e. they are NOT reset to missing. They are only changed when the appropriate SET/MERGE/UPDATE statement is encounted.

As an experiment run these two programs, using modified datasets A and B. The data modification assigns variable order=1 to 9 for the 9 records in A, and order=101 to 105 for the observations in B. Both programs will result in a dataset in all the A records precede all the B records.

The first program is identical to your program, except that it reports the value of W at three different points in the data step.

data a;
	infile datalines truncOver;
	input v;
	order=_n_;
	datalines;
1
2
 
 
 
3
 
4
5
;
run;

data b;
	input w;
	wCopy = w;
	order=_n_+100;
	datalines;
1
2
3
4
5
;
run;

data AB;
 put _n_=z2. ' Before set A B: ' w= +2 @;
 set a b;
 put 'After set: ' w= +2 @;
 w2 = w;

 w = v;
 put 'After reassignment: ' w=;
run;

The first 2 lines of the log includes this:

_N_=01 Before set A B: w=. After set: w=. After reassignment: w=1

_N_=02 Before set A B: w=1 After set: w=1 After reassignment: w=2

For _N_=2 notice that "before set that W=1, which was inherited (i.e. retained) from the "After reassignment" value produced in _N_=1.

This is an explicit example of your observation that variable values from SET are automatically retained,

Now the "after set" value is unchanged in _N_=2. Take a look at the analogus results for _N_=9 and 10 (last A rec and first B rec):

_N_=09 Before set A B: w=4 After set: w=. After reassignment: w=5

_N_=10 Before set A B: w=5 After set: w=1 After reassignment: w=.

Again we see the "after reassignment" value from _N_=9 retained at the start of _N_=10. But this time the "after set" value is changed, becuase it's the first record from B.

NOW ... consider the example below, which is identical except for a "by order" statement.

data AB2;
 put _n_=z2. ' Before set A B: ' w= +2 @;
 set a b;
 by order;
 put 'After set: ' w= +2 @;
 w2 = w;

 w = v;
 put 'After reassignment: ' w=;
run;

If you look at dataset AB2, you'll find that WCOPY=W2 as you originally expected.

So why does adding "by order" make AB2 different from AB? Because without the BY statement, SAS knows that it will process all the B records only after all the A records, which in turn appears to mean that the SET statement will not need to reset "dataset-B-only" variables while reading dataset A. But introduction of the BY statement means that SAS at all times has to have in its buffer a record from both A and B, so that the "by order" statement can be honored. This in turn appears to mean that the SET statement will reset ALL vars from all files, even when the record-in-hand originates from only one data set.

Proof? Look at the log now:

_N_=01 Before set A B: w=. After set: w=. After reassignment: w=1

_N_=02 Before set A B: w=1 After set: w=. After reassignment: w=2

This time, for _N_=2 (an observation coming from dataset A), W is reset to missing after the SET statement, even though it was retain as W=1 prior to the set statement.

Conclusion: The BY statement changes the set of vars modified by a SET statement.

concatenated all the B records

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Astounding · Posted 02-16-2017 12:47 PM

Michael,

Your next-to-last paragraph sums it up very nicely. In a nutshell ...

Any variable that comes from a SAS data set is automatically retained. It doesn't matter whether the data set is brought in with SET, MERGE, or UPDATE.

When should the software re-set a retained variable automatically? In the case of a SET statement that references multiple data sets, the re-sets occur only on the first observation being read in from each data set. (If a BY statement is being used, it gets a bit more complex.)

Confirmed!

mkeintz · Posted 02-16-2017 01:35 PM

Yes, some variables are always tacitly retained (but even that statement is incomplete).

The program data vector (PDV) can be subdivided into two kinds of variables:

New vars - i.e. vars newly defined within the data step. In your case W2 is such a variable. New vars (in the absence of a retain statement) are reset to missing at the beginning of each iteration of the data step.
"Old" vars - i.e. var read in via a SET (or MERGE or UPDATE) statement. These vars are automatically retained with each new iteration. I.e. they are NOT reset to missing. They are only changed when the appropriate SET/MERGE/UPDATE statement is encounted.

As an experiment run these two programs, using modified datasets A and B. The data modification assigns variable order=1 to 9 for the 9 records in A, and order=101 to 105 for the observations in B. Both programs will result in a dataset in all the A records precede all the B records.

The first program is identical to your program, except that it reports the value of W at three different points in the data step.

data a;
	infile datalines truncOver;
	input v;
	order=_n_;
	datalines;
1
2
 
 
 
3
 
4
5
;
run;

data b;
	input w;
	wCopy = w;
	order=_n_+100;
	datalines;
1
2
3
4
5
;
run;

data AB;
 put _n_=z2. ' Before set A B: ' w= +2 @;
 set a b;
 put 'After set: ' w= +2 @;
 w2 = w;

 w = v;
 put 'After reassignment: ' w=;
run;