BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Hentiedp
Fluorite | Level 6

Hi Guys,

 

I've got a question about concentation of datasets and the new variables created during he same step. The problem is that if I update a variable that is in the one dataset but not the other. That the value is retained. I've managed to find a solution, but I'm not sure why this is happening to begin with.

 

The sample code is straight forward.

/*dummy data for testing*/

data g1;

  vals = 1.25;

  do i = 1 to 5;

    value = vals * 1;

    valsn = vals; *<-- Will not be in the second dataset.;

    output;

  end;

run;

data g2;

  vals = 1;

  do j = 1 to 7;

       vals = vals + j;

   output;

  end;

run;

 

/*Straight forward concatenation*/

data g3;

set g1(in=a) g2(in=b);

if b then do;

  if j in (1,3,5,7) then do;

    valsn = vals + (j*2);

  end;

end;

run;

proc print data=g3;

run;

 

The value for valsn is retained on all observation coming from g2 regardless. A simple fix is to initialize valsn before the (if j in (1,3,5,7) then do😉 eg: valsn = .; But I intriged to understand why this is happening to begin with. 

 

Thanks in advance!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

There are indeed situations like the concatenations you describe where the automatic retention of variables read with a SET statement can be somewhat perplexing. However, this has been standard SAS behavior for ages (or ever). It is described, for example, in the online documentation of the RETAIN statement, section "Redundancy": "variables that are read with a SET, MERGE, MODIFY or UPDATE statement" are among those whose "values are automatically retained from one iteration of the DATA step to the next." This is in fact a feature and it is probably used most in match-merging.

 

The important fact that @Astounding mentioned, "when you switch data sets, variables get reinitialized", seems to be harder to find in the documentation, but here it is: in the book Language Reference: Concepts. Quote: "SAS continues to read one observation at a time from the first data set until it finds an end-of-file indicator. The values of the variables in the program data vector are then set to missing, and SAS begins reading observations from the second data set ..."

 

The 2012 conference paper The Use and Abuse of the Program Data Vector (23 pages) takes a detailed look at the internal workings of the data step. In the section "Multiple data sets in the set statement" the author presents an example similar to yours. As far as I see, his argument is wrong where he tries to explain what happens with the first observation of the second dataset! He incorrectly states that a value "remains from the previous data step iteration." I think his own example data and code prove him wrong at this point. So, you see, you're not alone with your confusion. Smiley Happy

 

In your second example VALSN is created with an assignment statement, hence it is not retained (like "all variables that are created with INPUT or assignment statements", as the online doc [see first link above] describes), unless you request this by using a RETAIN statement.

 

View solution in original post

6 REPLIES 6
kannand
Lapis Lazuli | Level 10

You have coded it that way

set g1(in=a) g2(in=b);
if b then do;
 
Kannan Deivasigamani
Hentiedp
Fluorite | Level 6

What I'm struggling to understand is why the value of VALSN is retained in the concatenation and only when VALSN exists in one of the datasets. If VALSN is created in the concatenation part then the value is not retained and acts as I would expect it.

 

 example:

/*dummy data for testing*/

data g1;

vals = 1.25;

do i = 1 to 5;

value = vals * 1;

* valsn = vals; *<-- Will not be in the second dataset.;

output;

end;

run;

data g2;

vals = 1;

do j = 1 to 7;

vals = vals + j;

output;

end;

run;

/*Straight forward concatenation*/

data g3;

set g1(in=a) g2(in=b);

if b then do;

if j in (1,3,5,7) then do;

valsn = vals + (j*2);

end;

end;

run;

proc print data=g3;

run;

 

The valsn is no retained.

Obs VALS I VALUE J VALSN

1 1.25 1 1.25

5 1.25 5 1.25

6 2.00 1 4

7 4.00 2

8 7.00 3 13

9 11.00 4

10 16.00 5 26

11 22.00 6

12 29.00 7 43

 

Where as if VALSN is created before the concatenation The result are:

Obs VALS I VALUE VALSN J

1   1.25 1 1.25  1.25

5   1.25 5 1.25  1.25

6   2.00         4.00  1

7   4.00         4.00  2

8   7.00         13.00 3

9  11.00         13.00 4

10 16.00         26.00 5

11 22.00         26.00 6

12 29.00         43.00 7

 

But the code I've got in place is applicable to: j in (1,3,5,7). Yet the values are being "retained"

kannand
Lapis Lazuli | Level 10

Sorry, I misread your post. I tried a variation of your code and my results are similar when you bring in both datasets together..... need to understand its behaviour....  Here is what I observed with the 2 scenarious I tested:

 

data d1;newvar = 0;run;

data d2;  
  do j = 1 to 5;       
     output;
  end;
run;

data d3;
set d2;
if j = 3 then newvar = 500;
run;

data d4;
set  d1 d2;
  if j = 3 then newvar = 500;
run;

proc print data=d3; title '==d3==';
proc print data=d4; title '==d4==';

 

Resulted in this output:

 

==d3==
Obs	j	newvar
1	1	.
2	2	.
3	3	500
4	4	.
5	5	.
==d4==
Obs	newvar	j
1	0	.
2	.	1
3	.	2
4	500	3
5	500	4
6	500	5

As you can see, in the first example, it presented missing values for j<> 3. However, the value was retained for j >= 3.   I need to dig in a bit and see what I can find....

Kannan Deivasigamani
FreelanceReinh
Jade | Level 19

There are indeed situations like the concatenations you describe where the automatic retention of variables read with a SET statement can be somewhat perplexing. However, this has been standard SAS behavior for ages (or ever). It is described, for example, in the online documentation of the RETAIN statement, section "Redundancy": "variables that are read with a SET, MERGE, MODIFY or UPDATE statement" are among those whose "values are automatically retained from one iteration of the DATA step to the next." This is in fact a feature and it is probably used most in match-merging.

 

The important fact that @Astounding mentioned, "when you switch data sets, variables get reinitialized", seems to be harder to find in the documentation, but here it is: in the book Language Reference: Concepts. Quote: "SAS continues to read one observation at a time from the first data set until it finds an end-of-file indicator. The values of the variables in the program data vector are then set to missing, and SAS begins reading observations from the second data set ..."

 

The 2012 conference paper The Use and Abuse of the Program Data Vector (23 pages) takes a detailed look at the internal workings of the data step. In the section "Multiple data sets in the set statement" the author presents an example similar to yours. As far as I see, his argument is wrong where he tries to explain what happens with the first observation of the second dataset! He incorrectly states that a value "remains from the previous data step iteration." I think his own example data and code prove him wrong at this point. So, you see, you're not alone with your confusion. Smiley Happy

 

In your second example VALSN is created with an assignment statement, hence it is not retained (like "all variables that are created with INPUT or assignment statements", as the online doc [see first link above] describes), unless you request this by using a RETAIN statement.

 

Hentiedp
Fluorite | Level 6

Thanks for the feedback. This is what I was looking for.

Astounding
PROC Star

It's automatic.  Any variable that comes from a SAS data set is automatically retained.

 

When you switch data sets, variables get reinitialized.  You could see this if you changed your logic to say:

 

if j in (2, 4, 6) then do;

 

Good luck.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1619 views
  • 6 likes
  • 4 in conversation