I have two datasets that I would like to concatenate. They have the same information, with the same formatting, but in variables with different names. I tried to combine them using the technique given in the example code below.
DATA TEST1; INPUT ID VALUE1; DATALINES; 1 0.5 2 0.6 3 0.7 4 . ; RUN; DATA TEST2; INPUT ID VALUE2; DATALINES; 5 0.4 6 0.5 7 0.3 8 . ; RUN; DATA TEST3; SET TEST1 TEST2; IF MISSING(VALUE1) THEN VALUE1 = VALUE2; RUN; PROC PRINT DATA = TEST3 NOOBS; RUN;
This is my output:
ID VALUE1 VALUE2
1 | 0.5 | . |
2 | 0.6 | . |
3 | 0.7 | . |
4 | . | . |
5 | 0.4 | 0.4 |
6 | 0.4 | 0.5 |
7 | 0.4 | 0.3 |
8 | 0.4 | . |
I tried a few tweaks to dataset TEST2 and it looks like SAS is simply assigning the first non-missing VALUE2 value to all of the rows after that non-missing value appears. For example, if ID 5 has VALUE2 = 0.5, then ID 5, 6, 7, and 8 will all be assigned VALUE1 = 0.5. If ID 5 has VALUE2 = . and ID 6 has VALUE2 = 0.5, then ID 5 will be assigned VALUE1 = . and ID's 6, 7, and 8 will be assigned VALUE2 = 0.5. (To be clear, my desired result is that VALUE1 is assigned the same value as VALUE2 for each of ID 5, 6, 7, 8, so that ID 5 has VALUE1 = 0.4, ID 6 has VALUE1 = 0.5, etc.)
For my original program, I was able to get the outcome I wanted by using the RENAME data set option instead of an if statement. But I want to understand why my code is working this way. It seems totally contrary to my understanding of the DATA step. How is SAS even able to assign the same value to all of the subsequent rows? Without a RETAIN statement, shouldn't everything except _N_ and _ERROR_ be reassigned to missing values before the next record is read? It seems impossible, which means I must be misunderstanding something fundamental about the DATA step.
DATA TEST3;
SET TEST1 TEST2;
value1 = coalesce(value1, value2);
RUN;
PROC PRINT DATA = TEST3 NOOBS; RUN;
Save it to a new variable solves the issue.
The reassigning to a variable that already exists is introducing an implicit retain. Changing the variable name fixes the issue. This at least gives you the workaround. I'll leave the full technical discussion to someone else.
Hi @EIL All variables read using a SET statement are automatically retained. In your code values of both value1 and value2 would indeed be retained unless you explicitly assign missing during earch iteration of the datastep.
Oh, I see, of course! That is very helpful information, thank you.
I'm still left confused, though. I don't understand why the IF/THEN statement would use the retained value instead of a value from the new data record. Also, shouldn't it be retaining the previous row's value each time, instead of assigning the same value to all of the subsequent records?
I tried experimenting with some other DATA statements to try to get a grip on what's happening here:
DATA TEST3;
SET TEST1 TEST2;
IF NOT MISSING(ID) THEN MYVARA = VALUE2;
MYVARB = VALUE2;
IF MISSING(VALUE1) THEN VALUE1 = VALUE2;
MYVARC = VALUE2;
RUN;
But I got the following results:
ID VALUE1 VALUE2 MYVARA MYVARB MYVARC
1 | 0.5 | . | . | . | . |
2 | 0.6 | . | . | . | . |
3 | 0.7 | . | . | . | . |
4 | . | . | . | . | . |
5 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 |
6 | 0.4 | 0.5 | 0.5 | 0.5 | 0.5 |
7 | 0.4 | 0.3 | 0.3 | 0.3 | 0.3 |
8 | 0.4 | . | . | . | . |
The variable assignments are working like they "should" for every situation except reassigning VALUE1. I'm still confused about what's different between the case where SAS assigns the first nonmissing VALUE2 value, and the cases where SAS assigns the "correct" VALUE2 value.
DATA TEST3;
SET TEST1 TEST2;
value1 = coalesce(value1, value2);
RUN;
PROC PRINT DATA = TEST3 NOOBS; RUN;
Save it to a new variable solves the issue.
The reassigning to a variable that already exists is introducing an implicit retain. Changing the variable name fixes the issue. This at least gives you the workaround. I'll leave the full technical discussion to someone else.
Variables sourced from datasets are always retained. That is why many to one merges work.
data decodes;
input code decode :$10. @@;
cards;
1 red 2 blue 3 green
;
data items;
input code id @@;
cards;
1 1 1 2 1 3
2 4 2 5
3 6 3 7
;
data want;
merge items decodes;
by code;
run;
Obs code id decode 1 1 1 red 2 1 2 red 3 1 3 red 4 2 4 blue 5 2 5 blue 6 3 6 green 7 3 7 green
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.