HI, I am having an issue combining datasets and I am looking for help as to what is going on.
I have two datasets, all fields are character.
get_test1_nodup has the following data:
Spot_2 | Spot_4 | Spot_3 |
1 | 7 | 2 |
1 | 7 | 1 |
1 | 1 | 1 |
1 | 1 | 2 |
get_test2_nodup has the following data:
orig_identifier | identifier | Spot_2 | Spot_4 | Spot_3 |
|
| 1 | 7 | 1 |
112 | 112 |
|
|
|
111 | 111 |
|
|
|
|
| 1 | 1 | 2 |
|
| 1 | 1 | 1 |
I combine the datasets with the following code:
data combine;
set get_test1_nodup
get_test2_nodup
;
if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
run;
Here is the resulting dataset:
orig_identifier | identifier | Spot_2 | Spot_4 | Spot_3 |
| 172 | 1 | 7 | 2 |
| 172 | 1 | 7 | 1 |
| 172 | 1 | 1 | 1 |
| 172 | 1 | 1 | 2 |
| 171 | 1 | 7 | 1 |
112 | 112 |
|
|
|
111 | 111 |
|
|
|
| 112 | 1 | 1 | 2 |
| 111 | 1 | 1 | 1 |
The identifier field did not compute as I intended. Take the second row for example, identifier should be 171, not 172. What is causing this? Furthermore, where is it getting the 172 value its putting in the field? I have resolved this by combining the datasets first and then creating another dataset with the if statement, but why can't it be done in one step? I'd like to understand what is going on here.
Because IDENTIFIER is coming from one of the input datasets it is RETAINed (that it it is NOT reset to missing at the start of the data step iteration). But it is NOT in TEST1. So when you read a new observation from TEST1 the value does not change. So the test in your IF statement fails after the first observation from TEST1 since IDENTIFIER is no longer blank.
You could create a NEW variable.
new_identifier = coalescec(identifier, cats(Spot_2,Spot_4,Spot_3));
Or use IN= dataset option variables to drive your IF/THEN statement.
data combine;
set get_test1_nodup (in=in1)
get_test2_nodup (in=in2)
;
if in1 then identifier=Spot_2||Spot_4||Spot_3;
run;
Because IDENTIFIER is coming from one of the input datasets it is RETAINed (that it it is NOT reset to missing at the start of the data step iteration). But it is NOT in TEST1. So when you read a new observation from TEST1 the value does not change. So the test in your IF statement fails after the first observation from TEST1 since IDENTIFIER is no longer blank.
You could create a NEW variable.
new_identifier = coalescec(identifier, cats(Spot_2,Spot_4,Spot_3));
Or use IN= dataset option variables to drive your IF/THEN statement.
data combine;
set get_test1_nodup (in=in1)
get_test2_nodup (in=in2)
;
if in1 then identifier=Spot_2||Spot_4||Spot_3;
run;
In other words:
Because IDENTIFIER is coming from one of the input data sets it is RETAINed, and only reset when read again from a data set. Since it is not present in TEST1, it is never reset as long as TEST1 is read. And since it takes a value on the first iteration, it keeps this value until it is read from TEST2 and is not blank.
You can create a new variable, or you can test the source:
data COMBINE;
set TEST1 (in=TEST1)
TEST2 ;
if TEST1 | IDENTIFIER = ' ' then IDENTIFIER=catt(SPOT_2,SPOT_4,SPOT_3) ;
run;
I think i'm starting to understand. Why does it retain the value by default? I thought that was the job of the retain statement? Would you elaborate on why it satisfies the if condition on the first obs, but not on any subsequent records on the first dataset? I dont see how SAS sees it as blank for only the first record.
All data set variables are retained until overwritten by another data set value. That's the way a data step works.
Since TEST1 doesn't overwrite the value, IDENTIFIER stays untouched across iterations. IDENTIFIER is changed by your code when its value is blank. Thereafter it's not blank so the test fails and it remains as is.
Why does it satisfy the if statement on the first observation?
So on the first obs it is blank, but all lines after that they are missing (therefore not satisfying the if statement)? I guess i'm not understanding why the first one it evaluates as blank, but after that they are evaluated as missing.
> I guess i'm not understanding why the first one it evaluates as blank, but after that they are evaluated as missing.
blank and missing are the same thing for a string.
I just read the first few pages of the sas paper on the program data vector and reread all of these responses 5 times.
My understanding of the dataset creation is as follows:
1. The data step initializes and creates a list of all variables from both test1 and test2.
2. In the PDV, all variables are initialized to missing(according to the sas document) and all variables are retained until another round of the by statement.
3. The first iteration of reading in values for each variable from test1 occurs. It pulls in all values for all matching variables, but because test1 does not have the identifier variable, it keeps its value of missing.
4. My if statement is processed, because it is missing, it combines spot 2, 3, 4.
5. The next iteration of the datastep is executed, because identifier is not on the first dataset, it retains the value from the previous step due to the automatic retain in step 2 above.
6. This loop continues until identifier has a value, in this case from the test2 dataset.
The paper i am referencing(page 3):
http://support.sas.com/resources/papers/proceedings13/125-2013.pdf
Do i understand the process correctly?
That is how data steps work. You normally don't notice because most steps are in the form:
data new;
set old;
So the first thing the iteration does is read in the data from the old dataset.
This is why one to many merges can work. The observation from the dataset with just one observation is read the first time and then the values for the variables it is contributing stay the same until the next set of BY values are reached.
Apropos of @ChrisNZ 's explanation, you could easiliy fix with an explict OUTPUT statement followed by a CALL MISSING.
data combine;
set get_test1_nodup
get_test2_nodup ;
if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
output;
call missing(of _all_);
run;
Or even
data combine;
call missing(of _all_);
set get_test1_nodup
get_test2_nodup ;
if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
run;
@ChrisNZ wrote:
Or even
data combine; call missing(of _all_); set get_test1_nodup get_test2_nodup ; if identifier ='' then identifier=Spot_2||Spot_4||Spot_3; run;
That cannot work, because of the way that the _ALL_ keyword is evaluated at the point it is seen, not after all of the data step has been compiled. In fact it won't even run.
1193 data combine; 1194 call missing(of _all_); ------- 252 ERROR 252-185: The MISSING subroutine call does not have enough arguments. 1195 set test1 test2 ; 1196 if identifier ='' then identifier=Spot_2||Spot_4||Spot_3; 1197 run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.