I've inherited some code that's using a DATA step UPDATE statement, but my DATA step skills are very rusty and I want to make sure I'm understanding it correctly (I'm more of a PROC SQL guy). The code has poor performance (largely because it's trying to move a bunch of data back and forth between SAS and Oracle) so I'm trying to convert the code to explicit pass-through SQL so I can keep all of the processing local to Oracle.
Here is a simplified version of the part of the code in question:
PROC SQL;
CONNECT USING LIB_DB1 AS OracDB;
CREATE TABLE LIB_OUT.Ref_Table_1 AS
SELECT *
FROM CONNECTION TO OracDB
(SELECT A.txn_id
,A.Column_1
,B.Column_2
FROM SCHEMA_A.TABLE_A A
INNER JOIN SCHEMA_A.TABLE_B B
ON A.KEY_A = B.KEY_B
ORDER BY A.txn_id
);
DISCONNECT FROM OracDB;
QUIT;
PROC SQL;
CONNECT USING LIB_DB2 AS OracDB;
CREATE TABLE LIB_OUT.Ref_Table_2 AS
SELECT *
FROM CONNECTION TO OracDB
(SELECT txn_id
,Column_1
FROM SCHEMA_B.TABLE_C
WHERE COLUMN_A = &var_A
);
DISCONNECT FROM OracDB;
QUIT;
PROC SORT DATA=LIB_OUT.Ref_Table_1 NODUPKEY;
BY txn_id;
RUN;
PROC SORT DATA=LIB_OUT.Ref_Table_2;
BY txn_id;
RUN;
DATA LIB_OUT.Ref_Table_3;
UPDATE LIB_OUT.Ref_Table_2 LIB_OUT.Ref_Table_1;
BY txn_id;
RUN;
My question is on the UPDATE statement on the final DATA step, if I understand it correctly after this step is run then Ref_Table_3 will have all of the data from Ref_Table_2, but will have an added column called Column_2 with missing values where there is no matching txn_id value in Ref_Table_1 and Ref_Table_2, and the value from Ref_Table_1 where there is a matching txn_id. Similarly, for Column_1, if there is no matching txn_id between Ref_Table_1 and Ref_Table_2, then whatever was originally was in Ref_Table_2 will remain, but where there is a matching txn_id Column_1 will be updated to whatever value is in Ref_Table_1 (or left the same if they match).
Is my understanding correct or have I missed something?
You have most of it.
When there is a match, but when Table_1 has a missing value for Column_1, the missing value is ignored. The value from Table_2 remains in place.
The first data set mentioned in the UPDATE statement (Table_2 in this case) cannot have multiple records for the same TXN_ID. If it does, an error is generated. However, the second data set mentioned (Table_1 in this case) is permitted to have multiple records for the same TXN_ID. In that case, SAS outputs only one record per TXN_ID, once all the updates from Table_1 have been applied.
Because of that feature:
You have most of it.
When there is a match, but when Table_1 has a missing value for Column_1, the missing value is ignored. The value from Table_2 remains in place.
The first data set mentioned in the UPDATE statement (Table_2 in this case) cannot have multiple records for the same TXN_ID. If it does, an error is generated. However, the second data set mentioned (Table_1 in this case) is permitted to have multiple records for the same TXN_ID. In that case, SAS outputs only one record per TXN_ID, once all the updates from Table_1 have been applied.
Because of that feature:
@Astounding note "Why should NODUPKEY be applied to the second data set ..."
I'm not sure, but LIB_OUT.Ref_Table_1 is a result of (inner) join.
In some cases it may create duplicate keys, therefor the NODUPKEY is required.
Thanks @Astounding & @Shmuel!
Yeah, after I posted this I started trying to create some sample data sets to play around with and came across that problem with multiple records. Ignoring that issue you can pretty much duplicate the results with COALESCE statements, but when there are multiple ID's it results in duplicate rows that don't show up with the UPDATE version. Luckily I'm pretty sure the source data is relatively clean and the multiple ID thing shouldn't be a real problem in this instance, which should simplify things substantially.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.