DATA Step, Macro, Functions and more

DATA Step UPDATE Statement Question

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 83
Accepted Solution

DATA Step UPDATE Statement Question

[ Edited ]

I've inherited some code that's using a DATA step UPDATE statement, but my DATA step skills are very rusty and I want to make sure I'm understanding it correctly (I'm more of a PROC SQL guy).  The code has poor performance (largely because it's trying to move a bunch of data back and forth between SAS and Oracle) so I'm trying to convert the code to explicit pass-through SQL so I can keep all of the processing local to Oracle.

 

Here is a simplified version of the part of the code in question:

 

PROC SQL;
    CONNECT USING LIB_DB1 AS OracDB;
    CREATE TABLE  LIB_OUT.Ref_Table_1 AS
        SELECT  *
        FROM    CONNECTION TO OracDB
               (SELECT      A.txn_id
                           ,A.Column_1
                           ,B.Column_2
                FROM        SCHEMA_A.TABLE_A    A
                INNER JOIN  SCHEMA_A.TABLE_B    B
                        ON  A.KEY_A = B.KEY_B
                ORDER BY    A.txn_id
                );
    DISCONNECT FROM OracDB;
QUIT;

PROC SQL;
    CONNECT USING LIB_DB2 AS OracDB;
    CREATE TABLE  LIB_OUT.Ref_Table_2 AS
        SELECT  * 
        FROM    CONNECTION TO OracDB
               (SELECT      txn_id
                           ,Column_1
                FROM        SCHEMA_B.TABLE_C
                WHERE       COLUMN_A = &var_A
                );
    DISCONNECT FROM OracDB;
QUIT;


PROC SORT DATA=LIB_OUT.Ref_Table_1 NODUPKEY;
    BY txn_id;
RUN;

PROC SORT DATA=LIB_OUT.Ref_Table_2;
    BY txn_id;
RUN;

DATA LIB_OUT.Ref_Table_3;
    UPDATE LIB_OUT.Ref_Table_2 LIB_OUT.Ref_Table_1;
    BY txn_id;
RUN;

 

My question is on the UPDATE statement on the final DATA step, if I understand it correctly after this step is run then Ref_Table_3 will have all of the data from Ref_Table_2, but will have an added column called Column_2 with missing values where there is no matching txn_id value in Ref_Table_1 and Ref_Table_2, and the value from Ref_Table_1 where there is a matching txn_id.  Similarly, for Column_1, if there is no matching txn_id between Ref_Table_1 and Ref_Table_2, then whatever was originally was in Ref_Table_2 will remain, but where there is a matching txn_id Column_1 will be updated to whatever value is in Ref_Table_1 (or left the same if they match).

 

Is my understanding correct or have I missed something?


Accepted Solutions
Solution
‎06-01-2017 12:34 PM
Super User
Posts: 5,496

Re: DATA Step UPDATE Statement Question

You have most of it.

 

When there is a match, but when Table_1 has a missing value for Column_1, the missing value is ignored.  The value from Table_2 remains in place.

 

The first data set mentioned in the UPDATE statement (Table_2 in this case) cannot have multiple records for the same TXN_ID.  If it does, an error is generated.  However, the second data set mentioned (Table_1 in this case) is permitted to have multiple records for the same TXN_ID.  In that case, SAS outputs only one record per TXN_ID, once all the updates from Table_1 have been applied.

 

Because of that feature:

 

  • These actions are very difficult to reproduce in SQL.  It might be possible in this case, only because the second data set (Table_1) is guaranteed not to have multiple records for the same TXN_ID ... because of the sort with NODUPKEY.  You might have to use something like coalesce(Table_1.Column_1, Table_2.Column_1) to replicate the results.
  • The program itself is suspect.  Why should NODUPKEY be applied to the second data set (which is permitted to contain multiple records for the same TXN_ID), but not to the first data set (which is required to have no more than one record for the same TXN_ID).  There may be legitimate business answers to this question, but the program raises that question.

View solution in original post


All Replies
Solution
‎06-01-2017 12:34 PM
Super User
Posts: 5,496

Re: DATA Step UPDATE Statement Question

You have most of it.

 

When there is a match, but when Table_1 has a missing value for Column_1, the missing value is ignored.  The value from Table_2 remains in place.

 

The first data set mentioned in the UPDATE statement (Table_2 in this case) cannot have multiple records for the same TXN_ID.  If it does, an error is generated.  However, the second data set mentioned (Table_1 in this case) is permitted to have multiple records for the same TXN_ID.  In that case, SAS outputs only one record per TXN_ID, once all the updates from Table_1 have been applied.

 

Because of that feature:

 

  • These actions are very difficult to reproduce in SQL.  It might be possible in this case, only because the second data set (Table_1) is guaranteed not to have multiple records for the same TXN_ID ... because of the sort with NODUPKEY.  You might have to use something like coalesce(Table_1.Column_1, Table_2.Column_1) to replicate the results.
  • The program itself is suspect.  Why should NODUPKEY be applied to the second data set (which is permitted to contain multiple records for the same TXN_ID), but not to the first data set (which is required to have no more than one record for the same TXN_ID).  There may be legitimate business answers to this question, but the program raises that question.
Trusted Advisor
Posts: 1,553

Re: DATA Step UPDATE Statement Question

Posted in reply to Astounding

@Astounding note "Why should NODUPKEY be applied to the second data set ..."

 

I'm not sure,  but   LIB_OUT.Ref_Table_1  is a result of (inner) join.

In some cases it may create duplicate keys, therefor the NODUPKEY is required.

 

Frequent Contributor
Posts: 83

Re: DATA Step UPDATE Statement Question

Thanks @Astounding & @Shmuel!  

 

Yeah, after I posted this I started trying to create some sample data sets to play around with and came across that problem with multiple records.  Ignoring that issue you can pretty much duplicate the results with COALESCE statements, but when there are multiple ID's it results in duplicate rows that don't show up with the UPDATE version.  Luckily I'm pretty sure the source data is relatively clean and the multiple ID thing shouldn't be a real problem in this instance, which should simplify things substantially.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 139 views
  • 0 likes
  • 3 in conversation