BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Sven111
Pyrite | Level 9

I've inherited some code that's using a DATA step UPDATE statement, but my DATA step skills are very rusty and I want to make sure I'm understanding it correctly (I'm more of a PROC SQL guy).  The code has poor performance (largely because it's trying to move a bunch of data back and forth between SAS and Oracle) so I'm trying to convert the code to explicit pass-through SQL so I can keep all of the processing local to Oracle.

 

Here is a simplified version of the part of the code in question:

 

PROC SQL;
    CONNECT USING LIB_DB1 AS OracDB;
    CREATE TABLE  LIB_OUT.Ref_Table_1 AS
        SELECT  *
        FROM    CONNECTION TO OracDB
               (SELECT      A.txn_id
                           ,A.Column_1
                           ,B.Column_2
                FROM        SCHEMA_A.TABLE_A    A
                INNER JOIN  SCHEMA_A.TABLE_B    B
                        ON  A.KEY_A = B.KEY_B
                ORDER BY    A.txn_id
                );
    DISCONNECT FROM OracDB;
QUIT;

PROC SQL;
    CONNECT USING LIB_DB2 AS OracDB;
    CREATE TABLE  LIB_OUT.Ref_Table_2 AS
        SELECT  * 
        FROM    CONNECTION TO OracDB
               (SELECT      txn_id
                           ,Column_1
                FROM        SCHEMA_B.TABLE_C
                WHERE       COLUMN_A = &var_A
                );
    DISCONNECT FROM OracDB;
QUIT;


PROC SORT DATA=LIB_OUT.Ref_Table_1 NODUPKEY;
    BY txn_id;
RUN;

PROC SORT DATA=LIB_OUT.Ref_Table_2;
    BY txn_id;
RUN;

DATA LIB_OUT.Ref_Table_3;
    UPDATE LIB_OUT.Ref_Table_2 LIB_OUT.Ref_Table_1;
    BY txn_id;
RUN;

 

My question is on the UPDATE statement on the final DATA step, if I understand it correctly after this step is run then Ref_Table_3 will have all of the data from Ref_Table_2, but will have an added column called Column_2 with missing values where there is no matching txn_id value in Ref_Table_1 and Ref_Table_2, and the value from Ref_Table_1 where there is a matching txn_id.  Similarly, for Column_1, if there is no matching txn_id between Ref_Table_1 and Ref_Table_2, then whatever was originally was in Ref_Table_2 will remain, but where there is a matching txn_id Column_1 will be updated to whatever value is in Ref_Table_1 (or left the same if they match).

 

Is my understanding correct or have I missed something?

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

You have most of it.

 

When there is a match, but when Table_1 has a missing value for Column_1, the missing value is ignored.  The value from Table_2 remains in place.

 

The first data set mentioned in the UPDATE statement (Table_2 in this case) cannot have multiple records for the same TXN_ID.  If it does, an error is generated.  However, the second data set mentioned (Table_1 in this case) is permitted to have multiple records for the same TXN_ID.  In that case, SAS outputs only one record per TXN_ID, once all the updates from Table_1 have been applied.

 

Because of that feature:

 

  • These actions are very difficult to reproduce in SQL.  It might be possible in this case, only because the second data set (Table_1) is guaranteed not to have multiple records for the same TXN_ID ... because of the sort with NODUPKEY.  You might have to use something like coalesce(Table_1.Column_1, Table_2.Column_1) to replicate the results.
  • The program itself is suspect.  Why should NODUPKEY be applied to the second data set (which is permitted to contain multiple records for the same TXN_ID), but not to the first data set (which is required to have no more than one record for the same TXN_ID).  There may be legitimate business answers to this question, but the program raises that question.

View solution in original post

3 REPLIES 3
Astounding
PROC Star

You have most of it.

 

When there is a match, but when Table_1 has a missing value for Column_1, the missing value is ignored.  The value from Table_2 remains in place.

 

The first data set mentioned in the UPDATE statement (Table_2 in this case) cannot have multiple records for the same TXN_ID.  If it does, an error is generated.  However, the second data set mentioned (Table_1 in this case) is permitted to have multiple records for the same TXN_ID.  In that case, SAS outputs only one record per TXN_ID, once all the updates from Table_1 have been applied.

 

Because of that feature:

 

  • These actions are very difficult to reproduce in SQL.  It might be possible in this case, only because the second data set (Table_1) is guaranteed not to have multiple records for the same TXN_ID ... because of the sort with NODUPKEY.  You might have to use something like coalesce(Table_1.Column_1, Table_2.Column_1) to replicate the results.
  • The program itself is suspect.  Why should NODUPKEY be applied to the second data set (which is permitted to contain multiple records for the same TXN_ID), but not to the first data set (which is required to have no more than one record for the same TXN_ID).  There may be legitimate business answers to this question, but the program raises that question.
Shmuel
Garnet | Level 18

@Astounding note "Why should NODUPKEY be applied to the second data set ..."

 

I'm not sure,  but   LIB_OUT.Ref_Table_1  is a result of (inner) join.

In some cases it may create duplicate keys, therefor the NODUPKEY is required.

 

Sven111
Pyrite | Level 9

Thanks @Astounding & @Shmuel!  

 

Yeah, after I posted this I started trying to create some sample data sets to play around with and came across that problem with multiple records.  Ignoring that issue you can pretty much duplicate the results with COALESCE statements, but when there are multiple ID's it results in duplicate rows that don't show up with the UPDATE version.  Luckily I'm pretty sure the source data is relatively clean and the multiple ID thing shouldn't be a real problem in this instance, which should simplify things substantially.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 749 views
  • 0 likes
  • 3 in conversation