We’re smarter together. Learn from this collection of community knowledge and add your expertise.

LASR Changed Data Capture (CDC) Update

by SAS Employee UttamKumar on ‎09-05-2017 02:08 PM (749 Views)

So you need to refresh a LASR table. What’s the best way? If you can isolate the changes since the last LASR load, you don’t have enough memory to accommodate a SWAP-OUT update, and you have mostly inserts to do, it may be preferable to perform a CDC update as opposed to reloading the entire table.

 

This blog post discusses a mechanism for updating a very large LASR table using a Changed Data Capture (CDC) table, a single table with all of the insert/update/delete operations required to bring the target LASR table up to date.  

 

To apply CDC on a LASR table, the proposed mechanism uses the following steps:

  • Load the CDC table into LASR
  • Append the “Insert” rows from the CDC table to the target LASR table
  • Delete “Insert” rows from the CDC table
  • Update the target LASR table with the “Update” rows from the CDC table using a computed “_where_” column. (This step also updates the target LASR table with the ‘delete’ rows from the CDC table, identifying them for deletion).
  • Purge the "Delete" rows from the target LASR table.

 

 

The CDC Table

An example CDC table is shown below. It contains all of the fields of the target LASR table with the latest field values. The table also contains an opcode field with instructions for how to update the target table – insert (‘i’), update (‘u’), or delete (‘d’).  

 

TRANSACTION_TABLE

 

LASR_cdc_1.png

 

 

The OpCode field

If you have worked with CDC in the past, the ‘opcode’ on the CDC table is familiar. The operation code (opcode) indicates the type of action to be executed for the data row. In this case, however, we also include the opcode field on the target LASR table as well to facilitate deletes. When performing deletes on target table, we first identify the rows for deletion by changing the opcode field to “d” and then deleting any row with a “d”. The example target table, base_table, with the opcode field is shown below.  

 

 

BASE_TABLE (The target LASR table)

 

LASR_cdc_2.png

 

 

The SAS Code Steps

The CDC update code is described below. Prior to performing the update, the code sets options for LASR and defines LIBNAMEs for source data and LASR.

 

Step-1 loads the transaction_table (CDC) into LASR server from source environment.

 

Step-2 appends the data from transaction_table (CDC) to base_table (the target LASR table) where opcode =’i’.

 

Step-3 deletes and purges rows from transaction_table (CDC) where opcode =’i’. With the insert step completed, we no longer need the insert records.

 

Step-4 computes a new column ( _where_ ) on transaction_table (CDC) containing a WHERE clause expression. The new column is used by PROC IMTAT to update the base table. The WHERE clause expression is generated using the target table’s key columns, here ‘ID’. For example the computed _where_ field might contain the condition, “ID = 5” indicating that only the row with primary key value of 5 will be updated with the data values on the incoming row. The CDC table with the _where_ column computed is shown below.

 

LASR_cdc_3.png 

 

 

Step-5 updates base_table from transaction_table (CDC) based on the computed _where_ column expression. At this stage, transaction_table only contains opcode=’d’ and ‘u’ rows. Only base_table rows that match the WHERE condition will be updated. The column values in transaction_table overwrite the values in base_table for all matching columns. After this step, the update is done but the delete is only half done. All matching “d” rows from transaction_table have updated the corresponding base_table rows with an opcode of “d”. Thus they are identified for deletion.

 

Step-6 deletes and purges rows from base_table where opcode=’d’ (those identified for deletion in the previous step.  

 

 

Code:

OPTION SET=GRIDINSTALLLOC="./opt/sas/TKGrid";

OPTION SET=GRIDHOST="sasserver01";

libname ld "/opt/sas/data/AppalachianChocolate_datamart";

libname lasr1 sasiola port=10010 tag='HPS';

 

/** Step-1 **/

/** Load the CDC table to LASR**/

%vdb_dt(lasr1.transaction_table);

data lasr1.transaction_table;

set ld.transaction_table;

run ;

 

/** Step-2 **/

/**Append the CDC table ‘i’ records to the target LASR table **/

proc imstat;

table lasr1.base_table ;

set transaction_table ;

where opcode='i' ;

run;

 

/** Step-3 ***/

/**Remove the ‘i’ records from the CDC table**/

table lasr1.transaction_table;

where opcode='i' ;

deleterows / purge;

run;

 

/** Step-4 ***/

/**Create the _where_ field used for update**/

proc imstat;

table lasr1.transaction_table;

compute _where_ '_where_ = strip("id = " || id)';

run;

 

 

/** Step-5***/

/**Update the target LASR table with the ‘u’ and ‘d’ records**/

table lasr1.base_table ;

update data=lasr1.transaction_table;

run;

 

/** Step-6***/

/**Delete the target LASR table records with an opcode of ‘d’ (updated in the previous step**/

where opcode='d';

deleterows / purge;

run;

 

*Drop transaction table form LASR server;

%vdb_dt(lasr1.transaction_table);  

 

 

 

BASE_TABLE (The target LASR table) after code execution:

LASR_cdc_4.png

 

 

 

A Note on Performance

LASR update is a very CPU-intensive process and, in cases with large update requirements (e.g. thousands of rows), it’s better to use a SWAP-OUT update than a CDC update strategy. For reference, in our tests, we saw that a 1000 row update on a 10GB (36 Million row) LASR table was orders of magnitude slower than using a swap-out technique.

 

So, the valid use-cases for LASR CDC update are:

1. Very few updates (10s to 100s of rows) or

2. Memory constraints that won't accommodate the swap.    

 

Note: You can update the values of ORDERBY variables, but you cannot update the value of variables that are used for constructing partition keys.       

 

 

 

Reference documents IMSTAT Procedure

Contributors
Your turn
Sign In!

Want to write an article? Sign in with your profile.