
SAS DI Studio and delta load


Hi SAS DI Studio experts,

I have an operational/transactional database that grows by about 10 million rows a year,
but changes to a few hundred thousand rows can happen on a single day.
Rows inserted years ago can also be corrected later. I need to find the DELTA using
SAS DI Studio to update my DWH (SAS data sets); I can't load the whole
Oracle database every evening. Oracle has a feature called materialized views, which keep
themselves up to date if configured correctly, but I don't see how to do the equivalent in SAS.


Any ideas?


Accepted Solutions
Posted 06-02-2016 03:24 AM

Re: SAS DI Studio and delta load

Hi Philip

 

You need a method to identify changed rows in the source table(s); there are several ways to do this - I won't go into all the complexities, but broadly:

1. use change/update timestamps on the source rows (if your application provides them) to select against - you'll also need to maintain or derive a "last DW extract" timestamp as your lower-bound value for the selection (a sketch of this follows the list)

2. use Oracle's CDC feature, which may need extra licensing and requires some setup by the DBA (see https://docs.oracle.com/cd/B28359_01/server.111/b28313/cdc.htm#i1025409 for details); if you use this, you should also be able to use the DI Studio CDC transformations, although I have to say I haven't used those myself

3. if your transactional system keeps a journal of changes, you could use it to build a 'delta for extract' set of keys to select against; again, you'll need to limit what you take from the journal by a time boundary

4. brute-force comparison - extract the whole table and compare (which may be inefficient, as you can imagine), using either PROC COMPARE or an MD5 checksum comparison (a second sketch follows the list)
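
Here's a minimal sketch of option 1, assuming a hypothetical source table ORDERS with an Oracle column LAST_UPDATE_TS and a one-row control data set ETL.EXTRACT_CONTROL that stores the previous high-water mark; librefs, credentials, and column names are illustrative only:

/* Build a quoted datetime literal from the control data set */
proc sql noprint;
   select cats("'", put(last_extract_dt, datetime20.), "'")
      into :last_extract trimmed
      from etl.extract_control;
quit;

/* Pull only rows changed since the last run; the filter runs in Oracle */
proc sql;
   connect to oracle (user=xxxx password=xxxx path=mydb);
   create table work.delta_rows as
      select * from connection to oracle
      ( select *
          from orders
         where last_update_ts >
               to_timestamp(&last_extract, 'DDMONYYYY:HH24:MI:SS')
      );
   disconnect from oracle;
quit;

/* Remember to advance the high-water mark in ETL.EXTRACT_CONTROL afterwards */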
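
And a minimal sketch of the MD5 variant of option 4, assuming a key column ORDER_ID, a fresh full extract in WORK.SRC_FULL, and an equivalent checksum data set built from the DW table; the hashed column list is purely illustrative:

/* Checksum each row of the fresh extract (repeat for the DW table as WORK.DW_HASH) */
data work.src_hash (keep=order_id row_md5);
   set work.src_full;
   length row_md5 $32;
   row_md5 = put(md5(catx('|', customer_id, status, amount)), $hex32.);
run;

/* Keys that are new or whose checksum changed */
proc sql;
   create table work.delta_keys as
   select s.order_id
     from work.src_hash s
     left join work.dw_hash d
       on s.order_id = d.order_id
    where d.order_id is null         /* new key            */
       or s.row_md5 ne d.row_md5;    /* changed attributes */
quit;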

 

Effort for 1 or 2 is moderate and both can be reasonably efficient (although 1 depends on how well the selection is processed). Effort for 3 is a bit more, but it is also fairly efficient - you'll need to avoid heterogeneous queries, though, by either preparing your extract key set in Oracle or loading the keys to a temporary table; otherwise you risk SAS trying to do all the work itself (see the sketch below).
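
As a rough illustration of the temporary-table route with SAS/ACCESS to Oracle - assuming a release recent enough for CONNECT USING, and with connection details and names made up for the example:

/* One global connection so the temporary table is visible to later steps */
libname orat oracle user=xxxx password=xxxx path=mydb
        connection=global dbmstemp=yes;

/* Upload the (small) key set into an Oracle temporary table */
data orat.delta_keys;
   set work.delta_keys;
run;

/* Join inside Oracle on the same connection instead of in SAS */
proc sql;
   connect using orat;
   create table work.delta_rows as
      select * from connection to orat
      ( select o.*
          from orders o
          join delta_keys k
            on o.order_id = k.order_id );
   disconnect from orat;
quit;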

Effort for 4 is also a bit more, and it can consume more resources, both run time and disk storage - but remember that a full-table extract can place less demand on the DBMS than a complex query, so option 4 *might* actually be the lighter load on the source system.

 

You should also think about whether and how the changes will be recorded in the DW table(s) - do you need to preserve past history ('type 2' change recording), or is your intention just to reduce the data extract volume?
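
If you do need history, DI Studio's SCD Type 2 Loader transformation will manage the effective-date columns for you; hand-coded, the idea looks roughly like this (table and column names hypothetical):

/* Close the open-ended row for each changed key */
proc sql;
   update dw.orders_hist
      set valid_to = datetime()
    where valid_to = '31DEC9999:00:00:00'dt
      and order_id in (select order_id from work.delta_keys);
quit;

/* Append the new versions as open-ended rows */
data work.new_versions;
   set work.delta_rows;
   valid_from = datetime();
   valid_to   = '31DEC9999:00:00:00'dt;
   format valid_from valid_to datetime20.;
run;

proc append base=dw.orders_hist data=work.new_versions force;
run;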

 

I hope this helps

 

Regards

Steve M



