SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

ETL incremental load

Reply
Frequent Contributor
Posts: 122

ETL incremental load

Hi,

 

I would like to know which is the best strategy in a ETL incremental load.

I have a CDC system that capture changes that I read, validate, transform and load in the  target DB.

These changes belong to a period, all the rows that I treat correspond to a month (for example in Dec I treat November rows, and in January I will treat Decemeber rows).

 

For example in December, I am treating November rows and  for the loading I have to optines:

 

1) In every ETL incremental load delete all rows of Nov in the target DB and insert the new rows.

- delete from targerdwb where sales_date >= 01/11/2015 and sales_date <= 30/11/2015

- insert CDC rows

 

2) In every incremental load I detect the rows that are news and the rows have to update.

In order to detect inserts or updates I have to make a join between rows to load and rows in the target DB.

Once detected inserts or updates (flag)  I wil execute the SQL statements

 

I suppose the strategy depends in the volumen of the target DB and the volumen of rows in the CDC capture.

More or less the rows in the CDC are 20% at rhe beggining of the period and 5% at the end.

 

I will do performance tests to know the faster way of do the load, but I would like to know if there is any other thing

that will be usefull to take in consideration or if there are any other way to do the incremenrtal load.

 

Any advices about ETL incremental load will be usefull

 

Thanks in advance

Super User
Posts: 5,424

Re: ETL incremental load

Posted in reply to juanvg1972

Nice design issue!

But before giving any advice, just a few classification questions.

  • What is the nature of your CDC records? Do they represent tranasactions, or more presistent type of data (customer, account etc)?
  • What is the nature of your target DB? Is it a detail DW layer? What RDBMS are you using?

Then I'm not really following you on the 20 to 5%. Period means month, right? Do you mean that you load November multiple times in December?

 

From a strict performance perspective, I would partition the target table by month so you can drop it with almost no performance cost.

On the other hand, if data for a business key changes throughout the month, you perhaps should store all versions for an business key. If so, your are looking at a SCD Type 2 load style.

 

Data never sleeps
Frequent Contributor
Posts: 122

Re: ETL incremental load

Thanks Linush from your answer

 

In the CDC records I have transactions, in my target table is a detail DW  layer partitioned by month

 

My CDC records are 20% of the target table in the beggigng of december (my firsts etl process of november data)

and 5% at the end of december ( my lasts etl process of november data)

 

I think the best strategy to load the DW layer is insert/update, previsoly  have to mark 'updates records' and 'insert records'

 

I have otther DW table that is aggregated from DW detailed layer. I think I will load it from the DW detailed layer, no from CDC records..., becuase to load aggegrated table from CDC records can be dificult...

 

Any advice will be greatly appreciated

Super User
Posts: 5,424

Re: ETL incremental load

Posted in reply to juanvg1972

Sounds good.

And I would suggest that you keep the old versions of the updated records, since I guess that you do some reporting of the November records in all through December?

 

CDC would make this easy for you. Records marked as updates should first update the current record in the target table as "closed", preferably with the date/datetime -1 for the update record (this columns is usually named valid_to_dt/dttm). What update method to chose depends on your target RDBMS.

 

Then just insert both inserts and update records. Use the CDC timestamp as valid_from_dt/dttm. For valid_to_dttm use a high date/datetime.

 

If you map your CDC source to the target kinda 1-1, there should be no need to use the SCD Type Loader, CDC have already figured out what and how to update.

 

Data never sleeps
Ask a Question
Discussion stats
  • 3 replies
  • 609 views
  • 1 like
  • 2 in conversation