12-28-2015 07:31 AM
I would like to know which is the best strategy in a ETL incremental load.
I have a CDC system that capture changes that I read, validate, transform and load in the target DB.
These changes belong to a period, all the rows that I treat correspond to a month (for example in Dec I treat November rows, and in January I will treat Decemeber rows).
For example in December, I am treating November rows and for the loading I have to optines:
1) In every ETL incremental load delete all rows of Nov in the target DB and insert the new rows.
- delete from targerdwb where sales_date >= 01/11/2015 and sales_date <= 30/11/2015
- insert CDC rows
2) In every incremental load I detect the rows that are news and the rows have to update.
In order to detect inserts or updates I have to make a join between rows to load and rows in the target DB.
Once detected inserts or updates (flag) I wil execute the SQL statements
I suppose the strategy depends in the volumen of the target DB and the volumen of rows in the CDC capture.
More or less the rows in the CDC are 20% at rhe beggining of the period and 5% at the end.
I will do performance tests to know the faster way of do the load, but I would like to know if there is any other thing
that will be usefull to take in consideration or if there are any other way to do the incremenrtal load.
Any advices about ETL incremental load will be usefull
Thanks in advance
12-28-2015 11:51 AM
Nice design issue!
But before giving any advice, just a few classification questions.
Then I'm not really following you on the 20 to 5%. Period means month, right? Do you mean that you load November multiple times in December?
From a strict performance perspective, I would partition the target table by month so you can drop it with almost no performance cost.
On the other hand, if data for a business key changes throughout the month, you perhaps should store all versions for an business key. If so, your are looking at a SCD Type 2 load style.
12-28-2015 12:04 PM
Thanks Linush from your answer
In the CDC records I have transactions, in my target table is a detail DW layer partitioned by month
My CDC records are 20% of the target table in the beggigng of december (my firsts etl process of november data)
and 5% at the end of december ( my lasts etl process of november data)
I think the best strategy to load the DW layer is insert/update, previsoly have to mark 'updates records' and 'insert records'
I have otther DW table that is aggregated from DW detailed layer. I think I will load it from the DW detailed layer, no from CDC records..., becuase to load aggegrated table from CDC records can be dificult...
Any advice will be greatly appreciated
12-28-2015 12:19 PM
And I would suggest that you keep the old versions of the updated records, since I guess that you do some reporting of the November records in all through December?
CDC would make this easy for you. Records marked as updates should first update the current record in the target table as "closed", preferably with the date/datetime -1 for the update record (this columns is usually named valid_to_dt/dttm). What update method to chose depends on your target RDBMS.
Then just insert both inserts and update records. Use the CDC timestamp as valid_from_dt/dttm. For valid_to_dttm use a high date/datetime.
If you map your CDC source to the target kinda 1-1, there should be no need to use the SCD Type Loader, CDC have already figured out what and how to update.