ETL incremental load

juanvg1972 · Posted 12-28-2015 07:31 AM

Hi,

I would like to know which is the best strategy in a ETL incremental load.

I have a CDC system that capture changes that I read, validate, transform and load in the target DB.

These changes belong to a period, all the rows that I treat correspond to a month (for example in Dec I treat November rows, and in January I will treat Decemeber rows).

For example in December, I am treating November rows and for the loading I have to optines:

1) In every ETL incremental load delete all rows of Nov in the target DB and insert the new rows.

- delete from targerdwb where sales_date >= 01/11/2015 and sales_date <= 30/11/2015

- insert CDC rows

2) In every incremental load I detect the rows that are news and the rows have to update.

In order to detect inserts or updates I have to make a join between rows to load and rows in the target DB.

Once detected inserts or updates (flag) I wil execute the SQL statements

I suppose the strategy depends in the volumen of the target DB and the volumen of rows in the CDC capture.

More or less the rows in the CDC are 20% at rhe beggining of the period and 5% at the end.

I will do performance tests to know the faster way of do the load, but I would like to know if there is any other thing

that will be usefull to take in consideration or if there are any other way to do the incremenrtal load.

Any advices about ETL incremental load will be usefull

Thanks in advance

LinusH · Posted 12-28-2015 11:51 AM

Nice design issue!

But before giving any advice, just a few classification questions.

What is the nature of your CDC records? Do they represent tranasactions, or more presistent type of data (customer, account etc)?
What is the nature of your target DB? Is it a detail DW layer? What RDBMS are you using?

Then I'm not really following you on the 20 to 5%. Period means month, right? Do you mean that you load November multiple times in December?

From a strict performance perspective, I would partition the target table by month so you can drop it with almost no performance cost.

On the other hand, if data for a business key changes throughout the month, you perhaps should store all versions for an business key. If so, your are looking at a SCD Type 2 load style.

Data never sleeps

juanvg1972 · Posted 12-28-2015 12:04 PM

Thanks Linush from your answer

In the CDC records I have transactions, in my target table is a detail DW layer partitioned by month

My CDC records are 20% of the target table in the beggigng of december (my firsts etl process of november data)

and 5% at the end of december ( my lasts etl process of november data)

I think the best strategy to load the DW layer is insert/update, previsoly have to mark 'updates records' and 'insert records'

I have otther DW table that is aggregated from DW detailed layer. I think I will load it from the DW detailed layer, no from CDC records..., becuase to load aggegrated table from CDC records can be dificult...

Any advice will be greatly appreciated

LinusH · Posted 12-28-2015 12:19 PM

Sounds good.

And I would suggest that you keep the old versions of the updated records, since I guess that you do some reporting of the November records in all through December?

CDC would make this easy for you. Records marked as updates should first update the current record in the target table as "closed", preferably with the date/datetime -1 for the update record (this columns is usually named valid_to_dt/dttm). What update method to chose depends on your target RDBMS.

Then just insert both inserts and update records. Use the CDC timestamp as valid_from_dt/dttm. For valid_to_dttm use a high date/datetime.

If you map your CDC source to the target kinda 1-1, there should be no need to use the SCD Type Loader, CDC have already figured out what and how to update.

Data never sleeps

ETL incremental load

Re: ETL incremental load

Re: ETL incremental load

Re: ETL incremental load

ETL incremental load

Re: ETL incremental load

Re: ETL incremental load

Re: ETL incremental load

Registration is open