BookmarkSubscribeRSS Feed

Using the best of delta updated DW tables and snapshot retrieval of data

Started ‎01-09-2015 by
Modified ‎10-05-2015 by
Views 4,382

Did you ever want to be able to produce a lot of historic data for your users, and at the same time not have to administrate a huge amount of data?

 

Delta updated tables in your DW is a very effective way to preserve disk space, but not that easy for everybody to retrieve data from later on.

 

Snapshot data is a lot easier to access for the end user, whether it be the programmer or the analyst.

Snapshot data is every piece of data layed out every day, making it easy to pick all your data at a specific date.

It's redundant as hell and some database administrators are not that pleased with this approach.

 

That's why I took the best from both worlds and combined it into a tool that preserves space and makes it easy to retrieve your data.

 

It gives a little extra work on the data preparation side, but it's actually quite easy to handle.

 

In the attached Excel document there is a simple example of how to model your DW tables and create a snapshot view on top of them, making it easy to retrieve data based on specific dates.

 

The table CUSTOMER_BASE is the base table holding all the information about the customers.

It has an ID column that is unique for every record in the table.

The FROMDATE column tells us from what date the actual record apply.

It also has a column names SOCSECNO (social security number) with an index. This is used for selecting data in the example.

 

The table CUSTOMER_SNAP is the actual snapshot table, holding a reference to the base data for a customer, data that applies at the date found in the DATE column.

This table holds a record for each customer for each and every date up until now.

You have to make a small batch program/job that creates this table from the BASE table, and when adding data to the BASE table you have to make some housekeeping in the SNAP table.

 

On top of these two tables put a view that joins the table on the ID column.

Now your users can easily access data from a delta updated DW table using the snapshot access principle.

Comments

Very nice post, powerful and simple!

A code sample for the creation of the snap table would make it perfect.

Have you tested this on large dimensions/tables? How is this helper table affecting performance?

I have not tested it on very large tables yet but will do so in the coming week. I will publish the results here. The test will include joins of views.

I would really like to have access to the pdf but for some reason when I try to download the pdf files from this web site I only get a single page in the pdf file. Can I get the pdf from other place?

I'm only seeing one page of the PDF too, CTorres .

bl@jyskebank.dk, is that the whole file?

Anna

There is only one page in the PDF. I had some problems using this forum with IE9, so switched to Firefox. It solved my problems using the forum. In the text I mentioned that an Excel file is attached. It is of course a PDF file. It has only one page. I'm conducting performance analysis right now to see how it performs when joining tables with millions of records. The test is carried out on Oracle.

When the test has finished I will publish results and sample code.

OK, sounds good! Thanks for clarifying and posting your results when ready.

Anna

Got held up by some Oracle space problems and other activities. Cannot guarantee a date for the results, but I will be back as soon as the tests are finished. At that time I have some more information on how to potentially use this setup for a near-real-time Data Warehouse.

One step further...

We have conducted some performance tests on Oracle, and it shows okay performance.

We made a couple of huge Oracle tables and put views on top of them one-to-one, and put a joined view on top of them as well.

When opening an Oracle view directly from SAS Foundation in ViewTable it responds immediately showing the contents of the view.

When opening a joined Oracle view on the huge tables, it takes some minutes before it responds.

This is because Oracle has to join the tables and that takes time.

When doing SQL or datastep queries on the joined view it responds okay / fast when using a where clause.

No unexpected bad performance.

We tested on billions of records in the tables. Of course the Oracle server setup means something, but I have no details on that.

We are considering partitioning the tables in Oracle for even better performance, but for now we are not doing that.

Next step is implementing this setup in our Data Warehouse. We are currently working on that.

This prepares us for doing near-real-time updates in the Warehouse.

We could query (service requests) any changes in customers basic information every hour, and in return get a few hundred or thousand records.

We would then process these few records by inserting them directly into the delta table and adjust the time table accordingly.

The gives us the opportunity to hold historic data in the table setup.

Our programs can then access the data through the view using date specifications (where statements) or they can access the delta table directly if more advanced handling of data is necessary.

By using near-real-time updating we are transferring batch processing of large amounts of data at night to smaller amounts scattered over the day, thus eliminating the need for heavy batch processing at night. So we are potentially looking at reconfiguring BI servers because the work load will look different in the future.

More to come...

Thanks for the update!

Probably not more to come, I'm sorry...

I have switched fokus and I'm now working with CI and AI integration.

I hope the delta solution has given some inspiration as how to handle large amounts of data in a DWH setup.

Version history
Last update:
‎10-05-2015 03:58 PM
Updated by:

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags