Data Vault has over the last years becoming popular as modelling technique for data warehouse. My experience is that it has not been used widely (yet) within SAS implementations. So my questions are: Could you use SAS Data Integration for this? Is it recommended?
This article is targeted to audiences with little/no knowledge of Data Vault, nor SAS Data Integrations Studio. So bare with me if you feel that some parts seem abundantly clear.
SAS Data Integration Studio is an ETL tool offered by SAS Institute, and is a part of their Data Management portfolio. It lets you build and maintain metadata for databases, entities and jobs. It contains a set of standard transformations that help you with copy, map, transform and load your data. The metadata is the basis for deploying job code, which is in SAS Language. The SAS Language is a combination of:
Data Vault is a data modelling paradigm, first presented by Dan Linstedt. It is aimed for the atomic layer of a data warehouse. And it was developed to answer some of the problems that modelling a data warehouse using the Third normal form (3NF).
The simplest description on what Data Vault is, is to describe the three core types of entities:
There is also an expressed best practice that you can have two persistent layers in a data vault warehouse:
There are several benefits of using the Data Vault (DV) paradigm over 3NF. Some of them are (not limited to):
What I will refer to in this document is the loading part of a DW. Extract will probably be quite similar to an ordinary DW. For transformation, there is not much of it in a Core Data Vault, but a bit more in a Business Data Vault.
As you might have guessed from the what is described above, these are the main activities when loading a Data Vault:
This is a sample model that we will use for our example DI Studio load jobs. The different entities are in the de facto colouring standard. This is just a small part of the whole model, and just a subset of attributes are visible. So we can focus on principles, rather than details.
Customer and Account are core business concepts, and represented as hubs. They are mandatory.
The link represent a relation between an account and a customer. But a customer can have multiple account, and a an account can have (model wise) multiple customers connected to it.
The satellites is used to store attributes that describes the core business concepts, or the relationship (links).
How Do Can We Use SAS DI Studio for Loading A Data Vault?
We map 1-1 from a staging table that contain the business key for the business concept Customer. For private customers this is typically some kind of SSN, if available. The technical column Source System Code is being mapped here, but it could be done in the stage table if that seems more appropriate. The surrogate key and Load date time columns are generated (out-of-the box) in the SCD Type 1 transformation.
As an alternative, you could use the Surrogate Key Generator transformation. But it generates an output of all records (including existing), which you need to take care of separately in a succeeding Table Loader step. Surrogate Key Generator is a so called “Generated Transform” which means that you could customize it as an ETL developer by using SAS Language.
Here’s an example of loading a satellite table for Customer. The first step is to get the hub SK by using the business key in a Look-up. The SCD Type 2 Loader transformation handles versioning of data. You should not generate new SK in the satellite. This is because you should never have a Satellite referencing another table directly. Use the Customer hub SK as a business key in the SCD Type 2 transformation.
Loading a link needs typically two steps. First look-up the SK’s for the contributing hubs, by using their business keys.
Then use the SCD Type 1 Loader to manage creation of new link records, and the assigning them a link SK if needed. Why does the link need a SK? If you need to add some attributes to a link, they will be stored in satellite connected to the link. Then the Link SK is needed as FK in that satellite.
The separate Table Loader will capture those records with where there is no match in the key Lookup. You can chose in the Lookup how to handle these exceptions (move record to exceptions table, abort job etc).
For completeness, let me show me this case as well, since it differs a bit from loading a satellite connected to a hub. First you need to get the contributing hubs SK’s, by using respective business key. Then, by combining them you will be able to look-up the link SK.
And then, like a hub satellite, use SCD Type 2 Loader to handle versioning of the satellite attributes.
Of course you could/should have exception management in this job as well, but I omitted it to focus on the core components of the job.
SAS Data Integration Studio has no specific support for loading a Data Vault. But the Data Vault loading logic is simple, so Data Integration Studio standard transformations can be used as is. The three transformations SCD Type 1, SCD Type 2 and the Look-up transformations is enough to load a standard Data Vault model. If you have large volumes of non-changeable transactional data (topic not discussed here), the Table Loader transformation can be used for straight inserts/appends.