Using SAS DI Studio To Load A Data Vault

12 Likes

Introduction

Data Vault has over the last years becoming popular as modelling technique for data warehouse. My experience is that it has not been used widely (yet) within SAS implementations. So my questions are: Could you use SAS Data Integration for this? Is it recommended?

This article is targeted to audiences with little/no knowledge of Data Vault, nor SAS Data Integrations Studio. So bare with me if you feel that some parts seem abundantly clear.

What is SAS Data Integration Studio?

SAS Data Integration Studio is an ETL tool offered by SAS Institute, and is a part of their Data Management portfolio. It lets you build and maintain metadata for databases, entities and jobs. It contains a set of standard transformations that help you with copy, map, transform and load your data. The metadata is the basis for deploying job code, which is in SAS Language. The SAS Language is a combination of:

ANSI SQL with SAS extensions
Vendor RDBMS specific SQL
Data step language
Extensive function library
Procedures for standard data and statistical tasks
Macro language to create parametrized and reusable code

What is Data Vault?

Data Vault is a data modelling paradigm, first presented by Dan Linstedt. It is aimed for the atomic layer of a data warehouse. And it was developed to answer some of the problems that modelling a data warehouse using the Third normal form (3NF).

The simplest description on what Data Vault is, is to describe the three core types of entities:

Hub: identifies a business concept. Contains only the business key for that concept (alongside with standard technical attributes such as Load Time, Source System Code and of course a surrogate key (SK)). No versioning – no valid from/to dates.
Link: establishes relationships between hubs. Always describes M-M, regardless of the nature of current real world relationship. No versioning.
Satellite: contains descriptive data; codes, names, numerical values, business dates etc. This data is always versioned. A Satellite can only have a relationship to one mother table: a hub or a link (less common).

There is also an expressed best practice that you can have two persistent layers in a data vault warehouse:

Core Data Vault: Reads all data, stores all data. No transformation other than maintaining business keys, and mapping to common column names and data types. Considered as the “fact” of the data warehouse
Business Data Vault: also an atomic layer, but is the result after applying common business and transformation rules. These could include calculations, common code structures, data quality initiatives, customer segmentation and other statistical models.

Why Use Data Vault?

There are several benefits of using the Data Vault (DV) paradigm over 3NF. Some of them are (not limited to):

Flexible: when adding new data sources to your DW, usually no need to redesign existing tables – just add new
Agile: the flexibility lets you build small portions in each sprint. Extensions to the model usually means separate Hubs and Links.
Predictable loading pattern. 3NF DW usually needs many waves to load, so it can maintain referential integrity. As an example, the SAS model for the Insurance industry needs 23 waves. A Data Vault only needs 3! How can that be - there is no exact corresponding version of the SAS Insurance industry model? Well because that a Data Vault you can always load the data in these three waves:

Hubs – create hub surrogate keys based on business keys
Links (need hub surrogate keys as FK, and potentially to build a link surrogate key), and Hub Satellites (need the Hub PK which will form the Satellite PK with a load timestamp or a valid from date/date time)
Link Satellites: need the link surrogate key, which will form the PK together with a date/date time stamp.

If a cardinality of a relationship changes, do nothing. The links are designed to handle 1-M, M-1 and M-M.
No need to tweak referential integrity by not promoting the whole PK as FK (usually the valid from date/date time is not promoted)

Data Vault and ETL

What I will refer to in this document is the loading part of a DW. Extract will probably be quite similar to an ordinary DW. For transformation, there is not much of it in a Core Data Vault, but a bit more in a Business Data Vault.

As you might have guessed from the what is described above, these are the main activities when loading a Data Vault:

Create surrogate keys
For hubs, based on a business key
For links, based on the surrogate keys (SK) from the participating hubs
Look-up surrogate keys
For links, the hub SKs
For hub satellites, the hub SK
For link satellites, the hub SK, then the link SK
Versioning, create generations of records when attributes change. This only done for satellites.

Sample Data Vault Model

This is a sample model that we will use for our example DI Studio load jobs. The different entities are in the de facto colouring standard. This is just a small part of the whole model, and just a subset of attributes are visible. So we can focus on principles, rather than details.

Customer and Account are core business concepts, and represented as hubs. They are mandatory.

The link represent a relation between an account and a customer. But a customer can have multiple account, and a an account can have (model wise) multiple customers connected to it.

The satellites is used to store attributes that describes the core business concepts, or the relationship (links).

How Do Can We Use SAS DI Studio for Loading A Data Vault?

Loading a Hub

We map 1-1 from a staging table that contain the business key for the business concept Customer. For private customers this is typically some kind of SSN, if available. The technical column Source System Code is being mapped here, but it could be done in the stage table if that seems more appropriate. The surrogate key and Load date time columns are generated (out-of-the box) in the SCD Type 1 transformation.

As an alternative, you could use the Surrogate Key Generator transformation. But it generates an output of all records (including existing), which you need to take care of separately in a succeeding Table Loader step. Surrogate Key Generator is a so called “Generated Transform” which means that you could customize it as an ETL developer by using SAS Language.

Loading a Satellite

Here’s an example of loading a satellite table for Customer. The first step is to get the hub SK by using the business key in a Look-up. The SCD Type 2 Loader transformation handles versioning of data. You should not generate new SK in the satellite. This is because you should never have a Satellite referencing another table directly. Use the Customer hub SK as a business key in the SCD Type 2 transformation.

Loading a Link

Loading a link needs typically two steps. First look-up the SK’s for the contributing hubs, by using their business keys.

Then use the SCD Type 1 Loader to manage creation of new link records, and the assigning them a link SK if needed. Why does the link need a SK? If you need to add some attributes to a link, they will be stored in satellite connected to the link. Then the Link SK is needed as FK in that satellite.

The separate Table Loader will capture those records with where there is no match in the key Lookup. You can chose in the Lookup how to handle these exceptions (move record to exceptions table, abort job etc).

Loading a Link Satellite

For completeness, let me show me this case as well, since it differs a bit from loading a satellite connected to a hub. First you need to get the contributing hubs SK’s, by using respective business key. Then, by combining them you will be able to look-up the link SK.

And then, like a hub satellite, use SCD Type 2 Loader to handle versioning of the satellite attributes.

Of course you could/should have exception management in this job as well, but I omitted it to focus on the core components of the job.

Conclusion

SAS Data Integration Studio has no specific support for loading a Data Vault. But the Data Vault loading logic is simple, so Data Integration Studio standard transformations can be used as is. The three transformations SCD Type 1, SCD Type 2 and the Look-up transformations is enough to load a standard Data Vault model. If you have large volumes of non-changeable transactional data (topic not discussed here), the Table Loader transformation can be used for straight inserts/appends.

References

Data Vault

http://danlinstedt.com/

http://hanshultgren.wordpress.com/

SAS Data Integration Studio

http://support.sas.com/software/products/etls/index.html

AnnaBrown · ‎10-22-2014

Thank you for posting this informative article, LinusH! I'm sure that the community will find it very useful.

Other members, was this helpful? Comment below or rate it using the "star" system.

PaulVanMol · ‎10-24-2014

Yes very helpful! I would include a sas package with metadata and a data model so DI Studio users can have an example data model with DI Jobs as example.

LinusH · ‎11-03-2014

How could I forget a data model when trying to explain data vault?:smileyblush::smileyconfused:

Thanks for the reminder!

Package may be available by direct request.

_thesasgeek · ‎11-14-2014

Many thanks for this post Linus.

I am struggling to find out how to PM you on this sure to ask for the package to be emailed so I can look at the example above in more detail. I've added you on LinkedIn so hopefully we can correspond on there.

DHeijink · ‎06-14-2017

This is very helpful! I would also like the package, so I can try it out myself in SAS DI.

How can I contact you on this site? Can't find the message option when I go to you profile.

AnnaBrown · ‎06-14-2017

Hi DHeijink,

I'm glad you found this article helpful! If you go to LinusH's profile page, you'll see a "send this user a private message" option on the right hand side.

Best,

Anna

DHeijink · ‎06-15-2017

Aah, how could I miss that! Thankyou AnnaBrown.