BookmarkSubscribeRSS Feed

Making History with SAS Studio Flows: Managing Slowly Changing Dimensions in a Star Schema

Started ‎12-19-2023 by
Modified ‎12-19-2023 by
Views 304

 

SAS Studio flows are really starting to come on strong with support for a variety of Data Management operations with the recent addition of SAS Studio Engineer steps. This blog explains a design for a no-code SAS Studio flow that fully supports automation to update records in dimension tables and in related fact tables while implementing Slowly Changing Dimensions (SCD) for preservation of historical dimension records.

Background information on slowly changing dimensions and the Implement SCD step in SAS Studio

The Implement SCD allows you to save historical records for a given business entity in a dimensional table. Saving current and historical records in dimensional tables is referred to as slowly changing dimensions. As an example, in a customer dimension table you may want to update a customer's address but maintain a record for the customer's old address. This could be useful if you have a related CUSTOMERORDERS fact table and you want to relate older orders to the old customer address and new orders to the new address.

 

To store both historical and current records for a dimensional entity (such as customers) you can no longer use the unique identifier for each customer (let's call the identifier CUSTOMER_ID) as the primary key in the dimension table related to foreign keys in fact tables. The CUSTOMER_ID will remain in the table to identify unique customers but you also need to generate and store a unique surrogate key that is unique for each record in the slowly changing dimension table. You will also need to include that unique surrogate key as the foreign key in the records of related fact tables.

 

In her post earlier this year MK Queen showed how to use the Implement SCD step in a SAS Studio flow in order to create and maintain such a dimensional table structure. Please refer to her blog for background details on using the Implement SCD step. Here I will extend that discussion to present a design for a SAS Studio flow that:

  1. uses the Implement SCD step
  2. includes additional steps to maintain the primary- key/foreign-key relationship between an SCD dimensional table and a related fact table as both dimensional records and fact records are added over time.

Overview of the flow design

This is a view of the flow design with numbers superimposed to identify each item in the flow:

 

ghan_dec2023_1.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

Before going into the details let's briefly overview the flow design. The numbers below correspond to the numbered items in the flow diagram displayed above:

  1. CustomerchangesM1 is a comma separated text files containing either new customers for month1 or existing customers with changes effective this month. These record will be used to update a slowly changing customer dimension table called CUSTOMERDIMSCD.
  2. The Import step imports the data in CustomerchangesM1 to a target table.
  3. The target table for the import step is an Oracle table called CUSTOMERCHANGES. If the Oracle table already exists, it will be replaced with the data from the customerchange1 text file. Oracle tables are used in this flow because the Implement SCD step supports source and target tables from Oracle as well as a few other databases (and does not support SAS datasets). Using Oracle tables in each step maximizes in-database across all steps in the flow.
  4. The Implement SCD step uses the CUSTOMERCHANGES Oracle table to update an Oracle table called CUSTOMERDIMSCD – a customer dimension table that retains customer history. The target table is represented by the circular part of the Implement SCD step icon and does not show up as a separate object in the flow. The CUSTOMERDIMSCD table contains the customer_ID business key for each customer and also a generated surrogate key called CUSTOMER_HISTORYID as a unique primary key value.
  5. Orders1 is a comma separated text files containing orders received during month 1 for customers identified with the CUSTOMER_ID business key. These records need to be appended to an OrderFact table.
  6. This Import step imports the data in orders1 to a target table.
  7. The target table for this import step is an Oracle table called RECENTORDERS.
  8. A Query step uses a join to match each order record in the RECENTORDERS table with the current record for the matching CUSTOMER_ID in the CUSTOMERDIMSCD table and adds the CUSTOMER_HISTORYID key for that matching record to the RECENTORDER records.
  9. RECENTWITHHISTID is the target table produced by the join. It contains each order from RECENTORDERS plus the foreign key CUSTOMER_HISTORYID values that match the corresponding primary key in CUSTOMERDIMSCD.
  10. A Merge Table step is used to append the new recent order records from RECENTWITHHISTID to a target order fact table called CUSTORDERSCD.

This flow is designed so that in each new month (or any time period), as new customerchanges and/or new orders are available as source data in item 1 and/or item 5, the flow can be re-executed to update the customer dimension table (CUSTOMERDIMSCD) and the order fact table (CUSTORDERSCD). Also, if the flow is unintentionally executed for the same source records, no duplicate records will be unintentionally added to the dimension or fact tables. Apart from the source files (items 1 and 5 in the flow) all other tables in the flow are Oracle tables for two reasons:

  • the Implement SCD step and the Merge Table step require that both source and target tables must be stored in a supported database and Oracle is one such supported database
  • when the source and target tables are in the same database, the Implement SCD step, the Merge Table step, and the Join step are all execute fully in-database without any data being returned to SAS and without any data being processed by SAS. This minimized network transport of data, speeds the time of execution, and maximizes overall efficiency of the process.

The design presented here includes a single dimension table but the design could be repeated for additional dimensional tables in star-schema relational models.

Details on the flow design

Let's look at some of the details for each numbered item and follow the process that occurs as the flow is executed for a sample set of incoming source records and existing records in the CUSTOMERDIMSCD table. A very small number of records are used in these examples to facilitate illustrating the process, although the method discussed supports any volume of data.

Item 1

The customerchangesM1.csv file contains the following 2 records representing customer updates for the current month (month1):

 

ghan_dec2023_2.png

 

These are updates that must be applied in the CUSTOMERDIMSCD table that currently contains the following records:

 

ghan_dec2023_3.png

 

The first record in customerchangesM1.csv is an update for CUSTOMERID=1 in the CUSTOMERDIMSCD table and the second record in customerchangesM1.csv is a new customer to be appended to the CUSTOMERDIMSCD table.

Item 2

The Import step accesses the customerchangesM1.csv file and parses it into fields using the comma to delimit records and maps them to columns in the target table.

Item 3

The target table for the Import is defined as an Oracle table by selecting ORALIB, which is a SAS library defined to connect to the Oracle database. The table name is CUSTOMERCHANGES. If this table already exists, it will be replaced with the data from the current source for the import task.

 

ghan_dec2023_4.png

 Item 4

The Implement SCD step used the CUSTOMERDIMCHANGES Oracle table produced by the Import step as a source table and the CUSTOMERDIMSCD table is defined I the ImplementSCD properties as the target table:

 

ghan_dec2023_5.png

 

The Implement SCD step is configured to maintain historical records as the CUSTOMERDIMSCD table is updated. I will describe this briefly here and for more detail see the post by MK Queen.

 

CUSTOMERID is defined as the business key to match source and target columns for updates:

 

ghan_dec2023_6.png

 

Three columns in are identified for historical change tracking:

 

ghan_dec2023_7.png

 

The columns in the CUSTOMERDIMSCD target table used for change tracking are identified:

 

ghan_dec2023_8.png

 

The surrogate key that identifies unique records in CUSTOMERDIMSCD is identified:

 

ghan_dec2023_9.png

 

When the Implement SCD step is executed, the CUSTOMERDIMSCD target table is updated:

 

ghan_dec2023_10.png

 

The prior current record for CUSTOMERID=1 is no longer the current record (CURRENT_FLAG=0) and a new current record has been appended (CURRENT_FLAG=1). A record has also been added for CUSTOMER_ID=5. This customer did not exist in the table before the update.

Item 5

OrdersM1 is a comma delimited text file contains the following 2 records representing orders in the current month (month1):

 

ghan_dec2023_11.png

 

The CUSTOMERID business key matches the CUSTOMERID in the CUSTOMERDIMSCD. These are new orders that need to be added to the ORDERFACTSCD table and when we add them we need to include the CUSTOMER_HISTORYID primary key to relate them to the matching current record for the same customer in the CUSTOMERDIMSCD table.

Item 6

The Import step accesses the ordersM1.csv file and parses it into fields using the comma to delimit records and maps them to columns in the target table.

Item 7

The target table for the Import is defined as an Oracle table by selecting ORALIB, which is a SAS library defined to connect to the Oracle database. The table name is RECENTORDERS. If this table already exists, it will be replaced with the data from the current source for the import task.

 

ghan_dec2023_12.png

 Item 8

The join step does a right join of CUSTOMERDIMSCD with the RECENTORDERS table using CUSTOMER_ID as the join key:

 

ghan_dec2023_13.png

 

To make sure that the RECENTORDERS record match with only current customer records in CUSTOMERDIMSCD (and not historical records) a filter is added to the join to include only records from CUSTOMERDIMSCD where CURRENT_FLAG=1:

 

ghan_dec2023_14.png

 

All the columns from RECENTORDERS are selected as well as the CUSTOMER_HISTORYID from the CUSTOMERORDERSCD table:

 

ghan_dec2023_15.png

 

Item 9

The target table for the join is saved in Oracle by specifying the Oracle library (ORALIB). The table is called RECENTWITHHISTID:

 

ghan_dec2023_16.png

 

When the join is executed for the current CUSTOMERDIMSCD and RECENTORDERS tables the target RECENTWITHHISTID table contains these records:

 

ghan_dec2023_17.png

 

Item 10

Records in the RECENTWITHHIST table can now be added to the CUSTOMORDSCD Oracle table using the Merge Table step. More details for the Merge Table step can be found in the video Data Integration with the Merge Table Step posted by Grace Barnhill. In brief, the Merge step using an SQL Merge method to update rows in a target table based on a source table using key column matching. Rows from the source can be used to update matching target rows and/or rows in the source that do not match rows in the target can be appended to the target.

 

The target table for the Merge Table step will be an Oracle table in the ORALIB library called CUSTORDERSCD:

 

ghan_dec2023_18.png

 

We will match source record with target records on the Key column ORDERID:

 

ghan_dec2023_19.png

 

We will only Insert rows that DO NOT MATCH and we will add all column values from the source to the target:

 

ghan_dec2023_20.png

 

We do not update any rows when a matching ORDERID is found. We do not need to update rows because we only receive new orders from the source:

 

ghan_dec2023_21.png

 

Configured this way, the Merge Tables step assures that no duplicate orders will be added to the CUSTORDERSCD table, even if we execute the flow more than once for the same source data for new orders.

Automation: re-executing the flow when new source data is available

The source files (items 1 and 5 in the flow diagram) are File objects. In the properties for those items you can change the file that they access by clicking the folder icon to the right of the File location field:

 

ghan_dec2023_22.png

 

You can also update the same source files external to the flow and re-run the flow without changing the source files used in the flow. Such an approach would allow you to schedule the SAS Studio flows to execute automatically as source data is updated. For documentation on scheduling SAS Studio flows see Scheduling a Program, Task, Query, or Flow as a Job and Using the Scheduled Jobs Window.

 

In this flow design you can change one or both file sources independently and re-execute the flow because the Implement SCD step will only update existing CUSTOMERDIMSCD records when a change in a change-tracked column occurs and the Merge Step will only append new orders to the CUSTORDERSCD table. If the flow is inadvertently executed for any source data that has already been processed in an earlier run of the flow, those records will not affect the CUSTOMERDIMSCD or CUSTORDERSCD tables.  

Version history
Last update:
‎12-19-2023 03:23 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags