With the recent addition of SAS Studio Engineer steps, SAS Studio flows now support a wide variety of data management operations. This blog explains a design for a no-code SAS Studio flow that fully supports automation to update records in dimension tables and in related fact tables while implementing Slowly Changing Dimensions (SCD) to preserve historical dimension records.
The Implement SCD step allows you to save historical records for a given business entity in a dimension table. Saving current and historical records in dimension tables is the technique known as slowly changing dimensions. As an example, in a customer dimension table you may want to update a customer's address but maintain a record of the customer's old address. This is useful if you have a related CUSTOMERORDERS fact table and you want to relate older orders to the old customer address and new orders to the new address.
To store both historical and current records for a dimensional entity (such as customers), you can no longer use the unique identifier for each customer (let's call the identifier CUSTOMER_ID) as the primary key that foreign keys in fact tables reference. CUSTOMER_ID remains in the table to identify unique customers, but you also need to generate and store a surrogate key that is unique for each record in the slowly changing dimension table. You then include that surrogate key as the foreign key in the records of related fact tables.
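The relationship between the business key and the surrogate key can be sketched in a few lines of Python. This is purely illustrative: the column names mirror the ones used later in this post, and the data values are made up.

```python
# Each row in the SCD dimension table gets its own surrogate key
# (CUSTOMER_HISTORYID), while the business key (CUSTOMERID) repeats
# across the current and historical rows for one customer.
dim_rows = [
    {"CUSTOMER_HISTORYID": 1, "CUSTOMERID": 1,
     "ADDRESS": "12 Oak St", "CURRENT_FLAG": 0},   # historical address
    {"CUSTOMER_HISTORYID": 2, "CUSTOMERID": 1,
     "ADDRESS": "98 Elm Ave", "CURRENT_FLAG": 1},  # current address
]

# A fact row stores the surrogate key, not the business key, so an
# old order stays tied to the address that was current at order time.
fact_row = {"ORDERID": 500, "CUSTOMER_HISTORYID": 1, "AMOUNT": 19.99}

# Resolve the fact row back to the exact dimension record it references.
dim_by_sk = {r["CUSTOMER_HISTORYID"]: r for r in dim_rows}
print(dim_by_sk[fact_row["CUSTOMER_HISTORYID"]]["ADDRESS"])  # 12 Oak St
```

Because the fact row points at surrogate key 1, it resolves to the historical address even though the customer now has a newer record.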
In her post earlier this year, MK Queen showed how to use the Implement SCD step in a SAS Studio flow to create and maintain such a dimension table structure. Please refer to her blog for background details on using the Implement SCD step. Here I will extend that discussion to present a design for a SAS Studio flow that:
This is a view of the flow design with numbers superimposed to identify each item in the flow:
Before going into the details let's briefly overview the flow design. The numbers below correspond to the numbered items in the flow diagram displayed above:
This flow is designed so that in each new month (or any time period), as new customer changes and/or new orders become available as source data in item 1 and/or item 5, the flow can be re-executed to update the customer dimension table (CUSTOMERDIMSCD) and the order fact table (CUSTORDERSCD). Also, if the flow is unintentionally executed for the same source records, no duplicate records are added to the dimension or fact tables. Apart from the source files (items 1 and 5 in the flow), all other tables in the flow are Oracle tables for two reasons:
The design presented here includes a single dimension table but the design could be repeated for additional dimensional tables in star-schema relational models.
Let's look at some of the details for each numbered item and follow the process that occurs as the flow is executed for a sample set of incoming source records and existing records in the CUSTOMERDIMSCD table. A very small number of records are used in these examples to facilitate illustrating the process, although the method discussed supports any volume of data.
The customerchangesM1.csv file contains the following two records representing customer updates for the current month (month 1):
These are updates that must be applied in the CUSTOMERDIMSCD table that currently contains the following records:
The first record in customerchangesM1.csv is an update for CUSTOMERID=1 in the CUSTOMERDIMSCD table and the second record in customerchangesM1.csv is a new customer to be appended to the CUSTOMERDIMSCD table.
The Import step accesses the customerchangesM1.csv file, parses each record into fields using the comma delimiter, and maps the fields to columns in the target table.
The target table for the Import is defined as an Oracle table by selecting ORALIB, which is a SAS library defined to connect to the Oracle database. The table name is CUSTOMERCHANGES. If this table already exists, it will be replaced with the data from the current source for the import task.
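Outside the flow, the effect of the Import step can be sketched with Python's standard csv module. This is only an illustration: the column names below are assumed for the example, and the in-memory string stands in for customerchangesM1.csv (in the flow, the Import step does this parsing and loads the result into the Oracle CUSTOMERCHANGES table).

```python
import csv
import io

# Stand-in for customerchangesM1.csv; column names are illustrative.
raw = """CUSTOMERID,NAME,ADDRESS
1,Jane Doe,98 Elm Ave
5,New Customer,7 Pine Rd
"""

with io.StringIO(raw) as fh:
    # Commas delimit the fields; the header row supplies the column names.
    changes = list(csv.DictReader(fh))

print(len(changes))              # 2
print(changes[0]["CUSTOMERID"])  # 1
```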
The Implement SCD step uses the CUSTOMERCHANGES Oracle table produced by the Import step as its source table, and the CUSTOMERDIMSCD table is defined in the Implement SCD properties as the target table:
The Implement SCD step is configured to maintain historical records as the CUSTOMERDIMSCD table is updated. I will describe this briefly here; for more detail, see the post by MK Queen.
CUSTOMERID is defined as the business key to match source and target columns for updates:
Three source columns are identified for historical change tracking:
The columns in the CUSTOMERDIMSCD target table used for change tracking are identified:
The surrogate key that identifies unique records in CUSTOMERDIMSCD is identified:
When the Implement SCD step is executed, the CUSTOMERDIMSCD target table is updated:
The prior current record for CUSTOMERID=1 is no longer the current record (CURRENT_FLAG=0), and a new current record has been appended (CURRENT_FLAG=1). A record has also been added for CUSTOMERID=5, a customer that did not exist in the table before the update.
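Conceptually, what the Implement SCD step does to CUSTOMERDIMSCD can be sketched in Python. This is a simplified type-2 update under assumed column names; the real step also manages details such as validity dates, and the code it generates runs against Oracle.

```python
def apply_scd2(dim_rows, changes, tracked_cols=("ADDRESS",)):
    """Expire the current record when a tracked column changed and
    append a new current record; append brand-new customers as-is."""
    next_sk = max((r["CUSTOMER_HISTORYID"] for r in dim_rows), default=0) + 1
    for chg in changes:
        current = next((r for r in dim_rows
                        if r["CUSTOMERID"] == chg["CUSTOMERID"]
                        and r["CURRENT_FLAG"] == 1), None)
        if current and all(current[c] == chg[c] for c in tracked_cols):
            continue                     # nothing tracked changed: no-op
        if current:
            current["CURRENT_FLAG"] = 0  # retire the old current record
        dim_rows.append({"CUSTOMER_HISTORYID": next_sk,
                         "CURRENT_FLAG": 1, **chg})
        next_sk += 1
    return dim_rows

dim = [{"CUSTOMER_HISTORYID": 1, "CUSTOMERID": 1,
        "ADDRESS": "12 Oak St", "CURRENT_FLAG": 1}]
apply_scd2(dim, [{"CUSTOMERID": 1, "ADDRESS": "98 Elm Ave"},
                 {"CUSTOMERID": 5, "ADDRESS": "7 Pine Rd"}])
print(len(dim))  # 3: a retired row, a new current row, a new customer
```

Note that re-running `apply_scd2` with the same changes is a no-op, since the tracked columns already match the current records; this mirrors why the flow can safely be re-executed for already-processed source data.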
ordersM1.csv is a comma-delimited text file that contains the following two records representing orders in the current month (month 1):
The CUSTOMERID business key matches the CUSTOMERID in the CUSTOMERDIMSCD table. These are new orders that need to be added to the CUSTORDERSCD table, and when we add them we need to include the CUSTOMER_HISTORYID surrogate key to relate each order to the matching current record for the same customer in the CUSTOMERDIMSCD table.
The Import step accesses the ordersM1.csv file, parses each record into fields using the comma delimiter, and maps the fields to columns in the target table.
The target table for the Import is defined as an Oracle table by selecting ORALIB, which is a SAS library defined to connect to the Oracle database. The table name is RECENTORDERS. If this table already exists, it will be replaced with the data from the current source for the import task.
The Join step does a right join of CUSTOMERDIMSCD with the RECENTORDERS table using CUSTOMERID as the join key:
To make sure that the RECENTORDERS records match only current customer records in CUSTOMERDIMSCD (and not historical records), a filter is added to the join to include only records from CUSTOMERDIMSCD where CURRENT_FLAG=1:
All the columns from RECENTORDERS are selected, as well as the CUSTOMER_HISTORYID from the CUSTOMERDIMSCD table:
The target table for the join is saved in Oracle by specifying the Oracle library (ORALIB). The table is called RECENTWITHHISTID:
When the join is executed for the current CUSTOMERDIMSCD and RECENTORDERS tables the target RECENTWITHHISTID table contains these records:
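The effect of joining on the business key while filtering to CURRENT_FLAG=1 can be sketched in Python. This is a simplification for illustration: the flow performs this as an SQL join in Oracle, and a right join would also keep orders with no matching current customer, whereas the lookup below assumes every order has one.

```python
# Dimension rows: one customer with a historical and a current record.
dim = [
    {"CUSTOMER_HISTORYID": 1, "CUSTOMERID": 1, "CURRENT_FLAG": 0},
    {"CUSTOMER_HISTORYID": 2, "CUSTOMERID": 1, "CURRENT_FLAG": 1},
]
recent_orders = [{"ORDERID": 101, "CUSTOMERID": 1, "AMOUNT": 42.50}]

# Filter to CURRENT_FLAG=1 so each order matches exactly one dimension
# record, then attach that record's surrogate key to every order row.
current_sk = {r["CUSTOMERID"]: r["CUSTOMER_HISTORYID"]
              for r in dim if r["CURRENT_FLAG"] == 1}
recent_with_histid = [{**o, "CUSTOMER_HISTORYID": current_sk[o["CUSTOMERID"]]}
                      for o in recent_orders]
print(recent_with_histid[0]["CUSTOMER_HISTORYID"])  # 2
```

Without the CURRENT_FLAG filter, the order for CUSTOMERID=1 would match two dimension records and the join would produce a duplicate order row.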
Records in the RECENTWITHHISTID table can now be added to the CUSTORDERSCD Oracle table using the Merge Table step. More details on the Merge Table step can be found in the video Data Integration with the Merge Table Step posted by Grace Barnhill. In brief, the Merge Table step uses an SQL MERGE method to update rows in a target table based on a source table using key-column matching. Rows from the source can be used to update matching target rows, and/or source rows that do not match rows in the target can be appended to the target.
The target table for the Merge Table step will be an Oracle table in the ORALIB library called CUSTORDERSCD:
We will match source records with target records on the key column ORDERID:
We will only Insert rows that DO NOT MATCH and we will add all column values from the source to the target:
We do not update any rows when a matching ORDERID is found. We do not need to update rows because we only receive new orders from the source:
Configured this way, the Merge Table step ensures that no duplicate orders are added to the CUSTORDERSCD table, even if we execute the flow more than once for the same new-order source data.
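The insert-only matching rule can be sketched in Python. The Merge Table step generates an SQL MERGE against Oracle; this is just the equivalent logic for illustration.

```python
def merge_insert_only(target, source, key="ORDERID"):
    """Append source rows whose key is not already in the target;
    matching rows are left untouched, so re-runs add no duplicates."""
    existing = {row[key] for row in target}
    target.extend(row for row in source if row[key] not in existing)
    return target

fact = [{"ORDERID": 100, "CUSTOMER_HISTORYID": 1}]
new_orders = [{"ORDERID": 100, "CUSTOMER_HISTORYID": 1},   # already loaded
              {"ORDERID": 101, "CUSTOMER_HISTORYID": 2}]   # genuinely new
merge_insert_only(fact, new_orders)
merge_insert_only(fact, new_orders)  # second run: nothing new to insert
print(len(fact))  # 2
```

Because matched rows are never updated or re-inserted, running the merge twice against the same source leaves the fact table unchanged after the first run.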
The source files (items 1 and 5 in the flow diagram) are File objects. In the properties for those items you can change the file that they access by clicking the folder icon to the right of the File location field:
You can also update the same source files external to the flow and re-run the flow without changing the source files used in the flow. Such an approach would allow you to schedule the SAS Studio flows to execute automatically as source data is updated. For documentation on scheduling SAS Studio flows see Scheduling a Program, Task, Query, or Flow as a Job and Using the Scheduled Jobs Window.
In this flow design you can change one or both file sources independently and re-execute the flow, because the Implement SCD step will only update existing CUSTOMERDIMSCD records when a change in a change-tracked column occurs, and the Merge Table step will only append new orders to the CUSTORDERSCD table. If the flow is inadvertently executed for any source data that has already been processed in an earlier run of the flow, those records will not affect the CUSTOMERDIMSCD or CUSTORDERSCD tables.