BookmarkSubscribeRSS Feed

Satisfying Data Survivorship with SAS Data Quality Steps and Snippets

Started yesterday by
Modified yesterday by
Views 61

Let's paint a picture: you're attempting to remove the duplicate rows from a data set. Deduplication is a common data transformation task that can be completed with a variety of methods: PROC SORT, the Remove Duplicates flow step, adding the DISTINCT keyword to a SELECT statement in SQL, and so on. However, that task gets more complicated when you aren't removing exact duplicates. What if there are differences in formatting between duplicate rows, like alternate spellings of a name? What if one row contains useful column values that are missing in another row, even though they refer to the same entity? You can solve this problem with data survivorship–and you can do it in a point-and-click flow!

 

In this post, I'll show you how to use the Match Codes step with the Survivorship code snippet in SAS Studio to perform seamless entity resolution. I'll deduplicate and combine some complex customer data to ensure that all relevant information is stored in one surviving record per customer. Keep reading to learn how to take advantage of these tools in a SAS Studio flow!

 

The Match Codes Step

 

The Match Codes step enables you to create encoded representations of text strings called match codes. Match codes can be used to group similar data values together or join similar records across multiple tables.

 

01_grbarn_survivorship1.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

In a previous post, Detailing your Data in SAS Studio Part 4: Matching the Meaningful Data, I expand further on the Match Codes step and show an example of using match codes to join two tables together in a flow. Here, I'll be showing an example of grouping and condensing data values together within one dataset. For more information on the Match Codes step and its capabilities, visit the documentation.

 

The Match Codes step is available with the SAS Studio Engineer license.

 

The Survivorship Code Snippet

 

Survivorship is the process of combining or selecting data to create one "surviving" record per cluster of similar records. Survivorship rules are defined to select which values to retain for specific variables, and the output is a single completed record.

 

02_grbarn_survivorship2.png

 

The Survivorship snippet uses the %DQSURVR autocall macro to identify and compose a surviving record. The required parameters include the input and output tables, the cluster column, and a row rule to identify the surviving record. To further customize your survivorship process, you can also specify additional row rules, column rules, and more. Visit the documentation to see a full list of macro parameters and examples.

 

The Survivorship snippet is one of many data quality snippets available in SAS Studio with SAS Viya. For more information on working with code snippets in general, refer to the documentation.

 

Scenario

 

In this post, I'll be using a dataset I created called CUSTOMERS, which has 60 rows and 9 columns. This table has several data quality issues. Notably, each customer appears three times, with each row storing different pieces of information about the customer. Some rows are missing the CustomerID key value. Other columns that could be used as unique identifiers, like Name and Address, do not have consistent values across similar rows. For example, the customer name Anna Christian is written as Anna R. Christian; CHRISTIAN, ANNA; and Christian, Anna R. in the data.

 

03_grbarn_survivorship3-1024x404.png

 

We'll deduplicate this data with our SAS Data Quality rules.

 

Generate Match Codes with Match Codes Step

 

First, we'll generate match codes to determine which rows represent the same customers. In my flow, I'll connect a Match Codes node to my source table and select the ENUSA QKB locale. Then, I'll configure the step to generate match codes for the Address1 variable using the Address match definition. I'll keep the default sensitivity of 85. 

 

04_grbarn_survivorship4.png

 

After running the flow, I can review the results.

 

05_grbarn_survivorship5-1024x410.png

 

Now, rows with similar Address1 values also have equivalent match codes regardless of each address's format. These match codes will be used in the next step.

 

Combine and Remove Duplicates with Survivorship Snippet

 

Next, we'll use the survivorship snippet to combine customer data into one surviving record per customer based on equivalent match code values.

 

In SAS Studio, I'll go into the Snippets pane and expand the Standard → Data Quality folders. From there, I can click and drag the Survivorship snippet onto the flow canvas and connect it to the Match Codes step.

 

06_grbarn_survivorship6-1024x683.png

 

The code snippet acts like a standard SAS Program step in the flow. When reviewing the snippet, I can see multiple examples of using the %DQSURVR autocall macro in addition to preliminary code that creates the example data and libraries. I can customize any of these examples to use my own data instead.

 

07_grbarn_survivorship7.png

 

I'll edit the SAS Program step to keep only the first example that calls the %DQSURVR macro. Then, I'll edit the parameters as follows to meet my needs:

 

  • inTable=&_input1
    • &_input1 is a macro variable which stores the input table name for a node. This option is used because the input table is a temporary table created by the Match Codes step, and temporary tables are given a random name by default.
  • outTable=&_output1
    • &_output1 is a macro variable which stores the output table name for a node. This option is used because we want to store this output in a temporary table generated by the SAS Program step, and temporary tables are given a random name by default.
  • clusterColumn=Address1_MC85
    • Address1_MC85 is the column that stores the equivalent values which cluster similar records based on Address1 match code values.
  • rowRule1=(longest, Name)
    • The surviving record will be chosen by the longest Name value, because this should be the most complete version of the customer's name. For example, "Barbara Ann Rhymes" is longer than "Barbie Rhymes".
  • firstColumnRule1=(not_missing, Gender)
    • The surviving record should store the first non-missing Gender value from the cluster records.
  • firstColumnRuleAppliedCols=(Birthday, Occupation, Company)
    • The surviving record should store the BirthdayOccupation, and Company values from the row satisfying firstColumnRule1. Notice in the original data that any rows with non-missing Gender values also have non-missing Birthday, Occupation, and Company values.
  • secondColumnRule1=(not_missing, Email)
    • The surviving record should store the first non-missing Email value from the cluster records.
  • Add the new parameter secondColumnRuleAppliedCols=(CustomerID, Phone)
    • The surviving record should store the CustomerID and Phone values from the row satisfying secondColumnRule1. Notice in the original data that any rows with non-missing Email values also have non-missing CustomerID and Phone values.
  • Add the new parameter generateDistinctSurvivor=1 (after keepDuplicates=0)
    • The output will include only one surviving record for each customer.

 

08_grbarn_survivorship8.png

 

After editing the code, I'll run the flow again and review the results.

 

09_grbarn_survivorship9-1-1024x492.png

 

The output table has 20 rows and 5 columns. Duplicate records were accurately condensed to store all customer data in one record. I could continue to improve this table by standardizing and casing the variables with the Clean Data step, but I'll stop here for the scope of this post.

 

Summary

 

In this post, I demonstrated how to use two different types of SAS Data Quality features to perform entity resolution and create surviving records in SAS Studio Flows. If you want to learn more about using the Match Codes step, check out my post Detailing your Data in SAS Studio Part 4: Matching the Meaningful Data.

 

Have you used SAS code snippets before, or did you first learn about them here? Would the SAS Data Quality snippets be useful in your flows or programs? Share your thoughts, questions, and feedback below!

 

 

Find more articles from SAS Global Enablement and Learning here.

Contributors
Version history
Last update:
yesterday
Updated by:

Catch up on SAS Innovate 2026

Nearly 200 sessions are now available on demand with the SAS Innovate Digital Pass.

Explore Now →

SAS AI and Machine Learning Courses

The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.

Get started

Article Tags