Let's paint a picture: you're attempting to remove the duplicate rows from a data set. Deduplication is a common data transformation task that can be completed with a variety of methods: PROC SORT, the Remove Duplicates flow step, adding the DISTINCT keyword to a SELECT statement in SQL, and so on. However, that task gets more complicated when you aren't removing exact duplicates. What if there are differences in formatting between duplicate rows, like alternate spellings of a name? What if one row contains useful column values that are missing in another row, even though they refer to the same entity? You can solve this problem with data survivorship–and you can do it in a point-and-click flow!
In this post, I'll show you how to use the Match Codes step with the Survivorship code snippet in SAS Studio to perform seamless entity resolution. I'll deduplicate and combine some complex customer data to ensure that all relevant information is stored in one surviving record per customer. Keep reading to learn how to take advantage of these tools in a SAS Studio flow!
The Match Codes Step
The Match Codes step enables you to create encoded representations of text strings called match codes. Match codes can be used to group similar data values together or join similar records across multiple tables.
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
In a previous post, Detailing your Data in SAS Studio Part 4: Matching the Meaningful Data, I expand further on the Match Codes step and show an example of using match codes to join two tables together in a flow. Here, I'll be showing an example of grouping and condensing data values together within one dataset. For more information on the Match Codes step and its capabilities, visit the documentation.
The Match Codes step is available with the SAS Studio Engineer license.
The Survivorship Code Snippet
Survivorship is the process of combining or selecting data to create one "surviving" record per cluster of similar records. Survivorship rules are defined to select which values to retain for specific variables, and the output is a single completed record.
The Survivorship snippet uses the %DQSURVR autocall macro to identify and compose a surviving record. The required parameters include the input and output tables, the cluster column, and a row rule to identify the surviving record. To further customize your survivorship process, you can also specify additional row rules, column rules, and more. Visit the documentation to see a full list of macro parameters and examples.
The Survivorship snippet is one of many data quality snippets available in SAS Studio with SAS Viya. For more information on working with code snippets in general, refer to the documentation.
Scenario
In this post, I'll be using a dataset I created called CUSTOMERS, which has 60 rows and 9 columns. This table has several data quality issues. Notably, each customer appears three times, with each row storing different pieces of information about the customer. Some rows are missing the CustomerID key value. Other columns that could be used as unique identifiers, like Name and Address, do not have consistent values across similar rows. For example, the customer name Anna Christian is written as Anna R. Christian; CHRISTIAN, ANNA; and Christian, Anna R. in the data.
We'll deduplicate this data with our SAS Data Quality rules.
Generate Match Codes with Match Codes Step
First, we'll generate match codes to determine which rows represent the same customers. In my flow, I'll connect a Match Codes node to my source table and select the ENUSA QKB locale. Then, I'll configure the step to generate match codes for the Address1 variable using the Address match definition. I'll keep the default sensitivity of 85.
After running the flow, I can review the results.
Now, rows with similar Address1 values also have equivalent match codes regardless of each address's format. These match codes will be used in the next step.
Combine and Remove Duplicates with Survivorship Snippet
Next, we'll use the survivorship snippet to combine customer data into one surviving record per customer based on equivalent match code values.
In SAS Studio, I'll go into the Snippets pane and expand the Standard → Data Quality folders. From there, I can click and drag the Survivorship snippet onto the flow canvas and connect it to the Match Codes step.
The code snippet acts like a standard SAS Program step in the flow. When reviewing the snippet, I can see multiple examples of using the %DQSURVR autocall macro in addition to preliminary code that creates the example data and libraries. I can customize any of these examples to use my own data instead.
I'll edit the SAS Program step to keep only the first example that calls the %DQSURVR macro. Then, I'll edit the parameters as follows to meet my needs:
After editing the code, I'll run the flow again and review the results.
The output table has 20 rows and 5 columns. Duplicate records were accurately condensed to store all customer data in one record. I could continue to improve this table by standardizing and casing the variables with the Clean Data step, but I'll stop here for the scope of this post.
Summary
In this post, I demonstrated how to use two different types of SAS Data Quality features to perform entity resolution and create surviving records in SAS Studio Flows. If you want to learn more about using the Match Codes step, check out my post Detailing your Data in SAS Studio Part 4: Matching the Meaningful Data.
Have you used SAS code snippets before, or did you first learn about them here? Would the SAS Data Quality snippets be useful in your flows or programs? Share your thoughts, questions, and feedback below!
Find more articles from SAS Global Enablement and Learning here.
Nearly 200 sessions are now available on demand with the SAS Innovate Digital Pass.
Explore Now →The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.