In a previous post on SAS Data Studio, I showed how you could use the Code transform to create a unique identifier and cluster records. SAS Data Studio 2.2, which is powered by SAS Viya 3.4, now has built-in transforms to perform these functions, so you no longer need to write your own code.
In this article, I will introduce you to the two new transforms in SAS Data Studio that perform these functions: Unique identifier and Match and cluster.
For my examples, I am using the following list of contact records as the source table for my data preparation plan in SAS Data Studio.
As part of my data preparation process, I would like to generate a unique identifier for each of the rows in the table. To do this, I select the new Unique identifier transform in the Row Transforms section and select the option to create a new column called Unique_ID. I also have the option to replace an existing column with the generated unique identifier.
The results of running the Unique identifier transform on my contact list table are shown below:
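For readers who want to see the idea outside SAS Data Studio, here is a minimal Python sketch of adding a Unique_ID column to a list of rows. It is only an illustration: the function name add_unique_id and the sequential numbering are my assumptions, and the transform itself may generate identifiers in a different format.

```python
from itertools import count


def add_unique_id(rows, column="Unique_ID", start=1):
    """Return copies of the rows with a sequential identifier added.

    Hypothetical helper: sequential integers are an assumption, not
    necessarily the format SAS Data Studio produces.
    """
    ids = count(start)
    # Build new dictionaries so the input rows are left untouched.
    return [{**row, column: next(ids)} for row in rows]


# Illustrative rows only; the real contact table has more columns.
contacts = [
    {"Name": "Susan Woodard"},
    {"Name": "James Briggs"},
]
with_ids = add_unique_id(contacts)
```

Passing a column name that already exists in the rows overwrites it, which loosely mirrors the option to replace an existing column.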
Continuing with the contact records example, I want to match and cluster my records on the following conditions:
Name, Address, and Zip
OR
Name and Phone.
Prior to applying these matching rules, I will perform some data quality operations on the data in order to achieve better matching results.
First, I generate match codes for the Name, Address, and Zip fields to facilitate fuzzy matching of the information in these columns. I use the Matchcodes transform in the Data Quality Transforms section to generate these match codes.
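To give a feel for why match codes help with fuzzy matching, the following Python sketch builds a deliberately crude stand-in. It is not the algorithm the Matchcodes transform uses, and the function name crude_match_code is hypothetical. It only shows the general idea that differently typed variants of a value can collapse to the same key.

```python
import re


def crude_match_code(value):
    """Collapse a string to a rough comparison key.

    Assumption for illustration only: uppercase the value, drop anything
    that is not a letter or digit, then drop vowels after the first
    character. Real match codes are far more sophisticated.
    """
    cleaned = re.sub(r"[^A-Z0-9]", "", value.upper())
    if not cleaned:
        return ""
    return cleaned[0] + re.sub(r"[AEIOU]", "", cleaned[1:])


# Case, spacing, and punctuation differences produce the same key.
same_key = crude_match_code("Susan Woodard") == crude_match_code("susan  woodard.")
```

Two records can then be compared on the generated key instead of on the raw text.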
Next, I standardize the Phone field using the Standardize transform in the Data Quality Transforms section.
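As a rough illustration of what standardizing a phone number means, here is a small Python sketch. It assumes 10-digit US numbers and a "(NNN) NNN-NNNN" output format. Both are my assumptions, as is the function name standardize_phone, and the Standardize transform may produce a different format.

```python
import re


def standardize_phone(raw):
    """Rewrite a US phone number as (NNN) NNN-NNNN when possible.

    Assumption for illustration: numbers have 10 digits, optionally
    preceded by a country code of 1. Anything else is returned trimmed
    but otherwise unchanged.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return raw.strip()
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```

Once every value is in one format, an exact comparison on the Phone field becomes meaningful.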
Now that I have prepared my data for matching, I select the new Match and cluster transform in the Data Quality Transforms section. I have called the new column Cluster_ID and added my matching conditions as shown in the screenshot below:
The Match and cluster transform has some advanced options you can select:
The results of running the Match and cluster transform on my table are displayed below:
Notice that all the variations of the "Susan Woodard" records are assigned the same Cluster_ID value, as are the "James Briggs" records.
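The two OR-ed matching conditions can be sketched in Python with a union-find structure. This is an illustration under stated assumptions, not the transform's implementation: the column names Name_MC, Address_MC, Zip_MC, and Phone_STND, the sample values, and the function cluster_records are all hypothetical, and I assume that records linked by either condition, directly or through a chain of other records, belong to one cluster.

```python
def cluster_records(records, rules, column="Cluster_ID"):
    """Assign a cluster number to each record.

    Each rule is a tuple of field names. Two records match under a rule
    when all of that rule's fields are equal and non-empty. Rules are
    combined with OR, and matches are merged transitively.
    """
    parent = list(range(len(records)))

    def find(i):
        # Walk to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the earlier record as the root.
            parent[max(ri, rj)] = min(ri, rj)

    for rule in rules:
        first_seen = {}
        for index, record in enumerate(records):
            key = tuple(record.get(field) for field in rule)
            if any(part in (None, "") for part in key):
                continue  # incomplete keys never match
            if key in first_seen:
                union(first_seen[key], index)
            else:
                first_seen[key] = index

    cluster_ids = {}
    result = []
    for index, record in enumerate(records):
        root = find(index)
        # Number clusters in order of first appearance, starting at 1.
        cluster_id = cluster_ids.setdefault(root, len(cluster_ids) + 1)
        result.append({**record, column: cluster_id})
    return result


# Hypothetical prepared rows: match codes plus a standardized phone.
rows = [
    {"Name_MC": "A", "Address_MC": "X1", "Zip_MC": "Z1", "Phone_STND": "P1"},
    {"Name_MC": "A", "Address_MC": "X1", "Zip_MC": "Z1", "Phone_STND": "P2"},
    {"Name_MC": "A", "Address_MC": "X2", "Zip_MC": "Z2", "Phone_STND": "P2"},
    {"Name_MC": "B", "Address_MC": "X3", "Zip_MC": "Z3", "Phone_STND": "P3"},
]
rules = [("Name_MC", "Address_MC", "Zip_MC"), ("Name_MC", "Phone_STND")]
clustered = cluster_records(rows, rules)
```

In this sample, the first two rows match on name, address, and zip, and the second and third match on name and phone, so the first three rows share one cluster while the fourth stands alone.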
With the addition of these two new transforms in SAS Data Studio 2.2, users no longer need to write custom code to achieve these common data preparation functions. For more information on the new release of SAS Data Studio 2.2, you can refer to the documentation for SAS Viya 3.4: Data Preparation.