BookmarkSubscribeRSS Feed

SAS Data Studio 2.2: New Transforms for Unique Identifier & Match and Cluster

Started ‎08-02-2018 by
Modified ‎08-13-2018 by
Views 1,958

In a previous post on SAS Data Studio, I showed how you could use the Code transform to create a Unique Identifier and Cluster records.  SAS Data Studio 2.2 which is powered by SAS Viya 3.4, now has built-in transforms to perform these functions and you no longer need to write your own code.

 

In this article, I will introduce you to the two new transforms in SAS Data Studio that perform this functionality:

  • Unique identifier – enables you to create a column in the target table that contains a value to uniquely identify each row in the table. These unique row identifiers are used in text topics in SAS Visual Analytics.
  • Match and cluster – enables you to match columns according to clustering rules and conditions resulting in the matching records receiving the same cluster ID for grouping purposes. Note: To use this transform you must be licensed for SAS Data Preparation.

 

For my examples, I am using the following list of contact records as the source table for my data preparation plan in SAS Data Studio.

 

1_SourceData.png

 

Unique Identifier Transform Example

As part of my data preparation process I would like to generate a unique identifier for each of the rows in the table.  To do this, I select the new Unique identifier transform in the Row Transforms section and select the option to create a new column called Unique_ID. I also have the option to replace an existing column with generated unique identifier.

 

2_UniqueIDTransform.png

 

The results of running the Unique identifier transform on my contact list table are shown below:

 

3_UniqueIDTransformResults.png

 

Match and Cluster Transform Example

Continuing with the contact records example, I want to match and cluster my records on the following conditions:

Name, Address, and Zip

OR

Name and Phone.

 

Prior to applying these matching rules, I will perform some data quality operations on the data in order to achieve better matching results.

 

First, I generate some match codes for the fields Name, Address, and Zip to facilitate the fuzzy matching of the information in these columns. I used the Matchcodes transform in the Data Quality Transforms section to generate these matchcodes.

 

4_Matchcodes.png

 

Next, I standardize the Phone field using the Standardize transform in the Data Quality Transforms section.

 

5_Standardize.png

 

Now that I have prepared my data for my matching purposes, I select the new Match and cluster transform in the Data Quality Transforms section. I have called the new column Cluster_ID and added my matching conditions as shown in the screenshot below:

 

6_MatchAndClusterTransform.png

 

The Match and cluster transform has some advanced options you can select:

  • Interpret empty strings as null values. This option is checked by default.
  • Allow null values to match. For matching purposes, if two columns have a null value then they will not match unless this option is checked.
  • Exclude rows from matching if the value of the following column is "True". Specify the name of a column that contains the "do not cluster" flag. If a row is flagged as "do not cluster", then that row will be placed in a cluster all by itself and be assigned a unique cluster ID value. The transform will not attempt to match the row to any other rows in the source table.

 

The results of running of the Match and cluster transform on my table are displayed below:

 

7_MatchAndClusterTransformResults.png

 

Notice that all the variations of the "Susan Woodard" records are assigned the same Cluster_ID value as are the "James Briggs" records.

 

With the addition of these two new transforms in SAS Data Studio 2.2. users no longer need to write custom code to achieve these common data preparation functions.  For more information on the new release of SAS Data Studio 2.2, you can refer to the documentation for SAS Viya 3.4: Data Preparation.

Version history
Last update:
‎08-13-2018 03:48 PM
Updated by:
Contributors

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags