BookmarkSubscribeRSS Feed

SAS Viya 3.5: Remove Duplicate Records in SAS Data Studio

Started ‎03-31-2020 by
Modified ‎03-31-2020 by
Views 5,320

In the SAS Viya 3.5 release of SAS Data Studio if you have a SAS Data Preparation license, you have access to a new transform called Remove Duplicates. This transform returns only the unique records in a data set according to the criteria you specify. This transform is a step towards performing survivorship without having to write code yourself; however, in order to specify which duplicate record you want to keep from a grouping you will still need to write code. In this blog, I will walk through an example of using the Remove Duplicates transform in SAS Data Studio as well as a case where you still need to use the Code transform.

   

Example Data

For my example, I will use the records below as my starting data set for removing duplicates:

 

1_StartingData.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

I standardize data, generate match codes to facility fuzzy matching, and cluster (match) the records according to the following rules in SAS Data Studio:

 

2_MatchAndCluster.png

.

 

Any record that is considered a duplicate according to the rules I specified above has the same Cluster_ID value. The results of clustering the records in SAS Data Studio have been written to the table below: 

 

3_ClusteredData.png

 

Now that I have established which records in the set are considered duplicates, I will demonstrate two different methods for removing the duplicate records and keeping only record per Cluster_ID grouping in SAS Data Studio.

 

Remove Duplicates Transform Example

In this scenario, I use the Remove Duplicates transform in the Data Quality Transforms section to keep only one record from each Cluster_ID grouping. Since I don't have duplicate records where all columns have the same value, I uncheck the default option to Remove duplicates across all columns. This allows me to select the columns to group by. In my case I only want to group by the Cluster_ID column. 

 

4_RemoveDuplicatesTransform.png

 

Running this transform and saving the results to table, you can see that only one record per Cluster_ID is kept as shown in the screenshot below.

 

5_RemoveDuplicatesTransformResults.png

 

However, the Remove Duplicates transform does not allow me to specify which record to keep from the grouping, it simply selects one. This may be fine in some cases, but if I want to specify which record to keep from the group, then I need to use code and I will demonstrate how to do that in the next example.

   

Remove Duplicates using Code Transform Example

In this scenario, I use the Code transform in the Custom Transforms section to keep only one record from each Cluster_ID grouping and I specify to keep the record with the most recent Last_Contact date.

 

6_CodeTransform.png

 

Here's a closer look at the DATA step code I used:

 

/* BEGIN data step with the output table data */
data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}} promote="no");
/* Set the input set */
set {{_dp_inputTable}} (caslib={{_dp_inputCaslib}} );
/* group by Cluster_ID, then by Last_Contact */
by Cluster_ID Last_Contact;
/* return last record in grouping which will be the most recent Last_Contact date */
   if last.Cluster_ID then output;
run;

 

Running this code results in only one record returned per Cluster_ID and the record kept is the one with the latest Last_Contact date as shown in the screenshot below. Note the de-duplicated data set is different from my previous example since I specified which record to keep from each grouping through code

 

7_CodeTransformResults.png

 

In conclusion, the new Remove Duplicates transform in SAS Data Studio in SAS Viya 3.5 is useful for removing duplicate records from a data set; however, if you need more control on which record is kept within a grouping, then you will still need to use the Code transform for now. Another option to use in your code than the one I demonstrated is to use the CAS action groupByInfo. For more information on the Remove Duplicates transform in SAS Data Studio, please refer to its documentation.   

Version history
Last update:
‎03-31-2020 10:28 AM
Updated by:
Contributors

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags