Have you ever needed to cut down a large data set but weren't sure how? Chances are you'll need to remove duplicate rows or values in your data, whether to eliminate redundancy or simply to analyze rows with unique field values. Much like other common data manipulation tasks, SAS Studio Flows has multiple steps that can remove duplicates in a few mouse clicks!
In this post, I'll show you three steps that each perform a different type of deduplication: Remove Duplicates, Sort, and Query.
Remove Duplicates step
As the name suggests, the Remove Duplicates step will remove duplicate rows from your data. While other steps have some flexibility regarding how duplicates are removed, this step will only remove complete duplicate rows. In other words, this step eliminates rows that match the values of another row for every field. Visit the documentation for a full description of step capabilities.
Say you’re working with an augmented version of the SASHELP.CLASS table, CLASSDUPS. This table includes duplicates of some student records.
To use the Remove Duplicates step, simply connect your source data to the node. The only required setting is a check box to remove duplicates across all columns.
If desired, you can adjust the output settings. By default, the step replaces any existing output table of the same name and writes duplicate rows to the log, but you can choose to store duplicates in a separate table instead.
After running the step, all duplicate rows are removed.
Duplicates are also written to the log due to the default output settings.
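If you prefer code to clicks, a rough SAS equivalent of this step is PROC SORT with the NODUPRECS option. Here's a minimal sketch (the output table names are my own; this isn't necessarily the exact code the flow generates):

    /* Sort on every column so identical rows become adjacent,       */
    /* then NODUPRECS drops each row that matches the one before it. */
    /* DUPOUT= mirrors the option to store duplicates in a table.    */
    proc sort data=work.classdups out=work.class_nodups
              noduprecs dupout=work.class_dups;
        by _all_;
    run;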
Sort step
The Sort step was the main subject of my last post. Though its primary purpose is to sort data, this step can also remove duplicate rows. Additionally, it gives you more control over how duplicates are handled – you can remove duplicate rows based only on the sort columns or based on all columns. Visit the documentation for a full description of step capabilities.
Say you’re analyzing the SASHELP.FISH table, which contains measurements of fish caught in Lake Laengelmaevesi. Measurements include weight, three different lengths (nose to beginning of tail, nose to notch of tail, and nose to end of tail), height, and width.
You’d like to quickly extract the first appearance of each species in the data. This data has been pre-sorted to list each specimen by increasing length within each species, meaning that the first occurrence of each species in the data will have the shortest length measurement.
First, configure the step to sort the source data by Species in ascending order.
Notice the Options button on the tab's toolbar. Selecting it lets you configure duplicate row behavior with three options: Keep all rows, Remove rows with duplicate values in sort columns, or Remove duplicate rows. The final option yields results identical to the Remove Duplicates step. For the desired result, select Remove rows with duplicate values in sort columns.
Once you run this step, the output data will show only one row per species, since rows with duplicate values in the Species column were removed. Each remaining row represents the shortest specimen of its species.
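This behavior maps naturally onto PROC SORT's NODUPKEY option. A minimal sketch, assuming the output table name WORK.FISH_FIRST (again, not necessarily the exact code the step generates):

    /* NODUPKEY keeps only the first row encountered for each BY value */
    proc sort data=sashelp.fish out=work.fish_first nodupkey;
        by Species;
    run;

Because PROC SORT preserves the relative order of rows with equal BY values by default (the EQUALS option), the pre-sorted shortest specimen is the one that survives for each species.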
Query step
The Query step is powerful and multi-faceted, capable of performing many different actions on your data. Not only can it sort data, but it can also remove duplicates with flexibility similar to the Sort step via the option to select distinct rows only in the results. Visit the documentation for a full description of step capabilities.
Say you’re analyzing SASHELP.ORSALES, a table containing data about the fictional sports store Orion Star.
ORSALES has 912 rows, but we can use the Query step to quickly find all of the product lines and categories that Orion Star carries. After adding and connecting the appropriate nodes, start by selecting Product_Line and Product_Category for the query output.
Then, navigate to the column options > More options and select Select distinct rows only. With this option toggled, the output will contain only the distinct combinations of Product_Line and Product_Category values from the source data.
The output has 12 rows, representing four product lines: Children, Clothes & Shoes, Outdoors, and Sports. Most lines have only one or two product categories; the Sports line, however, has eight.
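In SQL terms, this option corresponds to SELECT DISTINCT. A minimal PROC SQL sketch, assuming the output table name WORK.PRODUCT_MIX (not necessarily the exact query the step generates):

    /* Return each unique Product_Line / Product_Category pair once */
    proc sql;
        create table work.product_mix as
        select distinct Product_Line, Product_Category
        from sashelp.orsales;
    quit;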
Notice that this deduplication method differs from the Sort step method. The Sort step deduplicated on the sort columns only, but its output contained all source columns. In contrast, the Query step deduplicated on the selected columns, and its output contained only those columns. The option to select distinct rows applies to every selected column, so if all columns are selected for output, the distinct rows option behaves like the Remove Duplicates step.
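You can see this equivalence in code: SELECT DISTINCT compares rows on every column in the select list, so selecting every column (here with *) makes the comparison span the whole row. A sketch using the CLASSDUPS table from earlier:

    /* Distinct over all columns removes complete duplicate rows, */
    /* just like the Remove Duplicates step.                      */
    proc sql;
        create table work.class_distinct as
        select distinct *
        from work.classdups;
    quit;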
Considerations & Summary
These deduplication options are available with every SAS Studio license (much like the sorting options shown in my previous post).
As I’ve demonstrated, each step mentioned in this post has slightly different capabilities. You’ll need to choose your deduplication step based on exactly which fields you want to remove duplicates from and which fields you want to include in the output.
In this post, I’ve discussed three methods for deduplicating data in SAS Studio Flows. Want to know more about working with data in Flows? Check out my previous posts in this series on appending and sorting data. Look out for future posts on aggregating data and performing table lookups!
Which of these deduplication methods is most useful for your work? Will you use these methods for improving data quality, exploring data, or both? Let me know!
Find more articles from SAS Global Enablement and Learning here.