Validating your Values with the Compare Data Step

If you regularly recreate or duplicate important data sets, you may need to compare your new table to the original table to confirm that the structure and content of your data have stayed the same. You might also need to quickly identify differences between older and newer versions of a data set. Regardless of why you need to compare your data, the Compare Data step has you covered!

In this post, I'll show you how to use the Compare Data step in SAS Studio to validate your values with ease. I'll cover two comparison methods: comparing variables from two data sets and comparing two variables in the same data set. Keep reading to learn how to compare your variables and create output tables to further analyze differences in values!

The Compare Data Step

With the Compare Data step (recently released in the 2025.11 stable release for SAS Viya), you can compare variables from two data sets or compare two variables in the same data set. This step is a point-and-click implementation of PROC COMPARE, a Base SAS procedure typically used to validate data sets. The step returns comparisons of the properties of the data sets, variables, and observations, and it prints the differences between mismatched variables.

You can select specific columns to compare and choose the method used for comparison (Absolute, Exact, Percent, or Relative). In addition, you can create an output data set, control the appearance of the Compare Data step results, and optionally create a second output data set of summary statistics (for numeric variables only). Visit the documentation for more information on step capabilities.

The Compare Data step is available for use inside of a flow with the SAS Studio Analyst license.

Scenario

In this post, I'll be using the EMP95 and EMP96 data sets, which I've adapted from sample data available on the SAS Documentation website. EMP95 and EMP96 contain employee information across two separate years, including employee ID number, name, address, and salary data.

EMP95 has 10 observations. Notice that the data isn’t sorted by ID number.

EMP96 is similar, but with a few key differences. First, I've added a new variable, salary_95, which denotes the salary value stored for each employee in the EMP95 data set. For any new employees, salary_95 is missing. Additionally, EMP96 has 12 observations and has been sorted by ID number.

You can use the following code to create EMP95 and EMP96 in your own SAS environment:

data work.emp95;
   input #1 idnum $4. @6 name $15.
         #2 address $42.
         #3 salary 6.;
   datalines;
2388 James Schmidt
100 Apt. C Blount St. SW Raleigh NC 27693
92100
2457 Fred Williams
99 West Lane  Garner NC 27509
33190
2776 Robert Jones
12988 Wellington Farms Ave. Cary NC 27512
29025
8699 Jerry Capalleti
222 West L St. Oxford NC 27587
39985
2100 Lanny Engles
293 Manning Pl. Raleigh NC 27606
30998
9857 Kathy Krupski
1000 Taft Ave. Morrisville NC 27508
38756
0987 Dolly Lunford
2344 Persimmons Branch  Apex NC 27505
44010
3286 Hoa Nguyen
2818 Long St. Cary NC 27513
87734
6579 Bryan Samosky
3887 Charles Ave. Garner NC 27508
50234
3888 Kim Siu
5662 Magnolia Blvd Southeast Cary NC 27513
77558
;

data work.emp96;
   input #1 idnum $4. @6 name $15.
         #2 address $42.
         #3 salary 6.
         #4 salary_95 6.;
   datalines;
2388 James Schmidt
100 Apt. C Blount St. SW Raleigh NC 27693
92100
92100
2457 Fred Williams
99 West Lane  Garner NC 27509
33190
33190
2776 Robert Jones
12988 Wellington Farms Ave. Cary NC 27511
29025
29025
8699 Jerry Capalleti
222 West L St. Oxford NC 27587
39985
39985
3278 Mary Cravens
211 N. Cypress St. Cary NC 27512
35362
.
2100 Lanny Engles
293 Manning Pl. Raleigh NC 27606
30998
30998
9857 Kathy Krupski
100 Taft Ave. Morrisville NC 27508
40456
38756
0987 Dolly Lunford
2344 Persimmons Branch Trail Apex NC 27505
45110
44010
3286 Hoa Nguyen
2818 Long St. Cary NC 27513
89834
87734
6579 Bryan Samosky
3887 Charles Ave. Garner NC 27508
50234
50234
3888 Kim Siu
5662 Magnolia Blvd Southwest Cary NC 27513
79958
77558
6544 Roger Monday
3004 Crepe Myrtle Court Raleigh NC 27604
47007
.
;

/* Sort EMP96 by ID number, replacing the unsorted copy in WORK. */
proc sort data=work.emp96 out=work.emp96;
   by idnum;
run;

Comparing variables from two data sets

First, we’ll use the Compare Data step to compare all variables in EMP95 and EMP96.

I’ve added EMP95 and EMP96 to my flow and connected them to the input ports of the Compare Data step.

On the Select Tables tab, I can confirm I want to compare data between two data sets and review the properties of the base and comparison tables.

Next, on the Select Columns tab, I’ll choose how to match the data. Recall that EMP95 is not sorted, so I’ll choose to match data by ID variables. This will surface two tabs on the left side. First, I’ll select my ID variable, idnum.

Next, I can select which columns to compare. I’ll keep the default selection to compare all columns, but I could choose to only compare selected columns instead.
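
If you prefer code, the ID-matching choice corresponds to the ID statement in PROC COMPARE. The following is a minimal sketch, not necessarily the code the step generates. It assumes work.emp95 and work.emp96 exist as created earlier, and work.emp95_sorted is a hypothetical name for the sorted copy. PROC COMPARE expects both tables to be sorted by the ID variable (or indexed), so the sketch sorts EMP95 first.

```sas
/* Sort the base table by the ID variable. emp95_sorted is a hypothetical name. */
proc sort data=work.emp95 out=work.emp95_sorted;
   by idnum;
run;

/* Match observations on idnum and compare every variable the tables share. */
proc compare base=work.emp95_sorted compare=work.emp96;
   id idnum;
   /* To compare only selected columns, add a VAR statement, for example: */
   /* var address salary; */
run;
```

Variables that exist in only one table, such as salary_95, are reported in the variables summary rather than compared.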

On the Comparison Criteria tab, I can select the method for judging equality. I will keep the default selection (Exact). I'll leave the additional options for how to treat missing values unselected. Treating a missing value as equal to any value can mask real differences in this scenario, but this option can be valid for other comparison scenarios.
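
In PROC COMPARE, the comparison method is set with the METHOD= option, and the non-exact methods use CRITERION= as the tolerance. The sketch below is an illustration under assumptions: the tolerance of 1500 is arbitrary, work.emp95_sorted is a hypothetical name, and I'm assuming the step's missing-value check boxes correspond to the NOMISSBASE, NOMISSCOMP, and NOMISSING procedure options.

```sas
/* Sort the base table so observations can be matched by idnum. */
proc sort data=work.emp95 out=work.emp95_sorted;
   by idnum;
run;

/* METHOD=ABSOLUTE with CRITERION=1500 judges two numeric values equal  */
/* when they differ by no more than 1500. The threshold is illustrative. */
proc compare base=work.emp95_sorted compare=work.emp96
             method=absolute criterion=1500;
   id idnum;
   var salary;
run;
```

With METHOD=EXACT (the setting used in this post), any difference in salary is reported as unequal.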

Lastly, I can control output behavior on the Output Data tab. I’ll include the output data set and select all options:

Write an observation for each observation in base data
Write an observation for each observation in comparison data
Include difference value
Include values for the percent differences
Suppress observations when all values are equal

We have one numeric variable (salary), so I’ll include the output data set for summary statistics as well. I’ll keep the default maximum number of differences to print at 50, but the valid range is from 0 to 32,767.
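
For reference, here is a hedged sketch of how these output selections could be written directly in PROC COMPARE. The option names are standard procedure options, but the mapping from each check box to an option is my assumption, and work.emp95_sorted, work.emp_diffs, and work.emp_stats are hypothetical table names.

```sas
/* Sort the base table so observations can be matched by idnum. */
proc sort data=work.emp95 out=work.emp95_sorted;
   by idnum;
run;

proc compare base=work.emp95_sorted compare=work.emp96
             method=exact
             out=work.emp_diffs        /* main output table (hypothetical name)  */
             outbase                   /* a row for each base observation        */
             outcomp                   /* a row for each comparison observation  */
             outdif                    /* a row of difference values             */
             outpercent                /* a row of percent differences           */
             outnoequal                /* drop observations with no differences  */
             outstats=work.emp_stats   /* summary statistics (hypothetical name) */
             maxprint=50;              /* maximum number of differences to print */
   id idnum;
run;
```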

Now, I can run my flow and review the results on the Submitted Code and Results tab.

First, the Results sub-tab displays the traditional reports generated by the COMPARE procedure. These results include a data set summary, variables summary, observation summary, values comparison summary, and value comparison results. A portion of these results are shown above.

Then, I can review my output tables. The main output table includes details for rows that were not equal. For each of the original observations, the output includes the base table row, the comparison table row, a row displaying differences, and a row displaying the percent difference (for numeric values only). For text values, matching characters are represented with periods and mismatched characters are represented with Xs. For numeric values, the difference between the base and comparison value is calculated.
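
The OUT= data set from PROC COMPARE labels each row with a _TYPE_ variable (BASE, COMPARE, DIF, or PERCENT), so you can filter the table to a single kind of row. A minimal sketch, assuming the tables created earlier and using the hypothetical names work.emp95_sorted and work.emp_dif:

```sas
/* Sort the base table so observations can be matched by idnum. */
proc sort data=work.emp95 out=work.emp95_sorted;
   by idnum;
run;

/* Write only difference rows, and only for observations that differ. */
proc compare base=work.emp95_sorted compare=work.emp96
             out=work.emp_dif outdif outnoequal noprint;
   id idnum;
run;

/* Show the difference rows: X marks a mismatched character position. */
proc print data=work.emp_dif noobs;
   where _type_ = 'DIF';
run;
```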

Notice that for observation 1, the addresses are off by a few characters because the base observation's address did not include the street type, "Trail." Additionally, the salary value had a difference of 1100 or ~2.5%.
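
To check the salary figures by hand, you can reproduce the arithmetic in a DATA step. This sketch assumes the difference is calculated as comparison minus base and the percent difference is taken relative to the base value. The variable names are made up for illustration.

```sas
data _null_;
   base_salary    = 44010;   /* salary for idnum 0987 in EMP95 */
   compare_salary = 45110;   /* salary for idnum 0987 in EMP96 */
   dif    = compare_salary - base_salary;   /* difference         */
   pctdif = 100 * dif / base_salary;        /* percent difference */
   put dif= pctdif= 8.2;                    /* write both to the log */
run;
```

The log should show a difference of 1100 and a percent difference of roughly 2.5, in line with the output table.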

The output statistics table will only print results for numeric variables. The output includes N, MEAN, STD, MAX, MIN, STDERR, T, PROBT, NDIF, DIFMEANS, and R,RSQ for the salary variable. These statistics are calculated for the base table variable and the comparison table variable. The difference and percent difference are included as well.

Next, we'll do a comparison of two variables in the same data set.

Comparing variables in one data set

Recall that EMP96 has two salary variables: salary and salary_95. I'll use the Compare Data step to compare these two variables.

In a new flow, I’ll add a Compare Data step. On the Select Tables tab, I’ll specify that I want to compare data within a single data set. This will change the node to have only one input port. Then, I can add EMP96 to the flow and connect it to the input port.

Now, the Select Columns tab doesn't give me an option for how to match observations. Instead, I simply need to select my columns to compare from the table. As planned, I’ll select salary and salary_95.
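
In PROC COMPARE terms, comparing two variables in the same table means omitting COMPARE= and pairing a VAR statement with a WITH statement. A minimal sketch, assuming work.emp96 as created earlier (this is not necessarily the exact code the step generates):

```sas
/* Compare salary against salary_95 within the same data set. */
proc compare base=work.emp96;
   var salary;        /* base variable       */
   with salary_95;    /* comparison variable */
run;
```

Because both variables sit on the same row, observations are compared in place and no ID variable is needed.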

On the Comparison Criteria and Output Data tabs, I will replicate the settings I selected in the prior demonstration (comparing variables from two data sets). Then, I can run my flow and review the results.

I get the same type of reports on the Results sub-tab, but I'm most interested in the value comparison results at the end. Here, I can quickly see which values were mismatched and what the difference and percent difference values were.

This information is also reflected in the output table. The output table will appear simpler since we are only comparing one variable to another.

The output statistics table appears slightly different because a _WITH_ variable is added to display the name of the comparison variable, salary_95. Otherwise, the types of statistics calculated remain the same.
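
To see the _WITH_ column for yourself, you can write the statistics table with OUTSTATS= and print it. This is a sketch under the same assumptions as before, with work.salary_stats as a hypothetical table name:

```sas
/* Write summary statistics for the salary vs. salary_95 comparison. */
proc compare base=work.emp96 outstats=work.salary_stats noprint;
   var salary;
   with salary_95;
run;

/* _VAR_ holds the base variable name and _WITH_ the comparison variable name. */
proc print data=work.salary_stats noobs;
run;
```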

Summary

In this post, I've demonstrated how to use the Compare Data step to compare variables from two data sets and compare two variables from the same data set. If you want to learn another data verification technique in SAS Studio, check out my previous post Data Verification Made Easy with Loqate in SAS Studio.

Do you compare variables or data sets often? Do you find the Compare Data step useful for your data validation needs? Share your thoughts, questions, and feedback below!

Find more articles from SAS Global Enablement and Learning here.
