Duplicates in our data can badly skew the results of our analysis. In this post, I’ll cover data deduplication using PROC SORT with the NODUPKEY, OUT=, and DUPOUT= options. We will also look at using PROC SQL and PROC FedSQL for deduplication. It’s possible to deduplicate data that has messy text using PROC SORT if you have numeric columns that can be combined to make a row unique. Otherwise, the text data must be standardized so that identical values can be detected. First, a peek at the data. We’ll be working with the crime table. In this image, the rows we need to remove as duplicates are highlighted in red. The reason they should be discarded is noted in the Reason column. Of course, the actual crime table does not include the Disposition and Reason columns, or this would be a really easy task 😁
Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest
2 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest
3 | 2/1/2024 | 1/30/2024 | 0:27:25 | HOLLYWOOD | 622 | 52 | F | W | ADULT ARREST
4 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest
5 | 2/2/2024 | 2/2/2024 | 0:28:52 | WILSHIRE | 724 | 0 | X | X | ADULT ARREST
6 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest
7 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | x | x | ARREST
8 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest
9 | 2/4/2024 | 2/4/2024 | 0:18:20 | MISSION | 1908 | 37 | F | H | ADULT ARREST
10 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest
11 | 2/4/2024 | 2/4/2024 | 0:26:40 | HARBOR | 587 | 39 | F | W | ADULT ARREST
12 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest
13 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest
14 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywd | 645 | 25 | f | b | ARREST
15 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest
16 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest
17 | 2/4/2024 | 2/4/2024 | 0:35:20 | OLYMPIC | 2055 | 22 | F | H | ADULT ARREST
18 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest
19 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest
20 | 2/6/2024 | 2/6/2024 | 0:30:15 | HARBOR | 567 | 45 | F | H | ADULT ARREST
21 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest
22 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | f | o | ARREST
23 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest
24 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arest
25 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest
26 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | x | x | ARREST
Our goal is to deduplicate this table, producing this final result:
Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest
2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest
3 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest
4 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest
5 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest
6 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest
7 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest
8 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest
9 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest
10 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest
11 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest
12 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest
13 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest
14 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest
15 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest
As I look over the data, I can see that I’d need to standardize the text fields before I could use them for deduplication. But this table has a lot of numeric columns, and I’m willing to bet that the combination of Date_Rptd, DATE_OCC, TIME_OCC, Rpt_Dist_No, and Vict_Age will be unique for each row. In my previous post, we discussed sorting using PROC SORT – but that procedure can do so much more! I can use PROC SORT with the NODUPKEY option to deduplicate this data if I use all of these numeric variables in my BY statement. Even better, the _NUMERIC_ special variable list specifies every numeric variable in the data set, so I don’t have to type them individually. I’ll also want to see the rows that got rejected, just so I can verify that my scheme is working; the DUPOUT= option handles that. The PROC SORT code looks like this:
proc sort data=crime nodupkey
          out=nodups      /* unique rows land here */
          dupout=dups;    /* rejected duplicate rows land here */
   by _numeric_;          /* all numeric variables form the deduplication key */
run;
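Before scanning the output, a quick sanity check on the row counts is reassuring. Here's one way to do it (a minimal sketch using PROC SQL; the kept and rejected rows should add up to the original 26):

/* Row-count check: nodups plus dups should equal the original crime table */
proc sql;
   select count(*) as original_rows from crime;
   select count(*) as kept_rows     from nodups;
   select count(*) as rejected_rows from dups;
quit;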
The nodups data set contains the 15 unique rows:
Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest
2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest
3 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest
4 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest
5 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest
6 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest
7 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest
8 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest
9 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest
10 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest
11 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest
12 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest
13 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest
14 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest
15 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest
And the dups data set contains the 11 rows that were rejected as duplicates:
Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | f | o | ARREST
2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arest
3 | 2/4/2024 | 2/4/2024 | 0:18:20 | MISSION | 1908 | 37 | F | H | ADULT ARREST
4 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | x | x | ARREST
5 | 2/4/2024 | 2/4/2024 | 0:26:40 | HARBOR | 587 | 39 | F | W | ADULT ARREST
6 | 2/1/2024 | 1/30/2024 | 0:27:25 | HOLLYWOOD | 622 | 52 | F | W | ADULT ARREST
7 | 2/2/2024 | 2/2/2024 | 0:28:52 | WILSHIRE | 724 | 0 | X | X | ADULT ARREST
8 | 2/6/2024 | 2/6/2024 | 0:30:15 | HARBOR | 567 | 45 | F | H | ADULT ARREST
9 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywd | 645 | 25 | f | b | ARREST
10 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | x | x | ARREST
11 | 2/4/2024 | 2/4/2024 | 0:35:20 | OLYMPIC | 2055 | 22 | F | H | ADULT ARREST
But what if there weren’t enough numeric columns to uniquely identify each row? For example, consider the table work.crime2:
Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/1/2024 | HOLLYWOOD | F | W | ADULT ARREST
2 | 2/1/2024 | Hollywood | F | W | Adult Arrest
3 | 2/2/2024 | WILSHIRE | X | X | ADULT ARREST
4 | 2/2/2024 | Wilshire | X | X | Adult Arrest
5 | 2/4/2024 | HARBOR | F | W | ADULT ARREST
6 | 2/4/2024 | Harbor | F | W | Adult Arrest
7 | 2/4/2024 | Harbor | F | H | Adult Arrest
8 | 2/4/2024 | Hollywd | f | b | ARREST
9 | 2/4/2024 | Hollywood | F | B | Adult Arrest
10 | 2/4/2024 | MISSION | F | H | ADULT ARREST
11 | 2/4/2024 | Mission | F | H | Adult Arrest
12 | 2/4/2024 | OLYMPIC | F | H | ADULT ARREST
13 | 2/4/2024 | Olympic | F | H | Adult Arrest
14 | 2/4/2024 | Van Nuys | X | X | Adult Arrest
15 | 2/4/2024 | Van Nuys | x | x | ARREST
16 | 2/5/2024 | Northeast | F | B | Juv Arrest
17 | 2/6/2024 | HARBOR | F | H | ADULT ARREST
18 | 2/6/2024 | Harbor | F | H | Adult Arrest
19 | 2/6/2024 | N Hollywood | F | W | Adult Arrest
20 | 2/6/2024 | Wilshire | M | W | Adult Arrest
21 | 2/7/2024 | Newton | M | A | Adult Arrest
22 | 2/7/2024 | Newton | M | A | Adult Arest
23 | 2/7/2024 | Southwest | X | X | Adult Arrest
24 | 2/7/2024 | Southwest | x | x | ARREST
25 | 2/7/2024 | West LA | F | O | Adult Arrest
26 | 2/7/2024 | West LA | f | o | ARREST
The character columns still contain irregularities, but now the only numeric column is Date_Rptd, and the values there are not unique. If I reuse my previous trick of sorting BY _NUMERIC_, many rows are erroneously discarded as duplicates: instead of the expected 15 rows, the nodups data set includes only 6.
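For reference, a sketch of that over-aggressive step looks like this – essentially the same PROC SORT as before, just pointed at crime2:

proc sort data=crime2 nodupkey
          out=nodups dupout=dups;
   by _numeric_;   /* Date_Rptd is the only numeric column, so it alone decides uniqueness */
run;

The under-populated nodups data set: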
Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/1/2024 | HOLLYWOOD | F | W | ADULT ARREST
2 | 2/2/2024 | WILSHIRE | X | X | ADULT ARREST
3 | 2/4/2024 | HARBOR | F | W | ADULT ARREST
4 | 2/5/2024 | Northeast | F | B | Juv Arrest
5 | 2/6/2024 | HARBOR | F | H | ADULT ARREST
6 | 2/7/2024 | Southwest | x | x | ARREST
To ensure I remove only rows that are complete duplicates of another, I’ll have to sort on the contents of all columns. SAS provides the handy _ALL_ special variable list, which specifies all of the variables in a data set and saves me from having to type them individually. But the data in those character columns is messy, so I’ll need to standardize it before relying on it for de-duplication. In one of my previous posts, Coding for Data Quality in SAS Viya Part 2 – Standardization, I showcased some powerful tools available to SAS Viya programmers (and SAS 9 programmers who license SAS Data Quality) for standardizing data. In this case, I’ll use only Base SAS functionality in the DATA step, which works fine for the simple issues found in this small data set. Note that the DATA step uses two more handy variable lists: _CHARACTER_ in the ARRAY statement and the name prefix list in the DROP statement:
data crime_std;
   set crime2;
   array charvars[*] _character_;
   /* Drop all variables with names starting with an underscore */
   drop _:;
   /* Standardize all character values to proper case */
   do _i=1 to dim(charvars);
      charvars[_i]=propcase(charvars[_i]);
   end;
   /* Correct the known spelling errors */
   Status_Desc=tranwrd(Status_Desc,'Arest','Arrest');
   AREA_NAME=tranwrd(AREA_NAME,'Hollywd','Hollywood');
   /* Standardize the truncated values for Adult Arrest */
   if Status_Desc='Arrest' then Status_Desc='Adult Arrest';
run;
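Before deduplicating, it’s worth a quick look at the cleaned-up values. A simple PROC FREQ (an optional check, not required for the deduplication itself) makes any leftover inconsistencies easy to spot, since each value should now appear with exactly one spelling and casing:

/* Optional check: each standardized value should appear with a single spelling */
proc freq data=crime_std;
   tables AREA_NAME Vict_Sex Vict_Descent Status_Desc / nocum;
run;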
With the character data standardized, deduplication will be easy. Here is the crime_std table:
Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/1/2024 | Hollywood | F | W | Adult Arrest
2 | 2/1/2024 | Hollywood | F | W | Adult Arrest
3 | 2/2/2024 | Wilshire | X | X | Adult Arrest
4 | 2/2/2024 | Wilshire | X | X | Adult Arrest
5 | 2/4/2024 | Harbor | F | W | Adult Arrest
6 | 2/4/2024 | Harbor | F | H | Adult Arrest
7 | 2/4/2024 | Harbor | F | W | Adult Arrest
8 | 2/4/2024 | Hollywood | F | B | Adult Arrest
9 | 2/4/2024 | Hollywood | F | B | Adult Arrest
10 | 2/4/2024 | Mission | F | H | Adult Arrest
11 | 2/4/2024 | Mission | F | H | Adult Arrest
12 | 2/4/2024 | Olympic | F | H | Adult Arrest
13 | 2/4/2024 | Olympic | F | H | Adult Arrest
14 | 2/4/2024 | Van Nuys | X | X | Adult Arrest
15 | 2/4/2024 | Van Nuys | X | X | Adult Arrest
16 | 2/5/2024 | Northeast | F | B | Juv Arrest
17 | 2/6/2024 | Harbor | F | H | Adult Arrest
18 | 2/6/2024 | Harbor | F | H | Adult Arrest
19 | 2/6/2024 | N Hollywood | F | W | Adult Arrest
20 | 2/6/2024 | Wilshire | M | W | Adult Arrest
21 | 2/7/2024 | Newton | M | A | Adult Arrest
22 | 2/7/2024 | Newton | M | A | Adult Arrest
23 | 2/7/2024 | Southwest | X | X | Adult Arrest
24 | 2/7/2024 | Southwest | X | X | Adult Arrest
25 | 2/7/2024 | West La | F | O | Adult Arrest
26 | 2/7/2024 | West La | F | O | Adult Arrest
I’ll use PROC SORT NODUPKEY with a BY _ALL_ statement:
proc sort data=crime_std nodupkey
          out=nodups
          dupout=dups;
   by _all_;
run;
This produces the desired result and, as a bonus, the resulting data is much more presentable:
Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc
1 | 2/1/2024 | Hollywood | F | W | Adult Arrest
2 | 2/2/2024 | Wilshire | X | X | Adult Arrest
3 | 2/4/2024 | Harbor | F | H | Adult Arrest
4 | 2/4/2024 | Harbor | F | W | Adult Arrest
5 | 2/4/2024 | Hollywood | F | B | Adult Arrest
6 | 2/4/2024 | Mission | F | H | Adult Arrest
7 | 2/4/2024 | Olympic | F | H | Adult Arrest
8 | 2/4/2024 | Van Nuys | X | X | Adult Arrest
9 | 2/5/2024 | Northeast | F | B | Juv Arrest
10 | 2/6/2024 | Harbor | F | H | Adult Arrest
11 | 2/6/2024 | N Hollywood | F | W | Adult Arrest
12 | 2/6/2024 | Wilshire | M | W | Adult Arrest
13 | 2/7/2024 | Newton | M | A | Adult Arrest
14 | 2/7/2024 | Southwest | X | X | Adult Arrest
15 | 2/7/2024 | West La | F | O | Adult Arrest
With the data pre-standardized, I can use either PROC SQL or PROC FedSQL to do my deduplicating, if desired. Both of these steps produce the same set of unique rows as the PROC SORT output (SELECT DISTINCT doesn’t guarantee row order, but the contents are identical):
proc sql;
   /* Create the de-duplicated table */
   create table nodups_sql as
      select distinct *
         from crime_std;
quit;

proc FedSQL;
   /* First, make sure the table does not exist.
      FedSQL does not overwrite existing tables. */
   drop table nodups_FedSQL force;
   /* Create the de-duplicated table */
   create table nodups_FedSQL as
      select distinct *
         from crime_std;
quit;
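If you want proof that the SQL output matches the PROC SORT output, PROC COMPARE makes quick work of it. Here’s a small sketch (sorting the SQL result first, since SELECT DISTINCT doesn’t guarantee row order, while the nodups table is already sorted BY _ALL_):

/* Align row order, then compare; expect no unequal values in the report */
proc sort data=nodups_sql out=nodups_sql_sorted;
   by _all_;
run;

proc compare base=nodups compare=nodups_sql_sorted;
run;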
So today we covered some of the challenges and pitfalls of deduplicating data, and we saw how the special variable lists _NUMERIC_, _CHARACTER_, _ALL_, and the name prefix list can make short work of long lists of variables when writing SAS programs. Do you have any favorite tips for de-duplicating data? What other special variable lists have you found useful in your SAS code?
Until next time, may the SAS be with you!
Mark
Grab the ZIP file containing the code and data for this blog series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip