Duplicates in our data can badly skew the results of our analysis. In this post, I’ll cover data deduplication using PROC SORT with the NODUPKEY, OUT=, and DUPOUT= options. We will also look at using PROC SQL and PROC FedSQL for deduplication. It’s possible to deduplicate data that has messy text using PROC SORT if you have numeric columns that can be combined to make a row unique. Otherwise, the text data must be standardized so that identical values can be detected. First, a peek at the data. We’ll be working with the crime table. In this image, the rows we need to remove as duplicates are highlighted in red. The reason they should be discarded is noted in the Reason column. Of course, the actual crime table does not include the Disposition and Reason columns, or this would be a really easy task 😁
Row
Date_Rptd
DATE_OCC
TIME_OCC
AREA_NAME
Rpt_Dist_No
Vict_Age
Vict_Sex
Vict_Descent
Status_Desc
1
2/6/2024
1/21/2024
0:16:40
N Hollywood
1503
31
F
W
Adult Arrest
2
2/1/2024
1/30/2024
0:27:25
Hollywood
622
52
F
W
Adult Arrest
3
2/1/2024
1/30/2024
0:27:25
HOLLYWOOD
622
52
F
W
ADULT ARREST
4
2/2/2024
2/2/2024
0:28:52
Wilshire
724
0
X
X
Adult Arrest
5
2/2/2024
2/2/2024
0:28:52
WILSHIRE
724
0
X
X
ADULT ARREST
6
2/4/2024
2/3/2024
0:32:15
Van Nuys
926
0
X
X
Adult Arrest
7
2/4/2024
2/3/2024
0:32:15
Van Nuys
926
0
x
x
ARREST
8
2/4/2024
2/4/2024
0:18:20
Mission
1908
37
F
H
Adult Arrest
9
2/4/2024
2/4/2024
0:18:20
MISSION
1908
37
F
H
ADULT ARREST
10
2/4/2024
2/4/2024
0:26:40
Harbor
587
39
F
W
Adult Arrest
11
2/4/2024
2/4/2024
0:26:40
HARBOR
587
39
F
W
ADULT ARREST
12
2/5/2024
2/4/2024
0:28:40
Northeast
1152
33
F
B
Juv Arrest
13
2/4/2024
2/4/2024
0:30:45
Hollywood
645
25
F
B
Adult Arrest
14
2/4/2024
2/4/2024
0:30:45
Hollywd
645
25
f
b
ARREST
15
2/4/2024
2/4/2024
0:34:15
Harbor
504
0
F
H
Adult Arrest
16
2/4/2024
2/4/2024
0:35:20
Olympic
2055
22
F
H
Adult Arrest
17
2/4/2024
2/4/2024
0:35:20
OLYMPIC
2055
22
F
H
ADULT ARREST
18
2/6/2024
2/6/2024
0:14:00
Wilshire
787
51
M
W
Adult Arrest
19
2/6/2024
2/6/2024
0:30:15
Harbor
567
45
F
H
Adult Arrest
20
2/6/2024
2/6/2024
0:30:15
HARBOR
567
45
F
H
ADULT ARREST
21
2/7/2024
2/7/2024
0:00:15
West LA
853
55
F
O
Adult Arrest
22
2/7/2024
2/7/2024
0:00:15
West LA
853
55
f
o
ARREST
23
2/7/2024
2/7/2024
0:13:45
Newton
1309
0
M
A
Adult Arrest
24
2/7/2024
2/7/2024
0:13:45
Newton
1309
0
M
A
Adult Arest
25
2/7/2024
2/7/2024
0:23:20
Southwest
393
0
X
X
Adult Arrest
26
2/7/2024
2/7/2024
0:23:20
Southwest
393
0
x
x
ARREST
Rows 2 and 3,4 and 5, 8 and 9, 10 and 11, 16 and 17, and 19 and 20 were probably duplicated due to inconsistent text value casing. Row 7 has a truncated text value for Status_Desc, but is otherwise identical to row 6. Row 24 has a misspelling in the value for Status_Desc, but is otherwise identical to row 23.
Our goal is to deduplicate this table, producing this final result:
Row
Date_Rptd
DATE_OCC
TIME_OCC
AREA_NAME
Rpt_Dist_No
Vict_Age
Vict_Sex
Vict_Descent
Status_Desc
1
2/7/2024
2/7/2024
0:00:15
West LA
853
55
F
O
Adult Arrest
2
2/7/2024
2/7/2024
0:13:45
Newton
1309
0
M
A
Adult Arrest
3
2/6/2024
2/6/2024
0:14:00
Wilshire
787
51
M
W
Adult Arrest
4
2/6/2024
1/21/2024
0:16:40
N Hollywood
1503
31
F
W
Adult Arrest
5
2/4/2024
2/4/2024
0:18:20
Mission
1908
37
F
H
Adult Arrest
6
2/7/2024
2/7/2024
0:23:20
Southwest
393
0
X
X
Adult Arrest
7
2/4/2024
2/4/2024
0:26:40
Harbor
587
39
F
W
Adult Arrest
8
2/1/2024
1/30/2024
0:27:25
Hollywood
622
52
F
W
Adult Arrest
9
2/5/2024
2/4/2024
0:28:40
Northeast
1152
33
F
B
Juv Arrest
10
2/2/2024
2/2/2024
0:28:52
Wilshire
724
0
X
X
Adult Arrest
11
2/6/2024
2/6/2024
0:30:15
Harbor
567
45
F
H
Adult Arrest
12
2/4/2024
2/4/2024
0:30:45
Hollywood
645
25
F
B
Adult Arrest
13
2/4/2024
2/3/2024
0:32:15
Van Nuys
926
0
X
X
Adult Arrest
14
2/4/2024
2/4/2024
0:34:15
Harbor
504
0
F
H
Adult Arrest
15
2/4/2024
2/4/2024
0:35:20
Olympic
2055
22
F
H
Adult Arrest
As I look over the data, I can see that I’d need to standardize the text fields before I could use them for deduplication. But this table has a lot of numeric columns, and I’m willing to bet that the combination of Date_Rptd, DATE_OCC, TIME_OCC, Rpt_Dist_No, and Vict_Age will be unique for each row. In my previous post, we discussed sorting using PROC SORT – but that procedure can do so much more! I can use PROC SORT with the NODUPKEY option to deduplicate this data if I use all of these numeric variables in my BY statement. I’ll want to see the rows that got rejected, just so I can verify that my scheme is working. I can use the DUPOUT= option for that. The PROC SORT code looks like this:
proc sort data=crime nodupkey
out=nodups dupout=dups;
by _numeric_;
run;
_NUMERIC_ is a very handy special SAS name variable list that specifies all numeric variables in a data set without having to list them individually. The results are just what I had hoped:
Row
Date_Rptd
DATE_OCC
TIME_OCC
AREA_NAME
Rpt_Dist_No
Vict_Age
Vict_Sex
Vict_Descent
Status_Desc
1
2/7/2024
2/7/2024
0:00:15
West LA
853
55
F
O
Adult Arrest
2
2/7/2024
2/7/2024
0:13:45
Newton
1309
0
M
A
Adult Arrest
3
2/6/2024
2/6/2024
0:14:00
Wilshire
787
51
M
W
Adult Arrest
4
2/6/2024
1/21/2024
0:16:40
N Hollywood
1503
31
F
W
Adult Arrest
5
2/4/2024
2/4/2024
0:18:20
Mission
1908
37
F
H
Adult Arrest
6
2/7/2024
2/7/2024
0:23:20
Southwest
393
0
X
X
Adult Arrest
7
2/4/2024
2/4/2024
0:26:40
Harbor
587
39
F
W
Adult Arrest
8
2/1/2024
1/30/2024
0:27:25
Hollywood
622
52
F
W
Adult Arrest
9
2/5/2024
2/4/2024
0:28:40
Northeast
1152
33
F
B
Juv Arrest
10
2/2/2024
2/2/2024
0:28:52
Wilshire
724
0
X
X
Adult Arrest
11
2/6/2024
2/6/2024
0:30:15
Harbor
567
45
F
H
Adult Arrest
12
2/4/2024
2/4/2024
0:30:45
Hollywood
645
25
F
B
Adult Arrest
13
2/4/2024
2/3/2024
0:32:15
Van Nuys
926
0
X
X
Adult Arrest
14
2/4/2024
2/4/2024
0:34:15
Harbor
504
0
F
H
Adult Arrest
15
2/4/2024
2/4/2024
0:35:20
Olympic
2055
22
F
H
Adult Arrest
And the rejected rows are as expected:
Row
Date_Rptd
DATE_OCC
TIME_OCC
AREA_NAME
Rpt_Dist_No
Vict_Age
Vict_Sex
Vict_Descent
Status_Desc
1
2/7/2024
2/7/2024
0:00:15
West LA
853
55
f
o
ARREST
2
2/7/2024
2/7/2024
0:13:45
Newton
1309
0
M
A
Adult Arest
3
2/4/2024
2/4/2024
0:18:20
MISSION
1908
37
F
H
ADULT ARREST
4
2/7/2024
2/7/2024
0:23:20
Southwest
393
0
x
x
ARREST
5
2/4/2024
2/4/2024
0:26:40
HARBOR
587
39
F
W
ADULT ARREST
6
2/1/2024
1/30/2024
0:27:25
HOLLYWOOD
622
52
F
W
ADULT ARREST
7
2/2/2024
2/2/2024
0:28:52
WILSHIRE
724
0
X
X
ADULT ARREST
8
2/6/2024
2/6/2024
0:30:15
HARBOR
567
45
F
H
ADULT ARREST
9
2/4/2024
2/4/2024
0:30:45
Hollywd
645
25
f
b
ARREST
10
2/4/2024
2/3/2024
0:32:15
Van Nuys
926
0
x
x
ARREST
11
2/4/2024
2/4/2024
0:35:20
OLYMPIC
2055
22
F
H
ADULT ARREST
But, what if there were not sufficient numeric columns to create a distinct identity for each row? For example, table work.crimes2:
Row
Date_Rptd
AREA_NAME
Vict_Sex
Vict_Descent
Status_Desc
1
2/1/2024
HOLLYWOOD
F
W
ADULT ARREST
2
2/1/2024
Hollywood
F
W
Adult Arrest
3
2/2/2024
WILSHIRE
X
X
ADULT ARREST
4
2/2/2024
Wilshire
X
X
Adult Arrest
5
2/4/2024
HARBOR
F
W
ADULT ARREST
6
2/4/2024
Harbor
F
W
Adult Arrest
7
2/4/2024
Harbor
F
H
Adult Arrest
8
2/4/2024
Hollywd
f
b
ARREST
9
2/4/2024
Hollywood
F
B
Adult Arrest
10
2/4/2024
MISSION
F
H
ADULT ARREST
11
2/4/2024
Mission
F
H
Adult Arrest
12
2/4/2024
OLYMPIC
F
H
ADULT ARREST
13
2/4/2024
Olympic
F
H
Adult Arrest
14
2/4/2024
Van Nuys
X
X
Adult Arrest
15
2/4/2024
Van Nuys
x
x
ARREST
16
2/5/2024
Northeast
F
B
Juv Arrest
17
2/6/2024
HARBOR
F
H
ADULT ARREST
18
2/6/2024
Harbor
F
H
Adult Arrest
19
2/6/2024
N Hollywood
F
W
Adult Arrest
20
2/6/2024
Wilshire
M
W
Adult Arrest
21
2/7/2024
Newton
M
A
Adult Arrest
22
2/7/2024
Newton
M
A
Adult Arest
23
2/7/2024
Southwest
X
X
Adult Arrest
24
2/7/2024
Southwest
x
x
ARREST
25
2/7/2024
West LA
F
O
Adult Arrest
26
2/7/2024
West LA
f
o
ARREST
The character columns still contain irregularities, but now the only numeric column is Date_Rptd – and the values there are not unique. Using my previous trick of sorting BY _NUMERIC_, many rows are erroneously discarded as duplicates. Instead of the expected 15 rows, the nodups data set includes only 6:
Row
Date_Rptd
AREA_NAME
Vict_Sex
Vict_Descent
Status_Desc
1
2/1/2024
HOLLYWOOD
F
W
ADULT ARREST
2
2/2/2024
WILSHIRE
X
X
ADULT ARREST
3
2/4/2024
HARBOR
F
W
ADULT ARREST
4
2/5/2024
Northeast
F
B
Juv Arrest
5
2/6/2024
HARBOR
F
H
ADULT ARREST
6
2/7/2024
Southwest
x
x
ARREST
To ensure I remove only rows that are complete duplicates of another, I’ll have to sort on the contents of all columns. SAS provides that handy _ALL_ special SAS name variable list that specifies all of the variables in a data set, saving me from having to type them all individually. But the data in those character columns is messy. I’ll need to standardize it before relying on it for de-duplication. In one of my previous posts, Coding for Data Quality in SAS Viya Part 2 – Standardization, I showcased some powerful tools available to SAS Viya programmers (and SAS 9 programmers who license SAS Data Quality) for standardizing data. In this case, I’ll use only Base SAS functionality in the DATA step. This will work fine for the simple issues found in this small data set. Note that the DATA step uses another handy variable list, this time the name prefix list:
data crime_std;
set crime2;
array charvars[*] _character_;
/* Drop all variables with names starting with an underscore */
drop _:;
/* Make all character values UPPER case */
do _i=1 to dim(charvars);
charvars[_i]= propcase(charvars[_i]);
end;
/* Correct the known spelling errors */
Status_Desc=tranwrd(Status_Desc,'Arest','Arrest');
AREA_NAME=tranwrd(AREA_NAME,'Hollywd','Hollywood');
/* Standardize the truncated values for Adult Arrest */
if Status_Desc='Arrest' then Status_Desc='Adult Arrest';
run;
With the character data standardized, deduplication will be easy:
Row
Date_Rptd
AREA_NAME
Vict_Sex
Vict_Descent
Status_Desc
1
2/1/2024
Hollywood
F
W
Adult Arrest
2
2/1/2024
Hollywood
F
W
Adult Arrest
3
2/2/2024
Wilshire
X
X
Adult Arrest
4
2/2/2024
Wilshire
X
X
Adult Arrest
5
2/4/2024
Harbor
F
W
Adult Arrest
6
2/4/2024
Harbor
F
H
Adult Arrest
7
2/4/2024
Harbor
F
W
Adult Arrest
8
2/4/2024
Hollywood
F
B
Adult Arrest
9
2/4/2024
Hollywood
F
B
Adult Arrest
10
2/4/2024
Mission
F
H
Adult Arrest
11
2/4/2024
Mission
F
H
Adult Arrest
12
2/4/2024
Olympic
F
H
Adult Arrest
13
2/4/2024
Olympic
F
H
Adult Arrest
14
2/4/2024
Van Nuys
X
X
Adult Arrest
15
2/4/2024
Van Nuys
X
X
Adult Arrest
16
2/5/2024
Northeast
F
B
Juv Arrest
17
2/6/2024
Harbor
F
H
Adult Arrest
18
2/6/2024
Harbor
F
H
Adult Arrest
19
2/6/2024
N Hollywood
F
W
Adult Arrest
20
2/6/2024
Wilshire
M
W
Adult Arrest
21
2/7/2024
Newton
M
A
Adult Arrest
22
2/7/2024
Newton
M
A
Adult Arrest
23
2/7/2024
Southwest
X
X
Adult Arrest
24
2/7/2024
Southwest
X
X
Adult Arrest
25
2/7/2024
West La
F
O
Adult Arrest
26
2/7/2024
West La
F
O
Adult Arrest
I’ll use PROC SORT NODUPKEY with a BY _ALL_ statement:
proc sort data=crime_std
out=nodups
dupout=dups
nodupkey
;
by _all_;
run;
This produces the desired result and, as a bonus, the resulting data is much more presentable:
Row
Date_Rptd
AREA_NAME
Vict_Sex
Vict_Descent
Status_Desc
1
2/1/2024
Hollywood
F
W
Adult Arrest
2
2/2/2024
Wilshire
X
X
Adult Arrest
3
2/4/2024
Harbor
F
H
Adult Arrest
4
2/4/2024
Harbor
F
W
Adult Arrest
5
2/4/2024
Hollywood
F
B
Adult Arrest
6
2/4/2024
Mission
F
H
Adult Arrest
7
2/4/2024
Olympic
F
H
Adult Arrest
8
2/4/2024
Van Nuys
X
X
Adult Arrest
9
2/5/2024
Northeast
F
B
Juv Arrest
10
2/6/2024
Harbor
F
H
Adult Arrest
11
2/6/2024
N Hollywood
F
W
Adult Arrest
12
2/6/2024
Wilshire
M
W
Adult Arrest
13
2/7/2024
Newton
M
A
Adult Arrest
14
2/7/2024
Southwest
X
X
Adult Arrest
15
2/7/2024
West La
F
O
Adult Arrest
With the data pre-standardized, I can use either PROC SQL or PROC FedSQL to do my deduplicating, if desired. Both of these steps will produce results identical to the PROC SORT output:
proc sql;
/* Create the de-duplicated table */
create table nodups_sql as
select distinct *
from crime_std
;
quit;
proc FedSQL;
/* First, make sure the table does not exist.
FedSQL does not overwrite existing tables */
drop table nodups_FedSQL force;
/* Create the de-duplicated table */
create table nodups_FedSQL as
select distinct *
from crime_std
;
quit;
So today we covered some of the challenges and pitfalls when deduplicating data and discovered how special variable lists _NUMERIC_, _ALL_, and prefix: can make short work of long lists of variables when writing SAS programs. Do you have any favorite tips for de-duplicating data? What other special variable lists have you found useful in your SAS code?
Until next time, may the SAS be with you! Mark
Grab the ZIP file containing the code and data for this blog series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip Links to prior posts in this series:
Part 1 – Append
Part 2 – Sort
... View more