
Manipulating Data in Base SAS® Part 3 – Deduplicate

Started 06-07-2024
Modified 06-24-2024

Duplicates in our data can badly skew the results of our analysis. In this post, I’ll cover deduplicating data using PROC SORT with the NODUPKEY, OUT=, and DUPOUT= options. We’ll also look at using PROC SQL and PROC FedSQL for deduplication. If you have numeric columns whose combination makes each row unique, PROC SORT can deduplicate data even when the text values are messy. Otherwise, the text data must be standardized first so that identical values can be detected.

First, a peek at the data. We’ll be working with the crime table, shown below. The rows we need to remove as duplicates, and the reasons they should be discarded, are called out after the table. Of course, the actual crime table doesn’t come annotated with the reasons, or this would be a really easy task 😁

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest |
| 2 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest |
| 3 | 2/1/2024 | 1/30/2024 | 0:27:25 | HOLLYWOOD | 622 | 52 | F | W | ADULT ARREST |
| 4 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest |
| 5 | 2/2/2024 | 2/2/2024 | 0:28:52 | WILSHIRE | 724 | 0 | X | X | ADULT ARREST |
| 6 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest |
| 7 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | x | x | ARREST |
| 8 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest |
| 9 | 2/4/2024 | 2/4/2024 | 0:18:20 | MISSION | 1908 | 37 | F | H | ADULT ARREST |
| 10 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest |
| 11 | 2/4/2024 | 2/4/2024 | 0:26:40 | HARBOR | 587 | 39 | F | W | ADULT ARREST |
| 12 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest |
| 13 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest |
| 14 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywd | 645 | 25 | f | b | ARREST |
| 15 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest |
| 16 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest |
| 17 | 2/4/2024 | 2/4/2024 | 0:35:20 | OLYMPIC | 2055 | 22 | F | H | ADULT ARREST |
| 18 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest |
| 19 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest |
| 20 | 2/6/2024 | 2/6/2024 | 0:30:15 | HARBOR | 567 | 45 | F | H | ADULT ARREST |
| 21 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest |
| 22 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | f | o | ARREST |
| 23 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest |
| 24 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arest |
| 25 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest |
| 26 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | x | x | ARREST |


Rows 2 and 3, 4 and 5, 8 and 9, 10 and 11, 16 and 17, and 19 and 20 were probably duplicated due to inconsistent casing of the text values. Rows 14, 22, and 26 combine casing problems with a truncated Status_Desc value (“ARREST” instead of “Adult Arrest”), and row 14 also misspells the area name as “Hollywd”. Row 7 has a truncated text value for Status_Desc, but is otherwise identical to row 6. Row 24 has a misspelling in the value for Status_Desc, but is otherwise identical to row 23.

 

Our goal is to deduplicate this table, producing this final result:

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest |
| 2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest |
| 3 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest |
| 4 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest |
| 5 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest |
| 6 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest |
| 7 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest |
| 8 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest |
| 9 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest |
| 10 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest |
| 11 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest |
| 12 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest |
| 13 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest |
| 14 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest |
| 15 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest |

 

As I look over the data, I can see that I’d need to standardize the text fields before I could use them for deduplication. But this table has a lot of numeric columns, and I’m willing to bet that the combination of Date_Rptd, DATE_OCC, TIME_OCC, Rpt_Dist_No, and Vict_Age will be unique for each distinct row. In my previous post, we discussed sorting using PROC SORT – but that procedure can do so much more! I can use PROC SORT with the NODUPKEY option to deduplicate this data if I use all of these numeric variables in my BY statement. I’ll also want to see the rows that get rejected, just so I can verify that my scheme is working; the DUPOUT= option handles that. The PROC SORT code looks like this:
proc sort data=crime nodupkey
          out=nodups dupout=dups;
   by _numeric_;
run;
 
_NUMERIC_ is a very handy SAS special variable list that refers to all of the numeric variables in a data set, so you don’t have to list them individually. The results are just what I had hoped:

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest |
| 2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest |
| 3 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest |
| 4 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest |
| 5 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest |
| 6 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest |
| 7 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest |
| 8 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest |
| 9 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest |
| 10 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest |
| 11 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest |
| 12 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest |
| 13 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest |
| 14 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest |
| 15 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest |

And the rejected rows are as expected:

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | f | o | ARREST |
| 2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arest |
| 3 | 2/4/2024 | 2/4/2024 | 0:18:20 | MISSION | 1908 | 37 | F | H | ADULT ARREST |
| 4 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | x | x | ARREST |
| 5 | 2/4/2024 | 2/4/2024 | 0:26:40 | HARBOR | 587 | 39 | F | W | ADULT ARREST |
| 6 | 2/1/2024 | 1/30/2024 | 0:27:25 | HOLLYWOOD | 622 | 52 | F | W | ADULT ARREST |
| 7 | 2/2/2024 | 2/2/2024 | 0:28:52 | WILSHIRE | 724 | 0 | X | X | ADULT ARREST |
| 8 | 2/6/2024 | 2/6/2024 | 0:30:15 | HARBOR | 567 | 45 | F | H | ADULT ARREST |
| 9 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywd | 645 | 25 | f | b | ARREST |
| 10 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | x | x | ARREST |
| 11 | 2/4/2024 | 2/4/2024 | 0:35:20 | OLYMPIC | 2055 | 22 | F | H | ADULT ARREST |
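As an aside, PROC SORT isn’t the only Base SAS way to keep one row per key. After sorting, a DATA step with BY-group processing can do the same job, which is handy when you want extra logic while deduplicating. Here’s a minimal sketch using the same numeric key (the data set names crime_sorted and nodups_ds are just names I made up; I spell out the BY variables so I can test FIRST. on the last one):

```sas
proc sort data=crime out=crime_sorted;
   by Date_Rptd DATE_OCC TIME_OCC Rpt_Dist_No Vict_Age;
run;

data nodups_ds;
   set crime_sorted;
   by Date_Rptd DATE_OCC TIME_OCC Rpt_Dist_No Vict_Age;
   /* FIRST.Vict_Age is 1 only on the first row of each unique
      combination of all five BY variables, so this subsetting IF
      keeps exactly one row per key */
   if first.Vict_Age;
run;
```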

 

But what if there weren’t enough numeric columns to create a distinct identity for each row? For example, table work.crime2:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | HOLLYWOOD | F | W | ADULT ARREST |
| 2 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 3 | 2/2/2024 | WILSHIRE | X | X | ADULT ARREST |
| 4 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 5 | 2/4/2024 | HARBOR | F | W | ADULT ARREST |
| 6 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 7 | 2/4/2024 | Harbor | F | H | Adult Arrest |
| 8 | 2/4/2024 | Hollywd | f | b | ARREST |
| 9 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 10 | 2/4/2024 | MISSION | F | H | ADULT ARREST |
| 11 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 12 | 2/4/2024 | OLYMPIC | F | H | ADULT ARREST |
| 13 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 14 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 15 | 2/4/2024 | Van Nuys | x | x | ARREST |
| 16 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 17 | 2/6/2024 | HARBOR | F | H | ADULT ARREST |
| 18 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 19 | 2/6/2024 | N Hollywood | F | W | Adult Arrest |
| 20 | 2/6/2024 | Wilshire | M | W | Adult Arrest |
| 21 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 22 | 2/7/2024 | Newton | M | A | Adult Arest |
| 23 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 24 | 2/7/2024 | Southwest | x | x | ARREST |
| 25 | 2/7/2024 | West LA | F | O | Adult Arrest |
| 26 | 2/7/2024 | West LA | f | o | ARREST |

 

The character columns still contain irregularities, but now the only numeric column is Date_Rptd – and its values are not unique. If I apply my previous trick of sorting BY _NUMERIC_, many rows are erroneously discarded as duplicates: instead of the expected 15 rows, the nodups data set includes only 6:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | HOLLYWOOD | F | W | ADULT ARREST |
| 2 | 2/2/2024 | WILSHIRE | X | X | ADULT ARREST |
| 3 | 2/4/2024 | HARBOR | F | W | ADULT ARREST |
| 4 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 5 | 2/6/2024 | HARBOR | F | H | ADULT ARREST |
| 6 | 2/7/2024 | Southwest | x | x | ARREST |

 

To ensure I remove only rows that are complete duplicates of another row, I’ll have to sort on the contents of all columns. SAS provides the handy _ALL_ special variable list, which refers to every variable in a data set and saves me from typing them all individually. But the data in those character columns is messy, so I’ll need to standardize it before relying on it for deduplication. In one of my previous posts, Coding for Data Quality in SAS Viya Part 2 – Standardization, I showcased some powerful tools available to SAS Viya programmers (and SAS 9 programmers who license SAS Data Quality) for standardizing data. In this case, I’ll use only Base SAS functionality in the DATA step, which works fine for the simple issues found in this small data set. Note that the DATA step uses yet another handy variable list, the name prefix list (here, _:):

data crime_std;
   set crime2;
   array charvars[*] _character_;
   /* Drop all variables with names starting with an underscore
      (here, the _i loop index below) */
   drop _:;
   /* Convert all character values to proper case */
   do _i=1 to dim(charvars);
      charvars[_i]=propcase(charvars[_i]);
   end;
   /* Correct the known spelling errors */
   Status_Desc=tranwrd(Status_Desc,'Arest','Arrest');
   AREA_NAME=tranwrd(AREA_NAME,'Hollywd','Hollywood');
   /* Standardize the truncated values for Adult Arrest */
   if Status_Desc='Arrest' then Status_Desc='Adult Arrest';
run;
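If presentation casing doesn’t matter to you, a simpler variation on the same idea is to fold everything to a single case with UPCASE instead of PROPCASE – the duplicates collapse just as well, the output is just shouty. A sketch (crime_upper is a name I made up for the output table):

```sas
data crime_upper;
   set crime2;
   array charvars[*] _character_;
   drop _:;
   /* Fold all character values to UPPER case */
   do _i=1 to dim(charvars);
      charvars[_i]=upcase(charvars[_i]);
   end;
   /* Same corrections as before, now in upper case */
   Status_Desc=tranwrd(Status_Desc,'AREST','ARREST');
   AREA_NAME=tranwrd(AREA_NAME,'HOLLYWD','HOLLYWOOD');
   if Status_Desc='ARREST' then Status_Desc='ADULT ARREST';
run;
```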

With the character data standardized, deduplication will be easy:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 2 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 3 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 4 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 5 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 6 | 2/4/2024 | Harbor | F | H | Adult Arrest |
| 7 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 8 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 9 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 10 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 11 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 12 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 13 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 14 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 15 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 16 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 17 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 18 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 19 | 2/6/2024 | N Hollywood | F | W | Adult Arrest |
| 20 | 2/6/2024 | Wilshire | M | W | Adult Arrest |
| 21 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 22 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 23 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 24 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 25 | 2/7/2024 | West La | F | O | Adult Arrest |
| 26 | 2/7/2024 | West La | F | O | Adult Arrest |

I’ll use PROC SORT NODUPKEY with a BY _ALL_ statement:

proc sort data=crime_std
          out=nodups
          dupout=dups
          nodupkey;
   by _all_;
run;
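Incidentally, PROC SORT also offers the NODUPRECS option (alias NODUP), which deletes an observation when it is identical to the one before it. Because sorting BY _ALL_ makes identical rows adjacent, this sketch should give the same result on this data – but keep in mind that NODUPRECS only compares consecutive rows, so NODUPKEY with BY _ALL_ is the more robust habit (nodups_rec and dups_rec are just names I made up):

```sas
proc sort data=crime_std
          out=nodups_rec
          dupout=dups_rec
          noduprecs;
   /* BY _ALL_ guarantees identical rows land next to each other,
      so NODUPRECS catches every duplicate */
   by _all_;
run;
```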

This produces the desired result and, as a bonus, the resulting data is much more presentable:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 2 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 3 | 2/4/2024 | Harbor | F | H | Adult Arrest |
| 4 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 5 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 6 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 7 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 8 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 9 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 10 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 11 | 2/6/2024 | N Hollywood | F | W | Adult Arrest |
| 12 | 2/6/2024 | Wilshire | M | W | Adult Arrest |
| 13 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 14 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 15 | 2/7/2024 | West La | F | O | Adult Arrest |

 

With the data pre-standardized, I can use either PROC SQL or PROC FedSQL to do my deduplicating, if desired. Both of these steps will produce results identical to the PROC SORT output:

proc sql;
/* Create the de-duplicated table */
create table nodups_sql as
select distinct * 
   from crime_std
;
quit;

proc FedSQL;
/* First, make sure the table does not exist. 
   FedSQL does not overwrite existing tables */
drop table nodups_FedSQL force;
/* Create the de-duplicated table */
create table nodups_FedSQL as
select distinct * 
   from crime_std
;
quit;
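And if I want SQL to show me which rows were duplicated (roughly the equivalent of DUPOUT=), grouping with a HAVING clause does the trick. A sketch: group by every column and keep only groups with more than one row (dup_check is a name I made up):

```sas
proc sql;
/* List each distinct row that occurs more than once,
   along with how many copies were found */
create table dup_check as
select distinct Date_Rptd, AREA_NAME, Vict_Sex,
       Vict_Descent, Status_Desc, count(*) as n_copies
   from crime_std
   group by Date_Rptd, AREA_NAME, Vict_Sex,
            Vict_Descent, Status_Desc
   having count(*) > 1;
quit;
```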

So today we covered some of the challenges and pitfalls of deduplicating data, and discovered how the special variable lists _NUMERIC_, _ALL_, and the name prefix list can make short work of long lists of variables when writing SAS programs. Do you have any favorite tips for deduplicating data? What other special variable lists have you found useful in your SAS code?


Until next time, may the SAS be with you!
Mark


Grab the ZIP file containing the code and data for this blog series from my GitHub: https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

Links to prior posts in this series:

