
Manipulating Data in Base SAS® Part 3 – Deduplicate

Started 06-07-2024
Modified 06-24-2024

Duplicates in our data can badly skew the results of our analysis. In this post, I’ll cover deduplicating data using PROC SORT with the NODUPKEY, OUT=, and DUPOUT= options. We’ll also look at using PROC SQL and PROC FedSQL for deduplication. If you have numeric columns whose combination makes each row unique, PROC SORT can deduplicate data even when the text values are messy. Otherwise, the text data must be standardized first so that identical values can be detected.

First, a peek at the data. We’ll be working with the crime table, shown below. The rows we need to remove as duplicates, and the reasons they should be discarded, are called out after the table. Of course, the actual crime table doesn’t come annotated with the reasons, or this would be a really easy task 😁

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest |
| 2 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest |
| 3 | 2/1/2024 | 1/30/2024 | 0:27:25 | HOLLYWOOD | 622 | 52 | F | W | ADULT ARREST |
| 4 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest |
| 5 | 2/2/2024 | 2/2/2024 | 0:28:52 | WILSHIRE | 724 | 0 | X | X | ADULT ARREST |
| 6 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest |
| 7 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | x | x | ARREST |
| 8 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest |
| 9 | 2/4/2024 | 2/4/2024 | 0:18:20 | MISSION | 1908 | 37 | F | H | ADULT ARREST |
| 10 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest |
| 11 | 2/4/2024 | 2/4/2024 | 0:26:40 | HARBOR | 587 | 39 | F | W | ADULT ARREST |
| 12 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest |
| 13 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest |
| 14 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywd | 645 | 25 | f | b | ARREST |
| 15 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest |
| 16 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest |
| 17 | 2/4/2024 | 2/4/2024 | 0:35:20 | OLYMPIC | 2055 | 22 | F | H | ADULT ARREST |
| 18 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest |
| 19 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest |
| 20 | 2/6/2024 | 2/6/2024 | 0:30:15 | HARBOR | 567 | 45 | F | H | ADULT ARREST |
| 21 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest |
| 22 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | f | o | ARREST |
| 23 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest |
| 24 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arest |
| 25 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest |
| 26 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | x | x | ARREST |


Rows 2 and 3, 4 and 5, 8 and 9, 10 and 11, 16 and 17, and 19 and 20 were probably duplicated due to inconsistent casing of the text values. Rows 14, 22, and 26 combine casing problems with a truncated Status_Desc value (“ARREST” instead of “Adult Arrest”), and row 14 also misspells the area name as “Hollywd”. Row 7 has a truncated text value for Status_Desc, but is otherwise identical to row 6. Row 24 has a misspelling in the value for Status_Desc, but is otherwise identical to row 23.

 

Our goal is to deduplicate this table, producing this final result:

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest |
| 2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest |
| 3 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest |
| 4 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest |
| 5 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest |
| 6 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest |
| 7 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest |
| 8 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest |
| 9 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest |
| 10 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest |
| 11 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest |
| 12 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest |
| 13 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest |
| 14 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest |
| 15 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest |

 

As I look over the data, I can see that I’d need to standardize the text fields before I could use them for deduplication. But this table has a lot of numeric columns, and I’m willing to bet that the combination of Date_Rptd, DATE_OCC, TIME_OCC, Rpt_Dist_No, and Vict_Age will be unique for each distinct row. In my previous post, we discussed sorting using PROC SORT – but that procedure can do so much more! I can use PROC SORT with the NODUPKEY option to deduplicate this data if I use all of these numeric variables in my BY statement. I’ll also want to see the rows that get rejected, just so I can verify that my scheme is working; the DUPOUT= option handles that. The PROC SORT code looks like this:
proc sort data=crime nodupkey
          out=nodups dupout=dups;
   by _numeric_;
run;
 
_NUMERIC_ is a very handy SAS special variable list that refers to all of the numeric variables in a data set, so you don’t have to list them individually. The results are just what I had hoped:

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | F | O | Adult Arrest |
| 2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arrest |
| 3 | 2/6/2024 | 2/6/2024 | 0:14:00 | Wilshire | 787 | 51 | M | W | Adult Arrest |
| 4 | 2/6/2024 | 1/21/2024 | 0:16:40 | N Hollywood | 1503 | 31 | F | W | Adult Arrest |
| 5 | 2/4/2024 | 2/4/2024 | 0:18:20 | Mission | 1908 | 37 | F | H | Adult Arrest |
| 6 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | X | X | Adult Arrest |
| 7 | 2/4/2024 | 2/4/2024 | 0:26:40 | Harbor | 587 | 39 | F | W | Adult Arrest |
| 8 | 2/1/2024 | 1/30/2024 | 0:27:25 | Hollywood | 622 | 52 | F | W | Adult Arrest |
| 9 | 2/5/2024 | 2/4/2024 | 0:28:40 | Northeast | 1152 | 33 | F | B | Juv Arrest |
| 10 | 2/2/2024 | 2/2/2024 | 0:28:52 | Wilshire | 724 | 0 | X | X | Adult Arrest |
| 11 | 2/6/2024 | 2/6/2024 | 0:30:15 | Harbor | 567 | 45 | F | H | Adult Arrest |
| 12 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywood | 645 | 25 | F | B | Adult Arrest |
| 13 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | X | X | Adult Arrest |
| 14 | 2/4/2024 | 2/4/2024 | 0:34:15 | Harbor | 504 | 0 | F | H | Adult Arrest |
| 15 | 2/4/2024 | 2/4/2024 | 0:35:20 | Olympic | 2055 | 22 | F | H | Adult Arrest |

And the rejected rows are as expected:

| Row | Date_Rptd | DATE_OCC | TIME_OCC | AREA_NAME | Rpt_Dist_No | Vict_Age | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|----------|----------|-----------|-------------|----------|----------|--------------|--------------|
| 1 | 2/7/2024 | 2/7/2024 | 0:00:15 | West LA | 853 | 55 | f | o | ARREST |
| 2 | 2/7/2024 | 2/7/2024 | 0:13:45 | Newton | 1309 | 0 | M | A | Adult Arest |
| 3 | 2/4/2024 | 2/4/2024 | 0:18:20 | MISSION | 1908 | 37 | F | H | ADULT ARREST |
| 4 | 2/7/2024 | 2/7/2024 | 0:23:20 | Southwest | 393 | 0 | x | x | ARREST |
| 5 | 2/4/2024 | 2/4/2024 | 0:26:40 | HARBOR | 587 | 39 | F | W | ADULT ARREST |
| 6 | 2/1/2024 | 1/30/2024 | 0:27:25 | HOLLYWOOD | 622 | 52 | F | W | ADULT ARREST |
| 7 | 2/2/2024 | 2/2/2024 | 0:28:52 | WILSHIRE | 724 | 0 | X | X | ADULT ARREST |
| 8 | 2/6/2024 | 2/6/2024 | 0:30:15 | HARBOR | 567 | 45 | F | H | ADULT ARREST |
| 9 | 2/4/2024 | 2/4/2024 | 0:30:45 | Hollywd | 645 | 25 | f | b | ARREST |
| 10 | 2/4/2024 | 2/3/2024 | 0:32:15 | Van Nuys | 926 | 0 | x | x | ARREST |
| 11 | 2/4/2024 | 2/4/2024 | 0:35:20 | OLYMPIC | 2055 | 22 | F | H | ADULT ARREST |
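As an aside, PROC SORT isn’t the only Base SAS way to keep one row per key. After sorting, a DATA step with BY-group processing can do the same job, which is handy when you want extra logic while deduplicating. Here’s a minimal sketch using the same numeric key (the data set names crime_sorted and nodups_ds are just names I made up; I spell out the BY variables so I can test FIRST. on the last one):

```sas
proc sort data=crime out=crime_sorted;
   by Date_Rptd DATE_OCC TIME_OCC Rpt_Dist_No Vict_Age;
run;

data nodups_ds;
   set crime_sorted;
   by Date_Rptd DATE_OCC TIME_OCC Rpt_Dist_No Vict_Age;
   /* FIRST.Vict_Age is 1 only on the first row of each unique
      combination of all five BY variables, so this subsetting IF
      keeps exactly one row per key */
   if first.Vict_Age;
run;
```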

 

But what if there weren’t enough numeric columns to create a distinct identity for each row? For example, table work.crime2:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | HOLLYWOOD | F | W | ADULT ARREST |
| 2 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 3 | 2/2/2024 | WILSHIRE | X | X | ADULT ARREST |
| 4 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 5 | 2/4/2024 | HARBOR | F | W | ADULT ARREST |
| 6 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 7 | 2/4/2024 | Harbor | F | H | Adult Arrest |
| 8 | 2/4/2024 | Hollywd | f | b | ARREST |
| 9 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 10 | 2/4/2024 | MISSION | F | H | ADULT ARREST |
| 11 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 12 | 2/4/2024 | OLYMPIC | F | H | ADULT ARREST |
| 13 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 14 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 15 | 2/4/2024 | Van Nuys | x | x | ARREST |
| 16 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 17 | 2/6/2024 | HARBOR | F | H | ADULT ARREST |
| 18 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 19 | 2/6/2024 | N Hollywood | F | W | Adult Arrest |
| 20 | 2/6/2024 | Wilshire | M | W | Adult Arrest |
| 21 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 22 | 2/7/2024 | Newton | M | A | Adult Arest |
| 23 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 24 | 2/7/2024 | Southwest | x | x | ARREST |
| 25 | 2/7/2024 | West LA | F | O | Adult Arrest |
| 26 | 2/7/2024 | West LA | f | o | ARREST |

 

The character columns still contain irregularities, but now the only numeric column is Date_Rptd – and its values are not unique. If I apply my previous trick of sorting BY _NUMERIC_, many rows are erroneously discarded as duplicates: instead of the expected 15 rows, the nodups data set includes only 6:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | HOLLYWOOD | F | W | ADULT ARREST |
| 2 | 2/2/2024 | WILSHIRE | X | X | ADULT ARREST |
| 3 | 2/4/2024 | HARBOR | F | W | ADULT ARREST |
| 4 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 5 | 2/6/2024 | HARBOR | F | H | ADULT ARREST |
| 6 | 2/7/2024 | Southwest | x | x | ARREST |

 

To ensure I remove only rows that are complete duplicates of another row, I’ll have to sort on the contents of all columns. SAS provides the handy _ALL_ special variable list, which refers to every variable in a data set and saves me from typing them all individually. But the data in those character columns is messy, so I’ll need to standardize it before relying on it for deduplication. In one of my previous posts, Coding for Data Quality in SAS Viya Part 2 – Standardization, I showcased some powerful tools available to SAS Viya programmers (and SAS 9 programmers who license SAS Data Quality) for standardizing data. In this case, I’ll use only Base SAS functionality in the DATA step, which works fine for the simple issues found in this small data set. Note that the DATA step uses yet another handy variable list, the name prefix list (here, _:):

data crime_std;
   set crime2;
   array charvars[*] _character_;
   /* Drop all variables with names starting with an underscore
      (here, the _i loop index below) */
   drop _:;
   /* Convert all character values to proper case */
   do _i=1 to dim(charvars);
      charvars[_i]=propcase(charvars[_i]);
   end;
   /* Correct the known spelling errors */
   Status_Desc=tranwrd(Status_Desc,'Arest','Arrest');
   AREA_NAME=tranwrd(AREA_NAME,'Hollywd','Hollywood');
   /* Standardize the truncated values for Adult Arrest */
   if Status_Desc='Arrest' then Status_Desc='Adult Arrest';
run;
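If presentation casing doesn’t matter to you, a simpler variation on the same idea is to fold everything to a single case with UPCASE instead of PROPCASE – the duplicates collapse just as well, the output is just shouty. A sketch (crime_upper is a name I made up for the output table):

```sas
data crime_upper;
   set crime2;
   array charvars[*] _character_;
   drop _:;
   /* Fold all character values to UPPER case */
   do _i=1 to dim(charvars);
      charvars[_i]=upcase(charvars[_i]);
   end;
   /* Same corrections as before, now in upper case */
   Status_Desc=tranwrd(Status_Desc,'AREST','ARREST');
   AREA_NAME=tranwrd(AREA_NAME,'HOLLYWD','HOLLYWOOD');
   if Status_Desc='ARREST' then Status_Desc='ADULT ARREST';
run;
```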

With the character data standardized, deduplication will be easy:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 2 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 3 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 4 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 5 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 6 | 2/4/2024 | Harbor | F | H | Adult Arrest |
| 7 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 8 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 9 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 10 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 11 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 12 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 13 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 14 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 15 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 16 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 17 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 18 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 19 | 2/6/2024 | N Hollywood | F | W | Adult Arrest |
| 20 | 2/6/2024 | Wilshire | M | W | Adult Arrest |
| 21 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 22 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 23 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 24 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 25 | 2/7/2024 | West La | F | O | Adult Arrest |
| 26 | 2/7/2024 | West La | F | O | Adult Arrest |

I’ll use PROC SORT NODUPKEY with a BY _ALL_ statement:

proc sort data=crime_std
          out=nodups
          dupout=dups
          nodupkey;
   by _all_;
run;
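Incidentally, PROC SORT also offers the NODUPRECS option (alias NODUP), which deletes an observation when it is identical to the one before it. Because sorting BY _ALL_ makes identical rows adjacent, this sketch should give the same result on this data – but keep in mind that NODUPRECS only compares consecutive rows, so NODUPKEY with BY _ALL_ is the more robust habit (nodups_rec and dups_rec are just names I made up):

```sas
proc sort data=crime_std
          out=nodups_rec
          dupout=dups_rec
          noduprecs;
   /* BY _ALL_ guarantees identical rows land next to each other,
      so NODUPRECS catches every duplicate */
   by _all_;
run;
```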

This produces the desired result and, as a bonus, the resulting data is much more presentable:

| Row | Date_Rptd | AREA_NAME | Vict_Sex | Vict_Descent | Status_Desc |
|-----|-----------|-----------|----------|--------------|--------------|
| 1 | 2/1/2024 | Hollywood | F | W | Adult Arrest |
| 2 | 2/2/2024 | Wilshire | X | X | Adult Arrest |
| 3 | 2/4/2024 | Harbor | F | H | Adult Arrest |
| 4 | 2/4/2024 | Harbor | F | W | Adult Arrest |
| 5 | 2/4/2024 | Hollywood | F | B | Adult Arrest |
| 6 | 2/4/2024 | Mission | F | H | Adult Arrest |
| 7 | 2/4/2024 | Olympic | F | H | Adult Arrest |
| 8 | 2/4/2024 | Van Nuys | X | X | Adult Arrest |
| 9 | 2/5/2024 | Northeast | F | B | Juv Arrest |
| 10 | 2/6/2024 | Harbor | F | H | Adult Arrest |
| 11 | 2/6/2024 | N Hollywood | F | W | Adult Arrest |
| 12 | 2/6/2024 | Wilshire | M | W | Adult Arrest |
| 13 | 2/7/2024 | Newton | M | A | Adult Arrest |
| 14 | 2/7/2024 | Southwest | X | X | Adult Arrest |
| 15 | 2/7/2024 | West La | F | O | Adult Arrest |

 

With the data pre-standardized, I can use either PROC SQL or PROC FedSQL to do my deduplicating, if desired. Both of these steps will produce results identical to the PROC SORT output:

proc sql;
/* Create the de-duplicated table */
create table nodups_sql as
select distinct * 
   from crime_std
;
quit;

proc FedSQL;
/* First, make sure the table does not exist. 
   FedSQL does not overwrite existing tables */
drop table nodups_FedSQL force;
/* Create the de-duplicated table */
create table nodups_FedSQL as
select distinct * 
   from crime_std
;
quit;
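And if I want SQL to show me which rows were duplicated (roughly the equivalent of DUPOUT=), grouping with a HAVING clause does the trick. A sketch: group by every column and keep only groups with more than one row (dup_check is a name I made up):

```sas
proc sql;
/* List each distinct row that occurs more than once,
   along with how many copies were found */
create table dup_check as
select distinct Date_Rptd, AREA_NAME, Vict_Sex,
       Vict_Descent, Status_Desc, count(*) as n_copies
   from crime_std
   group by Date_Rptd, AREA_NAME, Vict_Sex,
            Vict_Descent, Status_Desc
   having count(*) > 1;
quit;
```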

So today we covered some of the challenges and pitfalls of deduplicating data, and discovered how the special variable lists _NUMERIC_, _ALL_, and the name prefix list can make short work of long lists of variables when writing SAS programs. Do you have any favorite tips for deduplicating data? What other special variable lists have you found useful in your SAS code?


Until next time, may the SAS be with you!
Mark


Grab the ZIP file containing the code and data for this blog series from my GitHub: https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

Links to prior posts in this series:

