Solved: Detect Duplicate Observations

dapenDaniel · Posted 12-16-2018 01:54 AM

Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.

The left is what I have and the right one is what I want.

ID Var1 ID Var1

101 aa 101 aa

101 ab 101 ab

101 ac 101 ac

101 ad 101 ad

102 ca 102 ca

103 da 103 da

101 aa

101 ab

101 ac

101 ad

The code I am using is

Proc sort data=have out=want nodupkey;
by ID Var1;
run;

However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.

ballardw · Posted 12-17-2018 11:13 AM

@dapenDaniel wrote:

Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.

The left is what I have and the right one is what I want.

ID Var1 ID Var1

101 aa 101   aa

101 ab 101 ab

101 ac 101 ac

101 ad   101 ad

102 ca 102 ca

103 da   103 da

101 aa

101 ab

101 ac

101 ad

The code I am using is
Proc sort data=have out=want nodupkey;
by ID Var1;
run;
However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.

We can't tell if you used the wrong code because you have not shown what you actually got for a result. If you did not get the result you show it may mean that your data is not quite what you think. If a value has a leading space then ' aa' is not equal to 'aa' and would not be deleted. This is a potential issue with any of the variables on the BY statement. Another is if a value is numeric the assigned format might not be displaying decimal values that would show up as not equal.

Example of this last issue:

data example;
   input x;
   format x f1.;
datalines;
1.5
1.6
1.7
;
run;

If you print or view the Example data set created above all of the values will appear as 2 because of the format but the actual values are different.

This is one reason we recommend posting data as an actual data step. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

View solution in original post

PeterClemmensen · Posted 12-16-2018 02:59 AM

When I run

data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;

proc sort data=have out=want nodupkey;
   by ID Var1;
run;

The log says

The DATA to DATA Step Macro
Blog: SASnrd

Jagadishkatam · Posted 12-16-2018 04:19 AM

Alternatively with first. in the data step.

data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;

proc sort data=have;
by ID Var1;
run;

data want;
set have;
by ID Var1;
if first.Var1;
run;

Thanks,
Jag

novinosrin · Posted 12-16-2018 11:31 AM

data have;
input ID Var1 $;
datalines;
101 aa
101 ab
101 ac
101 ad
102 ca
103 da
101 aa
101 ab
101 ac
101 ad
;



proc sort data =have out= want noduprecs;
by id var1 ;
run;

/*or*/

proc sql;
create table want as
select distinct *
from have;
quit;

mkeintz · Posted 12-16-2018 12:58 PM

@novinosrin

Be careful recommending NODUPRECS ... because it only deletes duplicate records that are adjacent to each other. For instance the following proc sort does NOT delete any records, even though records 1 and 3 are duplicates:

data have;
  input id x y z;
datalines;
1 1 1 1
1 2 2 2
1 1 1 1
run;
proc sort data=have out=sorted_have noduprecs;
  by id;
run;

I suggest keeping nodupkey, but use the variable proxy _ALL_ as the last sort key, as in

proc sort data=have out=sorted_have_all nodupkeys;
  by id _ALL_;
run;

The technique here is to list all the desired primary sort keys first, and then append _ALL_ at the end of the by-list. It doesn't matter that ID is both the primary sort key, and also a subordinate sort key because it is part of _ALL_. The subordinate usage will have no impact on resulting sort order.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 12-17-2018 12:35 PM

Thank you @mkeintz . I know and love it when you've helped and continue to help me learn numerous times. What I assumed was OP wanting select unique records and my understanding of noduprecs( exclude duplicate records and pick one record) rather than dup keys which is akin to select distinct * in proc sql, and hence i posted that lazy version too.

Nonetheless when and if you participate, i tend to pay sincere attention and kept thinking. So basically I think

proc sort

by _all_

with a noduprecs

should be equal to

proc sort

by all variables listed one after another with a nodupkey

Also OP's requirement seems to pick a combination of all vars as unique obs than based keys unless of course all keys making it a composite combination.


data have;
  input id x y z;
datalines;
1 1 1 1
1 2 2 2
1 1 1 1
;
run;

proc sort noduprecs;
by _all_;
run;

I wonder if my understanding is correct, distinct sql is perhaps the easiest/laziest and convenient way to go 🙂 Btw, my apologies for the late acknowledgement as I slept like a baby all weekend and barely moved. I hope you have a nice day and week ahead.

mkeintz · Posted 12-17-2018 01:40 PM

@novinosrin:

Yes. To my knowledge the code

proc sort data=have out=want noduprecs;
  by _all_;
run;

produces the same results as

proc sql ;
  create table want as
  select distinct *
  from have;
quit;

And of course in the case of BY _ALL_, keywords NODUPKEY and NODUPRECS are the same.

The problem with BY _ALL_, as with SELECT *, is that, while duplicates are removed, you may not end up with the desired record order, for instance when ID is not the first variable in the data set. That's when you might choose to precede _ALL_ with the variables you want controlling the output order.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 12-17-2018 11:13 AM

@dapenDaniel wrote:

Hello. I have a data set with 2 variables (ID and Var1). There are some duplicates (red color in the following example) in this data sets.

The left is what I have and the right one is what I want.

ID Var1 ID Var1

101 aa 101   aa

101 ab 101 ab

101 ac 101 ac

101 ad   101 ad

102 ca 102 ca

103 da   103 da

101 aa

101 ab

101 ac

101 ad

The code I am using is
Proc sort data=have out=want nodupkey;
by ID Var1;
run;
However, I found that only one duplicate observation was deleted and the reminaing duplicates are still in the data set. Did I use the wrong code? Thanks.

We can't tell if you used the wrong code because you have not shown what you actually got for a result. If you did not get the result you show it may mean that your data is not quite what you think. If a value has a leading space then ' aa' is not equal to 'aa' and would not be deleted. This is a potential issue with any of the variables on the BY statement. Another is if a value is numeric the assigned format might not be displaying decimal values that would show up as not equal.

Example of this last issue:

data example;
   input x;
   format x f1.;
datalines;
1.5
1.6
1.7
;
run;

If you print or view the Example data set created above all of the values will appear as 2 because of the format but the actual values are different.

This is one reason we recommend posting data as an actual data step. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

dapenDaniel · Posted 12-17-2018 12:26 PM

Thank you. There is indeed some leading spaces. I have solved this problem.

Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations

Re: Detect Duplicate Observations