BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Quartz | Level 8

## Detect Duplicate Observations by Different Grouping

Good Monday Community,

I HAVE the following customers data, which is not sorted (I am sorting it here just to get the concept across):

 ID_1 ID_2 State A123456 1ABCDE FL A123456 1ABCDE GA B984536 2HGJHW NC C567342 3KJEWG IL C567342 4UYMVS IL D723410 5PPOAD WY D723410 6NNTGC AL

I need to create a variable that identify the TYPE of duplicates in my data, so that I can deal with them accordingly. As you can tell from the above, that there are four types of duplicates as in the WANT data below:

 ID_1 ID_2 State Type A123456 1ABCDE FL Same IDs, Different States A123456 1ABCDE GA Same IDs, Different States B984536 2HGJHW NC Unique C567342 3KJEWG IL Same ID_1 and State, Different ID_2 C567342 4UYMVS IL Same ID_1 and State, Different ID_2 D723410 5PPOAD WY Same ID_1, Different ID_2 and State D723410 6NNTGC AL Same ID_1, Different ID_2 and State

I run my code in SAS EG using Proc SQL, and I appreciate any help

Many thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Tourmaline | Level 20

## Re: Detect Duplicate Observations by Different Grouping

``````data have;
input (ID_1	ID_2	State) (:\$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;``````
14 REPLIES 14
Diamond | Level 26

## Re: Detect Duplicate Observations by Different Grouping

This code is untested, please in future post test data in the form of a datastep!

```proc sql;
create table want as
select a.*,
case   when exists(select id_1 from have where id_1=a.id_1 and state ne a.state)            then "Same IDs, Different States"         when exists(select id_1 from have where id_1=a.id_1 and state=a._state and id_2 ne a.id_2)            then "Same ID_1 and state, different ID_2"         /* Other checks here */         else "Unique" end as type  from   have a;quit;```
PROC Star

## Re: Detect Duplicate Observations by Different Grouping

This solution requires 1 sort and 1 data step:

``````data have;
input (ID_1	ID_2	State) (:\$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;

proc sort data=have out=need;
by ID_1 ID_2 state;
run;

data want (drop=_:);
retain _ONE 1;
do until (last.id_1);
set need;
by id_1 id_2;
if _n_=1 and first.id_1 then do;
declare hash h (suminc:'_one');
h.definekey('state');
h.definedone();
end;
if first.id_1=1 and last.id_1=1 then continue;  /*treated in later do loop*/
else if first.id_2=0 or last.id_2=0 then continue; /*treated in later do loop*/
else if h.find()^=0 then h.add();  /* Add the _ONE value for each instance */
end;
length matchtype \$20;
do until (last.id_1);
set need;
by id_1 id_2;
if first.id_1=1 and last.id_1=1 then matchtype='Unique ID_1';
else if first.id_2=0 or last.id_2=0 then matchtype='Different State';
else do;
h.sum(sum:_freq);
if _freq>1 then matchtype='Different ID_2';
else matchtype='Diff ID_2 and State';
end;
output;
end;
h.clear();
run;``````

1. The sort by ID_1 ID_2 allows for straightforward identification of the completely "unique ID_1" and the "different state" matchtypes in the data step.  Hence the "continue" for these conditions in the first DO loop.
2. All other observations are recorded in the hash object H, keyed only on STATE.
This hash is declared with the SUMINC:'_ONE' argument, which maintains a sum for each key.  Specifically it adds the value of variable _ONE (which need not always=1)  for the first ADD method, and all subsequent FIND methods for each key.  I.e. it adds the value for every successful ADD and FIND method.

This summary is retrieved into variable _FREQ in the later use of the .SUM method.
3. In the second loop
1. The easy-to-determine "unique ID_1" and "different state" records are identified
2. For other records the hash object for that value of STATE is examined to get the frequency of access to each STATE.  If _freq>2 then match type is "different ID_2", otherwise "different ID_2 and state".
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Quartz | Level 8

## Re: Detect Duplicate Observations by Different Grouping

Thank you for your code. It works perfectly using the mock data. When I ran in my data, which I can not share of course, there are only two TYPES: 'Unique' and 'Different State'. The second TYPE is an error in almost all of the places picked. In these places, it shouldn't be 'Unique", but not 'Different State' either.

Any suggestions?

Many thanks

PROC Star

## Re: Detect Duplicate Observations by Different Grouping

Note that you can get a detailed compilation of the categories you want in this way:

proc freq data=have;

tables ID_1 * ID_2 * state / noprint out=want (where=(count > 1) drop=percent);

run;

proc print data=want;

run;

Also note that your original data may contain differences that belong in more than one category.  Here is an example of what you could encounter:

 ID_1 ID_2 State A123456 1ABCDE FL A123456 1ABCDE GA B984536 1ABCDE GA

So how do you categorize the second observation?  It resembles both the first and the third, but with different variables changing.

Quartz | Level 8

## Re: Detect Duplicate Observations by Different Grouping

Thanks for your proc freq and proc print.

Eventually, this is what I am after: to know the breakdown of each group. My data contains multiple million records, and I am running into:

ERROR: The SAS System stopped processing this step because of insufficient memory.

As for your other question/suggestion, it is not possible to have the same ID_2 for two different ID_1. The data is cleaner in that regard.

PROC Star

## Re: Detect Duplicate Observations by Different Grouping

OK, so the same ID_2 cannot appear under different ID_1 values.  That still doesn't eliminate the problem:

 ID_1 ID_2 State A123456 1ABCDE FL A123456 1ABCDE GA A123456 2ABCDE GA

Now the second observation belongs in two different categories, depending on whether you compare to observation 1 or observation 3.

Quartz | Level 8

## Re: Detect Duplicate Observations by Different Grouping

You are right: theoretically, that is a possibility that can exist in the data.

PROC Star

## Re: Detect Duplicate Observations by Different Grouping

Is the data completely unsorted?   Or is it sorted by ID1, but not sorted within each ID1 group?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Quartz | Level 8

## Re: Detect Duplicate Observations by Different Grouping

Completely unsorted.

Tourmaline | Level 20

## Re: Detect Duplicate Observations by Different Grouping

is your sample a good representative of your real meaning you have exactly the same number of columns i.e id1, id2 and state? Can you confirm

Quartz | Level 8

## Re: Detect Duplicate Observations by Different Grouping

I have multiple columns in the HAVE data (let us say Col1-Col10), but in the WANT data I only need these columns in order to run a breakdown of the groups: ID_1, ID_2, State, and the new "TYPE" column.

Thanks.

Tourmaline | Level 20

## Re: Detect Duplicate Observations by Different Grouping

``````data have;
input (ID_1	ID_2	State) (:\$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;``````
Quartz | Level 8

## Re: Detect Duplicate Observations by Different Grouping

Thank you for the suggested code. It works perfectly in the mock data. When I use my data, which I can not share of course, the number of observations increased. Not sure why.

Any suggestions?

Thanks

Tourmaline | Level 20

## Re: Detect Duplicate Observations by Different Grouping

Can you make a better sample of mock data that's a better representative so I can test again. Also, you could use @mkeintz 's code who has also contributed in the thread.

Discussion stats
• 14 replies
• 2282 views
• 0 likes
• 5 in conversation