Solved: Re: Detect Duplicate Observations by Different Grouping

altijani · Posted 10-29-2018 07:57 AM

Good Monday Community,

I HAVE the following customers data, which is not sorted (I am sorting it here just to get the concept across):

ID_1	ID_2	State
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL

I need to create a variable that identify the TYPE of duplicates in my data, so that I can deal with them accordingly. As you can tell from the above, that there are four types of duplicates as in the WANT data below:

ID_1	ID_2	State	Type
A123456	1ABCDE	FL	Same IDs, Different States
A123456	1ABCDE	GA	Same IDs, Different States
B984536	2HGJHW	NC	Unique
C567342	3KJEWG	IL	Same ID_1 and State, Different ID_2
C567342	4UYMVS	IL	Same ID_1 and State, Different ID_2
D723410	5PPOAD	WY	Same ID_1, Different ID_2 and State
D723410	6NNTGC	AL	Same ID_1, Different ID_2 and State

I run my code in SAS EG using Proc SQL, and I appreciate any help

Many thanks!

novinosrin · Posted 10-29-2018 12:20 PM

data have;
input (ID_1	ID_2	State) (:$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union 
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union 
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;

View solution in original post

RW9 · Posted 10-29-2018 08:04 AM

This code is untested, please in future post test data in the form of a datastep!

proc sql;
  create table want as
  select a.*,
  case   when exists(select id_1 from have where id_1=a.id_1 and state ne a.state) 
           then "Same IDs, Different States"
         when exists(select id_1 from have where id_1=a.id_1 and state=a._state and id_2 ne a.id_2) 
           then "Same ID_1 and state, different ID_2"
         /* Other checks here */
         else "Unique" end as type
  from   have a;
quit;

mkeintz · Posted 10-29-2018 02:58 PM

This solution requires 1 sort and 1 data step:

data have;
input (ID_1	ID_2	State) (:$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;

proc sort data=have out=need;
  by ID_1 ID_2 state;
run;

data want (drop=_:);
  retain _ONE 1;
  do until (last.id_1);
    set need;
    by id_1 id_2;
    if _n_=1 and first.id_1 then do;
      declare hash h (suminc:'_one');
        h.definekey('state');
        h.definedone();
    end;
    if first.id_1=1 and last.id_1=1 then continue;  /*treated in later do loop*/
    else if first.id_2=0 or last.id_2=0 then continue; /*treated in later do loop*/
    else if h.find()^=0 then h.add();  /* Add the _ONE value for each instance */
  end;
  length matchtype $20;
  do until (last.id_1);
    set need;
    by id_1 id_2;
    if first.id_1=1 and last.id_1=1 then matchtype='Unique ID_1';
    else if first.id_2=0 or last.id_2=0 then matchtype='Different State';
    else do;
      h.sum(sum:_freq);
      if _freq>1 then matchtype='Different ID_2';
      else matchtype='Diff ID_2 and State';
    end;
    output;
  end;
  h.clear();
run;

The sort by ID_1 ID_2 allows for straightforward identification of the completely "unique ID_1" and the "different state" matchtypes in the data step. Hence the "continue" for these conditions in the first DO loop.
All other observations are recorded in the hash object H, keyed only on STATE.
This hash is declared with the SUMINC:'_ONE' argument, which maintains a sum for each key. Specifically it adds the value of variable _ONE (which need not always=1) for the first ADD method, and all subsequent FIND methods for each key. I.e. it adds the value for every successful ADD and FIND method.

This summary is retrieved into variable _FREQ in the later use of the .SUM method.
In the second loop
1. The easy-to-determine "unique ID_1" and "different state" records are identified
2. For other records the hash object for that value of STATE is examined to get the frequency of access to each STATE. If _freq>2 then match type is "different ID_2", otherwise "different ID_2 and state".

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

altijani · Posted 10-30-2018 02:47 PM

Thank you for your code. It works perfectly using the mock data. When I ran in my data, which I can not share of course, there are only two TYPES: 'Unique' and 'Different State'. The second TYPE is an error in almost all of the places picked. In these places, it shouldn't be 'Unique", but not 'Different State' either.

Any suggestions?

Many thanks

Astounding · Posted 10-29-2018 10:21 AM

Note that you can get a detailed compilation of the categories you want in this way:

proc freq data=have;

tables ID_1 * ID_2 * state / noprint out=want (where=(count > 1) drop=percent);

run;

proc print data=want;

run;

Also note that your original data may contain differences that belong in more than one category. Here is an example of what you could encounter:

ID_1	ID_2	State
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	1ABCDE	GA

So how do you categorize the second observation? It resembles both the first and the third, but with different variables changing.

altijani · Posted 10-29-2018 11:01 AM

Thanks for your proc freq and proc print.

Eventually, this is what I am after: to know the breakdown of each group. My data contains multiple million records, and I am running into:

ERROR: The SAS System stopped processing this step because of insufficient memory.

As for your other question/suggestion, it is not possible to have the same ID_2 for two different ID_1. The data is cleaner in that regard.

Astounding · Posted 10-29-2018 11:14 AM

OK, so the same ID_2 cannot appear under different ID_1 values. That still doesn't eliminate the problem:

ID_1	ID_2	State
A123456	1ABCDE	FL
A123456	1ABCDE	GA
A123456	2ABCDE	GA

Now the second observation belongs in two different categories, depending on whether you compare to observation 1 or observation 3.

altijani · Posted 10-29-2018 11:19 AM

You are right: theoretically, that is a possibility that can exist in the data.

mkeintz · Posted 10-29-2018 10:39 AM

Is the data completely unsorted? Or is it sorted by ID1, but not sorted within each ID1 group?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

altijani · Posted 10-29-2018 10:44 AM

Completely unsorted.

novinosrin · Posted 10-29-2018 11:08 AM

is your sample a good representative of your real meaning you have exactly the same number of columns i.e id1, id2 and state? Can you confirm

altijani · Posted 10-29-2018 11:15 AM

I have multiple columns in the HAVE data (let us say Col1-Col10), but in the WANT data I only need these columns in order to run a breakdown of the groups: ID_1, ID_2, State, and the new "TYPE" column.

Thanks.

novinosrin · Posted 10-29-2018 12:20 PM

data have;
input (ID_1	ID_2	State) (:$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union 
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union 
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;

altijani · Posted 10-30-2018 02:34 PM

Thank you for the suggested code. It works perfectly in the mock data. When I use my data, which I can not share of course, the number of observations increased. Not sure why.

Any suggestions?

Thanks

novinosrin · Posted 10-30-2018 02:35 PM

Can you make a better sample of mock data that's a better representative so I can test again. Also, you could use @mkeintz 's code who has also contributed in the thread.

Classroom Training Available!