BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
altijani
Quartz | Level 8

Good Monday Community,

 

I HAVE the following customers data, which is not sorted (I am sorting it here just to get the concept across):

 

ID_1ID_2State
A1234561ABCDEFL
A1234561ABCDEGA
B9845362HGJHWNC
C5673423KJEWGIL
C5673424UYMVSIL
D7234105PPOADWY
D7234106NNTGCAL

 

I need to create a variable that identify the TYPE of duplicates in my data, so that I can deal with them accordingly. As you can tell from the above, that there are four types of duplicates as in the WANT data below:

 

ID_1ID_2StateType
A1234561ABCDEFLSame IDs, Different States
A1234561ABCDEGASame IDs, Different States
B9845362HGJHWNCUnique
C5673423KJEWGILSame ID_1 and State, Different ID_2
C5673424UYMVSILSame ID_1 and State, Different ID_2
D7234105PPOADWYSame ID_1, Different ID_2 and State
D7234106NNTGCALSame ID_1, Different ID_2 and State

 

I run my code in SAS EG using Proc SQL, and I appreciate any help

 

Many thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20
data have;
input (ID_1	ID_2	State) (:$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union 
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union 
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;

View solution in original post

14 REPLIES 14
RW9
Diamond | Level 26 RW9
Diamond | Level 26

This code is untested, please in future post test data in the form of a datastep!

proc sql;
  create table want as
  select a.*,
  case   when exists(select id_1 from have where id_1=a.id_1 and state ne a.state) 
then "Same IDs, Different States"
when exists(select id_1 from have where id_1=a.id_1 and state=a._state and id_2 ne a.id_2)
then "Same ID_1 and state, different ID_2"
/* Other checks here */
else "Unique" end as type
from have a;
quit;
mkeintz
PROC Star

This solution requires 1 sort and 1 data step:

 

data have;
input (ID_1	ID_2	State) (:$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;

proc sort data=have out=need;
  by ID_1 ID_2 state;
run;

data want (drop=_:);
  retain _ONE 1;
  do until (last.id_1);
    set need;
    by id_1 id_2;
    if _n_=1 and first.id_1 then do;
      declare hash h (suminc:'_one');
        h.definekey('state');
        h.definedone();
    end;
    if first.id_1=1 and last.id_1=1 then continue;  /*treated in later do loop*/
    else if first.id_2=0 or last.id_2=0 then continue; /*treated in later do loop*/
    else if h.find()^=0 then h.add();  /* Add the _ONE value for each instance */
  end;
  length matchtype $20;
  do until (last.id_1);
    set need;
    by id_1 id_2;
    if first.id_1=1 and last.id_1=1 then matchtype='Unique ID_1';
    else if first.id_2=0 or last.id_2=0 then matchtype='Different State';
    else do;
      h.sum(sum:_freq);
      if _freq>1 then matchtype='Different ID_2';
      else matchtype='Diff ID_2 and State';
    end;
    output;
  end;
  h.clear();
run;

 

  1. The sort by ID_1 ID_2 allows for straightforward identification of the completely "unique ID_1" and the "different state" matchtypes in the data step.  Hence the "continue" for these conditions in the first DO loop.
  2. All other observations are recorded in the hash object H, keyed only on STATE. 
    This hash is declared with the SUMINC:'_ONE' argument, which maintains a sum for each key.  Specifically it adds the value of variable _ONE (which need not always=1)  for the first ADD method, and all subsequent FIND methods for each key.  I.e. it adds the value for every successful ADD and FIND method.

    This summary is retrieved into variable _FREQ in the later use of the .SUM method.
  3. In the second loop
    1. The easy-to-determine "unique ID_1" and "different state" records are identified
    2. For other records the hash object for that value of STATE is examined to get the frequency of access to each STATE.  If _freq>2 then match type is "different ID_2", otherwise "different ID_2 and state".
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
altijani
Quartz | Level 8

Thank you for your code. It works perfectly using the mock data. When I ran in my data, which I can not share of course, there are only two TYPES: 'Unique' and 'Different State'. The second TYPE is an error in almost all of the places picked. In these places, it shouldn't be 'Unique", but not 'Different State' either.

Any suggestions?

Many thanks

Astounding
PROC Star

Note that you can get a detailed compilation of the categories you want in this way:

 

proc freq data=have;

tables ID_1 * ID_2 * state / noprint out=want (where=(count > 1) drop=percent);

run;

 

proc print data=want;

run;

 

Also note that your original data may contain differences that belong in more than one category.  Here is an example of what you could encounter:

 

ID_1 ID_2 State
A123456 1ABCDE FL
A123456 1ABCDE GA
B984536 1ABCDE GA

 

So how do you categorize the second observation?  It resembles both the first and the third, but with different variables changing.

altijani
Quartz | Level 8

Thanks for your proc freq and proc print.

Eventually, this is what I am after: to know the breakdown of each group. My data contains multiple million records, and I am running into:

ERROR: The SAS System stopped processing this step because of insufficient memory.

 

As for your other question/suggestion, it is not possible to have the same ID_2 for two different ID_1. The data is cleaner in that regard.

Astounding
PROC Star

OK, so the same ID_2 cannot appear under different ID_1 values.  That still doesn't eliminate the problem:

 

ID_1 ID_2 State
A123456 1ABCDE FL
A123456 1ABCDE GA
A123456 2ABCDE GA

 

Now the second observation belongs in two different categories, depending on whether you compare to observation 1 or observation 3.

altijani
Quartz | Level 8

You are right: theoretically, that is a possibility that can exist in the data.

mkeintz
PROC Star

Is the data completely unsorted?   Or is it sorted by ID1, but not sorted within each ID1 group?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
altijani
Quartz | Level 8
Completely unsorted.


novinosrin
Tourmaline | Level 20

is your sample a good representative of your real meaning you have exactly the same number of columns i.e id1, id2 and state? Can you confirm

altijani
Quartz | Level 8

I have multiple columns in the HAVE data (let us say Col1-Col10), but in the WANT data I only need these columns in order to run a breakdown of the groups: ID_1, ID_2, State, and the new "TYPE" column.

 

Thanks.

novinosrin
Tourmaline | Level 20
data have;
input (ID_1	ID_2	State) (:$20.);
cards;
A123456	1ABCDE	FL
A123456	1ABCDE	GA
B984536	2HGJHW	NC
C567342	3KJEWG	IL
C567342	4UYMVS	IL
D723410	5PPOAD	WY
D723410	6NNTGC	AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union 
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union 
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;
altijani
Quartz | Level 8

Thank you for the suggested code. It works perfectly in the mock data. When I use my data, which I can not share of course, the number of observations increased. Not sure why.

Any suggestions?

Thanks

novinosrin
Tourmaline | Level 20

Can you make a better sample of mock data that's a better representative so I can test again. Also, you could use @mkeintz 's code who has also contributed in the thread. 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 5342 views
  • 0 likes
  • 5 in conversation