Good Monday Community,
I HAVE the following customers data, which is not sorted (I am sorting it here just to get the concept across):
ID_1 | ID_2 | State |
A123456 | 1ABCDE | FL |
A123456 | 1ABCDE | GA |
B984536 | 2HGJHW | NC |
C567342 | 3KJEWG | IL |
C567342 | 4UYMVS | IL |
D723410 | 5PPOAD | WY |
D723410 | 6NNTGC | AL |
I need to create a variable that identify the TYPE of duplicates in my data, so that I can deal with them accordingly. As you can tell from the above, that there are four types of duplicates as in the WANT data below:
ID_1 | ID_2 | State | Type |
A123456 | 1ABCDE | FL | Same IDs, Different States |
A123456 | 1ABCDE | GA | Same IDs, Different States |
B984536 | 2HGJHW | NC | Unique |
C567342 | 3KJEWG | IL | Same ID_1 and State, Different ID_2 |
C567342 | 4UYMVS | IL | Same ID_1 and State, Different ID_2 |
D723410 | 5PPOAD | WY | Same ID_1, Different ID_2 and State |
D723410 | 6NNTGC | AL | Same ID_1, Different ID_2 and State |
I run my code in SAS EG using Proc SQL, and I appreciate any help
Many thanks!
data have;
input (ID_1 ID_2 State) (:$20.);
cards;
A123456 1ABCDE FL
A123456 1ABCDE GA
B984536 2HGJHW NC
C567342 3KJEWG IL
C567342 4UYMVS IL
D723410 5PPOAD WY
D723410 6NNTGC AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;
This code is untested, please in future post test data in the form of a datastep!
proc sql; create table want as select a.*, case when exists(select id_1 from have where id_1=a.id_1 and state ne a.state)
then "Same IDs, Different States"
when exists(select id_1 from have where id_1=a.id_1 and state=a._state and id_2 ne a.id_2)
then "Same ID_1 and state, different ID_2"
/* Other checks here */
else "Unique" end as type
from have a;
quit;
This solution requires 1 sort and 1 data step:
data have;
input (ID_1 ID_2 State) (:$20.);
cards;
A123456 1ABCDE FL
A123456 1ABCDE GA
B984536 2HGJHW NC
C567342 3KJEWG IL
C567342 4UYMVS IL
D723410 5PPOAD WY
D723410 6NNTGC AL
;
proc sort data=have out=need;
by ID_1 ID_2 state;
run;
data want (drop=_:);
retain _ONE 1;
do until (last.id_1);
set need;
by id_1 id_2;
if _n_=1 and first.id_1 then do;
declare hash h (suminc:'_one');
h.definekey('state');
h.definedone();
end;
if first.id_1=1 and last.id_1=1 then continue; /*treated in later do loop*/
else if first.id_2=0 or last.id_2=0 then continue; /*treated in later do loop*/
else if h.find()^=0 then h.add(); /* Add the _ONE value for each instance */
end;
length matchtype $20;
do until (last.id_1);
set need;
by id_1 id_2;
if first.id_1=1 and last.id_1=1 then matchtype='Unique ID_1';
else if first.id_2=0 or last.id_2=0 then matchtype='Different State';
else do;
h.sum(sum:_freq);
if _freq>1 then matchtype='Different ID_2';
else matchtype='Diff ID_2 and State';
end;
output;
end;
h.clear();
run;
Thank you for your code. It works perfectly using the mock data. When I ran in my data, which I can not share of course, there are only two TYPES: 'Unique' and 'Different State'. The second TYPE is an error in almost all of the places picked. In these places, it shouldn't be 'Unique", but not 'Different State' either.
Any suggestions?
Many thanks
Note that you can get a detailed compilation of the categories you want in this way:
proc freq data=have;
tables ID_1 * ID_2 * state / noprint out=want (where=(count > 1) drop=percent);
run;
proc print data=want;
run;
Also note that your original data may contain differences that belong in more than one category. Here is an example of what you could encounter:
ID_1 | ID_2 | State |
A123456 | 1ABCDE | FL |
A123456 | 1ABCDE | GA |
B984536 | 1ABCDE | GA |
So how do you categorize the second observation? It resembles both the first and the third, but with different variables changing.
Thanks for your proc freq and proc print.
Eventually, this is what I am after: to know the breakdown of each group. My data contains multiple million records, and I am running into:
ERROR: The SAS System stopped processing this step because of insufficient memory.
As for your other question/suggestion, it is not possible to have the same ID_2 for two different ID_1. The data is cleaner in that regard.
OK, so the same ID_2 cannot appear under different ID_1 values. That still doesn't eliminate the problem:
ID_1 | ID_2 | State |
A123456 | 1ABCDE | FL |
A123456 | 1ABCDE | GA |
A123456 | 2ABCDE | GA |
Now the second observation belongs in two different categories, depending on whether you compare to observation 1 or observation 3.
You are right: theoretically, that is a possibility that can exist in the data.
Is the data completely unsorted? Or is it sorted by ID1, but not sorted within each ID1 group?
is your sample a good representative of your real meaning you have exactly the same number of columns i.e id1, id2 and state? Can you confirm
I have multiple columns in the HAVE data (let us say Col1-Col10), but in the WANT data I only need these columns in order to run a breakdown of the groups: ID_1, ID_2, State, and the new "TYPE" column.
Thanks.
data have;
input (ID_1 ID_2 State) (:$20.);
cards;
A123456 1ABCDE FL
A123456 1ABCDE GA
B984536 2HGJHW NC
C567342 3KJEWG IL
C567342 4UYMVS IL
D723410 5PPOAD WY
D723410 6NNTGC AL
;
proc sql;
create table want as
select id_1, id_2,state ,ifc(count( distinct state)>1, 'Same IDs, Different States',' ' ) as Type
from have
group by id_1,id_2
having type is not missing
union
select id_1,id_2, state ,ifc(count( distinct id_2)>1,'Same ID_1 and State, Different ID_2',' ') as Type
from have
group by id_1,state
having type is not missing
union
select id_1,id_2,state,ifc(count( distinct id_2)>1 and count( distinct state)>1,'Same ID_1, Different ID_2 and State',' ') as Type
from have
group by id_1
having type is not missing
union
select *, ifc(count( id_1)=1,'Unique',' ') as Type
from have
group by id_1
having type is not missing;
quit;
Thank you for the suggested code. It works perfectly in the mock data. When I use my data, which I can not share of course, the number of observations increased. Not sure why.
Any suggestions?
Thanks
Can you make a better sample of mock data that's a better representative so I can test again. Also, you could use @mkeintz 's code who has also contributed in the thread.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.