Hello,
I'm working on a data quality problem, and more specifically detecting duplicates with fuzzy logic using COMPGED and COMPLEV functions based on this paper.
I have detected duplicates, but per couple.
example:
assuming that customers with customer_id equals 01, 02 and 03 reference the same customer (let's call him John), my SAS program detects it this way:
row_num | duplicate_id
1 | 01-02
2 | 01-03
3 | 02-03
where duplicate_id is an id that I've created combining both customer ids to reference a couple of duplicates.
So what I want now is to have a unique id for John !
The final result that I want is :
duplicate_id | unique_id
01-02 | 01
01-03 | 01
02-03 | 01
I had the idea to use undirected graph (Operational Research algorithm) but I'm not sure I have the licence for it, and I'm not sure if I can associate a new id even if it works.
the first example in this paper that uses hash objects can be a first step to find the solution, but the expected result is not what I want.
Any ideas? Thanks!
If you don't have the SAS/OR licence, check out this wonderful article about a way to find connected components without SAS/OR:
"(Operational Research algorithm) but I'm not sure I have the licence for it"
Submit
PROC SETINIT;
RUN;
You should see a line a log that reads SAS/OR, otherwise, you haven't it licensed.
Thanks for your answer.
I checked and I don't have licence for it..
MK
If you don't have the SAS/OR licence, check out this wonderful article about a way to find connected components without SAS/OR:
Thank you for finding the : problem. I revised the article. Cheers!
I don't know about SAS/OR but if you can create a dataset of pairs you can do this in base SAS
data pair;
input pair $;
datalines;
01-02
01-03
02-03
04-05
04-06
05-06
;
run;
data new(drop=pair1 pair2);
set pair;
retain unique pair1 pair2;
if _n_=1 then do; unique=1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
else if indexw(translate(pair," ","-"),pair1)<=0 and indexw(translate(pair," ","-"),pair2)<=0 then do; unique=unique+1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
run;
Thanks @ndp for your answer,
It worked for your example, but when adding another observation 07-01 to the data I don't get the same unique id as the others.
it would work if I added 01-07 instead, and then sorted the table.
I didn't verify the program with all the prequisites, but it's still a good solution.
Thanks a lot !
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.