Hello,
I'm working on a data quality problem, and more specifically detecting duplicates with fuzzy logic using COMPGED and COMPLEV functions based on this paper.
I have detected duplicates, but per couple.
example:
assuming that customers with customer_id equals 01, 02 and 03 reference the same customer (let's call him John), my SAS program detects it this way:
row_num | duplicate_id
1 | 01-02
2 | 01-03
3 | 02-03
where duplicate_id is an id that I've created combining both customer ids to reference a couple of duplicates.
So what I want now is to have a unique id for John !
The final result that I want is :
duplicate_id | unique_id
01-02 | 01
01-03 | 01
02-03 | 01
I had the idea to use undirected graph (Operational Research algorithm) but I'm not sure I have the licence for it, and I'm not sure if I can associate a new id even if it works.
the first example in this paper that uses hash objects can be a first step to find the solution, but the expected result is not what I want.
Any ideas? Thanks!
If you don't have the SAS/OR licence, check out this wonderful article about a way to find connected components without SAS/OR:
"(Operational Research algorithm) but I'm not sure I have the licence for it"
Submit
PROC SETINIT;
RUN;
You should see a line a log that reads SAS/OR, otherwise, you haven't it licensed.
Thanks for your answer.
I checked and I don't have licence for it..
MK
If you don't have the SAS/OR licence, check out this wonderful article about a way to find connected components without SAS/OR:
Thank you for finding the : problem. I revised the article. Cheers!
I don't know about SAS/OR but if you can create a dataset of pairs you can do this in base SAS
data pair;
input pair $;
datalines;
01-02
01-03
02-03
04-05
04-06
05-06
;
run;
data new(drop=pair1 pair2);
set pair;
retain unique pair1 pair2;
if _n_=1 then do; unique=1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
else if indexw(translate(pair," ","-"),pair1)<=0 and indexw(translate(pair," ","-"),pair2)<=0 then do; unique=unique+1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
run;
Thanks @ndp for your answer,
It worked for your example, but when adding another observation 07-01 to the data I don't get the same unique id as the others.
it would work if I added 01-07 instead, and then sorted the table.
I didn't verify the program with all the prequisites, but it's still a good solution.
Thanks a lot !
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.