Hello,
I'm working on a data quality problem, and more specifically detecting duplicates with fuzzy logic using COMPGED and COMPLEV functions based on this paper.
I have detected duplicates, but per couple.
example:
assuming that customers with customer_id equals 01, 02 and 03 reference the same customer (let's call him John), my SAS program detects it this way:
row_num | duplicate_id
1 | 01-02
2 | 01-03
3 | 02-03
where duplicate_id is an id that I've created combining both customer ids to reference a couple of duplicates.
So what I want now is to have a unique id for John !
The final result that I want is :
duplicate_id | unique_id
01-02 | 01
01-03 | 01
02-03 | 01
I had the idea to use undirected graph (Operational Research algorithm) but I'm not sure I have the licence for it, and I'm not sure if I can associate a new id even if it works.
the first example in this paper that uses hash objects can be a first step to find the solution, but the expected result is not what I want.
Any ideas? Thanks!
If you don't have the SAS/OR licence, check out this wonderful article about a way to find connected components without SAS/OR:
"(Operational Research algorithm) but I'm not sure I have the licence for it"
Submit
PROC SETINIT;
RUN;
You should see a line a log that reads SAS/OR, otherwise, you haven't it licensed.
Thanks for your answer.
I checked and I don't have licence for it..
MK
If you don't have the SAS/OR licence, check out this wonderful article about a way to find connected components without SAS/OR:
Thank you for finding the : problem. I revised the article. Cheers!
I don't know about SAS/OR but if you can create a dataset of pairs you can do this in base SAS
data pair;
input pair $;
datalines;
01-02
01-03
02-03
04-05
04-06
05-06
;
run;
data new(drop=pair1 pair2);
set pair;
retain unique pair1 pair2;
if _n_=1 then do; unique=1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
else if indexw(translate(pair," ","-"),pair1)<=0 and indexw(translate(pair," ","-"),pair2)<=0 then do; unique=unique+1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
run;
Thanks @ndp for your answer,
It worked for your example, but when adding another observation 07-01 to the data I don't get the same unique id as the others.
it would work if I added 01-07 instead, and then sorted the table.
I didn't verify the program with all the prequisites, but it's still a good solution.
Thanks a lot !
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.