BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
majdi_ka
Obsidian | Level 7

Hello,

I'm working on a data quality problem, and more specifically detecting duplicates with fuzzy logic using COMPGED and COMPLEV functions based on  this paper.

 

I have detected duplicates, but per couple.
example:
assuming that customers with customer_id equals 01, 02 and 03 reference the same customer (let's call him John), my SAS program detects it this way:

 

row_num | duplicate_id 
1 | 01-02 
2 | 01-03 
3 | 02-03

where duplicate_id is an id that I've created combining both customer ids to reference a couple of duplicates.

 

So what I want now is to have a unique id for John !

The final result that I want is :
duplicate_id | unique_id
01-02 | 01
01-03 | 01
02-03 | 01

 

I had the idea to use undirected graph (Operational Research algorithm) but I'm not sure I have the licence for it, and I'm not sure if I can associate a new id even if it works.

 

the first example in this paper that uses hash objects can be a first step to find the solution, but the expected result is not what I want.

 

Any ideas? Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

If you don't have the SAS/OR licence, check out this wonderful Smiley Happy article about a way to find connected components without SAS/OR:

 

https://communities.sas.com/t5/SAS-Communities-Library/How-to-find-all-connected-components-in-a-gra...

 

 

PG

View solution in original post

7 REPLIES 7
LinusH
Tourmaline | Level 20

"(Operational Research algorithm) but I'm not sure I have the licence for it"

 

Submit

PROC SETINIT;

RUN;

 

You should see a line a log that reads SAS/OR, otherwise, you haven't it licensed.

Data never sleeps
majdi_ka
Obsidian | Level 7

Thanks for your answer. 

 

I checked and I don't have licence for it..

 

MK

 

 

PGStats
Opal | Level 21

If you don't have the SAS/OR licence, check out this wonderful Smiley Happy article about a way to find connected components without SAS/OR:

 

https://communities.sas.com/t5/SAS-Communities-Library/How-to-find-all-connected-components-in-a-gra...

 

 

PG
majdi_ka
Obsidian | Level 7

Thank you @PGStats for the great article ! 

 

I 've tried the example in the article, and it worked ! (there's a ":" to replace with ":" in the macro SubGraphs)
I've tried it also on @ndp's example, adding other ids and it worked !

 

Thanks a lot !

 

MK

PGStats
Opal | Level 21

Thank you for finding the : problem. I revised the article. Cheers!

PG
ndp
Quartz | Level 8 ndp
Quartz | Level 8

I don't know about SAS/OR but if you can create a dataset of pairs you can do this in base SAS

data pair;
input pair $;
datalines;
01-02
01-03
02-03
04-05
04-06
05-06
;
run;

data new(drop=pair1 pair2);
	set pair;
	retain unique pair1 pair2;
	if _n_=1 then do; unique=1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
	else if indexw(translate(pair," ","-"),pair1)<=0 and indexw(translate(pair," ","-"),pair2)<=0 then do; unique=unique+1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
run;
majdi_ka
Obsidian | Level 7

Thanks @ndp for your answer,

 

It worked for your example, but when adding another observation 07-01 to the data  I don't get the same unique id as the others.

it would work if I added 01-07 instead, and then sorted the table.

 

I didn't verify the program with all the prequisites, but it's still a good solution.

 

Thanks a lot !