BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
majdi_ka
Obsidian | Level 7

Hello,

I'm working on a data quality problem, and more specifically detecting duplicates with fuzzy logic using COMPGED and COMPLEV functions based on  this paper.

 

I have detected duplicates, but per couple.
example:
assuming that customers with customer_id equals 01, 02 and 03 reference the same customer (let's call him John), my SAS program detects it this way:

 

row_num | duplicate_id 
1 | 01-02 
2 | 01-03 
3 | 02-03

where duplicate_id is an id that I've created combining both customer ids to reference a couple of duplicates.

 

So what I want now is to have a unique id for John !

The final result that I want is :
duplicate_id | unique_id
01-02 | 01
01-03 | 01
02-03 | 01

 

I had the idea to use undirected graph (Operational Research algorithm) but I'm not sure I have the licence for it, and I'm not sure if I can associate a new id even if it works.

 

the first example in this paper that uses hash objects can be a first step to find the solution, but the expected result is not what I want.

 

Any ideas? Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

If you don't have the SAS/OR licence, check out this wonderful Smiley Happy article about a way to find connected components without SAS/OR:

 

https://communities.sas.com/t5/SAS-Communities-Library/How-to-find-all-connected-components-in-a-gra...

 

 

PG

View solution in original post

7 REPLIES 7
LinusH
Tourmaline | Level 20

"(Operational Research algorithm) but I'm not sure I have the licence for it"

 

Submit

PROC SETINIT;

RUN;

 

You should see a line a log that reads SAS/OR, otherwise, you haven't it licensed.

Data never sleeps
majdi_ka
Obsidian | Level 7

Thanks for your answer. 

 

I checked and I don't have licence for it..

 

MK

 

 

PGStats
Opal | Level 21

If you don't have the SAS/OR licence, check out this wonderful Smiley Happy article about a way to find connected components without SAS/OR:

 

https://communities.sas.com/t5/SAS-Communities-Library/How-to-find-all-connected-components-in-a-gra...

 

 

PG
majdi_ka
Obsidian | Level 7

Thank you @PGStats for the great article ! 

 

I 've tried the example in the article, and it worked ! (there's a ":" to replace with ":" in the macro SubGraphs)
I've tried it also on @ndp's example, adding other ids and it worked !

 

Thanks a lot !

 

MK

PGStats
Opal | Level 21

Thank you for finding the : problem. I revised the article. Cheers!

PG
ndp
Quartz | Level 8 ndp
Quartz | Level 8

I don't know about SAS/OR but if you can create a dataset of pairs you can do this in base SAS

data pair;
input pair $;
datalines;
01-02
01-03
02-03
04-05
04-06
05-06
;
run;

data new(drop=pair1 pair2);
	set pair;
	retain unique pair1 pair2;
	if _n_=1 then do; unique=1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
	else if indexw(translate(pair," ","-"),pair1)<=0 and indexw(translate(pair," ","-"),pair2)<=0 then do; unique=unique+1; pair1=scan(pair,1,"-"); pair2=scan(pair,2,"-"); end;
run;
majdi_ka
Obsidian | Level 7

Thanks @ndp for your answer,

 

It worked for your example, but when adding another observation 07-01 to the data  I don't get the same unique id as the others.

it would work if I added 01-07 instead, and then sorted the table.

 

I didn't verify the program with all the prequisites, but it's still a good solution.

 

Thanks a lot !

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 1412 views
  • 3 likes
  • 4 in conversation