Solved: Re: Compged - Email similarities.

Sdixon1 · Posted 02-12-2023 06:57 PM

Hi all.

Newish user to SAS here.

I'm hoping I could please get some guidance with the COMPGED function, or another function that will return what I am after.

I have a dataset in which I want to compare emails vertically (each row against the other rows) to find those that are similar to one another.

e.g.
Data:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

How would I approach this to return a score for emails that are 'similar'? Happy to build upon possible solutions!

Thank you.

S

ChrisNZ · Posted 02-12-2023 09:05 PM

Basic strategy: Compare all raw email addresses

proc sql;
  /* Cartesian self-join: every address is scored against every other one */
  select a.EMAIL, b.EMAIL, compged(a.EMAIL, b.EMAIL) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL;
quit;

This can produce massive volumes, since every address is compared with every other one.

Smarter: Add some improvements as needed, depending on the data

proc sql;
  /* Compare in lower case, and only join rows sharing the same first letter */
  select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL));
quit;

Here, we ensure the case is the same, and we reduce the size of the join by adding a relevant criterion, such as the first letter being the same.
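
A possible further refinement, not part of the reply above: with `ne`, each pair comes back twice, once as (A, B) and once as (B, A). A minimal sketch, assuming the same HAVE table and EMAIL column, that keeps one row per pair by using `<` instead. The output column names EMAIL_A and EMAIL_B are made up for illustration.

```sas
/* Sketch only: HAVE and EMAIL are the names used in the reply above. */
proc sql;
  select a.EMAIL as EMAIL_A,
         b.EMAIL as EMAIL_B,
         compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL < b.EMAIL                                /* one row per pair */
    and lowcase(first(a.EMAIL)) = lowcase(first(b.EMAIL))
  order by SCORE;                                        /* closest matches first */
quit;
```

One caveat: COMPGED does not charge the same cost for every operation in both directions, so the score for (B, A) may differ from the score for (A, B). Keep `ne` if both directions matter.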

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 02-12-2023 09:08 PM

You could have other criteria, such as similar length, or same domain.
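
As a rough illustration of those two criteria (a sketch under assumptions, not tested against the original data): the domain is taken as the text after the '@', and "similar length" is taken to mean within 3 characters, which is an arbitrary threshold.

```sas
/* Sketch only: HAVE and EMAIL as above; EMAIL_A and EMAIL_B are illustrative names. */
proc sql;
  select a.EMAIL as EMAIL_A, b.EMAIL as EMAIL_B,
         compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    /* same domain: text after the '@', compared case-insensitively */
    and lowcase(scan(a.EMAIL, 2, '@')) = lowcase(scan(b.EMAIL, 2, '@'))
    /* similar length: the 3-character tolerance is an arbitrary choice */
    and abs(length(a.EMAIL) - length(b.EMAIL)) <= 3;
quit;
```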

ChrisNZ · Posted 02-12-2023 09:10 PM

You could also add a filter on the output, such as:

proc sql;
  select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL))
  having SCORE < 900;  /* keep only the closer matches */
quit;

Sdixon1 · Posted 02-15-2023 12:19 AM

Thank you for that solution @ChrisNZ.

It has provided exactly what I am after.

I've noticed the score is being affected by the @domain part, which produces false positives because the similarities in the domain are being scored as well.

I created the 'TRIMS' function by Leonid Batkhan to remove the trailing characters after '@' as an attempted workaround, however it doesn't appear to be working on my end.

Would you have any suggestions to remove trailing characters?
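
One way this could be approached, offered as a sketch rather than a tested answer: the built-in SCAN function with '@' as the delimiter returns the text before the '@', so no custom function should be needed. The names WANT and LOCAL_PART and the $64 width are assumptions made for illustration.

```sas
/* Step 1: keep only the part before the '@', in lower case */
data WANT;
  set HAVE;
  length LOCAL_PART $64;                       /* 64 is an assumed width */
  LOCAL_PART = lowcase(scan(EMAIL, 1, '@'));   /* first '@'-delimited token */
run;

/* Step 2: score the local parts only, so the domain no longer affects SCORE */
proc sql;
  select a.EMAIL as EMAIL_A, b.EMAIL as EMAIL_B,
         compged(a.LOCAL_PART, b.LOCAL_PART) as SCORE
  from WANT a, WANT b
  where a.EMAIL ne b.EMAIL
    and first(a.LOCAL_PART) = first(b.LOCAL_PART)
  having SCORE < 900;
quit;
```

With the domain removed, the 900 cutoff carried over from the earlier reply may be too generous, so it is probably worth re-tuning against the data.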