Solved: Compged - Email similarities.

Sdixon1 · Posted 02-12-2023 06:57 PM

Hi all.

Newish user to SAS here.

I'm hoping to get some guidance on the compged function, or on another function that will return what I am after.

I have a dataset in which I want to compare emails vertically (each row against the other rows) to find those that are similar to one another.

e.g.
Data:
John.Doe1@hotmail.com
johndoe1@hotmail.com
JohnDoe123@hotmail.com
Mary_Ann1234@hotmail.com
Mkj.Luke@hotmail.com
Johndoe@yahoo.com
Ann_Jane123@gmail.com
Luked123@outlook.com
lucky_star456@yahoo.com

How would I approach this to return a score for emails that are 'similar'? Happy to build upon possible solutions!

Thank you.

S

ChrisNZ · Posted 02-12-2023 09:05 PM

Basic strategy: Compare all raw email addresses

proc sql;
  /* Self-join: score every address against every other address */
  select a.EMAIL, b.EMAIL, compged(a.EMAIL, b.EMAIL) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL;
quit;

This can produce massive volumes, since every address is compared with every other one (N rows give up to N×(N-1) pairs).

Smarter: Add some improvements as needed, depending on the data

proc sql;
  /* Compare in lower case, and only pair addresses sharing a first letter */
  select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL));
quit;

Here, we ensure the case is the same, and we reduce the size of the join by using an additional relevant criterion, such as the first letter being the same.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 02-12-2023 09:08 PM

You could have other criteria, such as similar length, or same domain.
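As a rough sketch of those two extra criteria, the query below keeps only pairs with the same domain and a similar length. The HAVE table and EMAIL column follow the names used above; the sample rows come from the question, and the 3-character length tolerance is an arbitrary assumption to tune against your own data.

```sas
/* Sample data taken from the question; HAVE and EMAIL match the thread's names */
data HAVE;
  length EMAIL $40;
  input EMAIL $;
  datalines;
John.Doe1@hotmail.com
johndoe1@hotmail.com
JohnDoe123@hotmail.com
Mary_Ann1234@hotmail.com
Johndoe@yahoo.com
lucky_star456@yahoo.com
;
run;

proc sql;
  select a.EMAIL as EMAIL_A, b.EMAIL as EMAIL_B,
         compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    /* Same domain: compare the part after '@', ignoring case */
    and lowcase(scan(a.EMAIL, 2, '@')) = lowcase(scan(b.EMAIL, 2, '@'))
    /* Similar length: the tolerance of 3 is an assumption, not a rule */
    and abs(length(a.EMAIL) - length(b.EMAIL)) <= 3;
quit;
```

Each extra condition in the where clause shrinks the join before any scores are returned, so it is worth adding the cheapest and most selective tests for your data.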

ChrisNZ · Posted 02-12-2023 09:10 PM

You could also add a filter on the output, such as:

proc sql;
  select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL))
  /* Keep only the closer matches; adjust the threshold to suit the data */
  having SCORE < 900;
quit;

Sdixon1 · Posted 02-15-2023 12:19 AM

Thank you for that solution @ChrisNZ.

It has provided exactly what I am after.

I've noticed the score is being affected by the @domain part, which produces false positives because the similarity between domains is being scored as well.

I created the 'TRIMS' function by Leonid Batkhan to remove the trailing characters after '@' as a workaround, but it doesn't appear to be working on my end.

Would you have any suggestions for removing the trailing characters?

andreas_lds · Posted 02-15-2023 01:09 AM

Using the scan function should work: scan(email, 1, '@').
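For illustration, here is one way that call might be dropped into the earlier query so that only the part before the '@' is scored. This is a sketch under assumptions: the HAVE table and EMAIL column follow the names used in the thread, the four sample rows come from the original question, and the ORDER BY is added only to list the closest pairs first.

```sas
/* Small sample taken from the question; HAVE and EMAIL match the thread's names */
data HAVE;
  length EMAIL $40;
  input EMAIL $;
  datalines;
John.Doe1@hotmail.com
johndoe1@hotmail.com
JohnDoe123@hotmail.com
Johndoe@yahoo.com
;
run;

proc sql;
  select a.EMAIL as EMAIL_A, b.EMAIL as EMAIL_B,
         /* scan(..., 1, '@') keeps the text before the first '@' */
         compged(lowcase(scan(a.EMAIL, 1, '@')),
                 lowcase(scan(b.EMAIL, 1, '@'))) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and lowcase(first(a.EMAIL)) = lowcase(first(b.EMAIL))
  order by SCORE;
quit;
```

Because the full addresses are still selected and compared in the where clause, the output shows the original emails while the score reflects the local part only.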
