Hi all.
Newish user to SAS here.
I'm hoping I could please get some guidance with compged function or a function that will return what I am after.
I have dataset where I want to do a vertically look of emails that are similar to one another.
e.g.
Data:
John.Doe1@hotmail.com
johndoe1@hotmail.com
JohnDoe123@hotmail.com
Mary_Ann1234@hotmail.com
Mkj.Luke@hotmail.com
Johndoe@yahoo.com
Ann_Jane123@gmail.com
Luked123@outlook.com
lucky_star456@yahoo.com
How would I approach this to return a score of emails that 'similar'? Happy to build upon possible solutions!
Thank you.
S
Basic strategy: Compare all raw email addresses
proc sql;
select a.EMAIL, b.EMAIL, compged(a.EMAIL, b.EMAIL) as SCORE
from HAVE a, HAVE b
where a.EMAIL ne b.EMAIL;
This can produce massive volumes,
Smarter: Add some improvements as needed, depending on the data
proc sql;
select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
from HAVE a, HAVE b
where a.EMAIL ne b.EMAIL
and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL));
Here, we ensure the case is the same, and we reduce the size of the join by using an additional relevant criteria, such as the first letter being the same.
Basic strategy: Compare all raw email addresses
proc sql;
select a.EMAIL, b.EMAIL, compged(a.EMAIL, b.EMAIL) as SCORE
from HAVE a, HAVE b
where a.EMAIL ne b.EMAIL;
This can produce massive volumes,
Smarter: Add some improvements as needed, depending on the data
proc sql;
select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
from HAVE a, HAVE b
where a.EMAIL ne b.EMAIL
and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL));
Here, we ensure the case is the same, and we reduce the size of the join by using an additional relevant criteria, such as the first letter being the same.
You could have other criteria, such as similar length, or same domain.
You could also add a filter on the output, such as:
proc sql;
select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
from HAVE a, HAVE b
where a.EMAIL ne b.EMAIL
and first(a.EMAIL)=first(b.EMAIL)
having SCORE < 900;
Thank you for that solution @ChrisNZ.
It has provided exactly what I am after.
I've noticed the score is being impacted due to the @domain, and is providing false positives as it reading the similarities in this as well.
I created the 'TRIMS' function by Leonid Batkhan to remove the trailing characters after '@' to attempt a workaround, however doesn't appear to working on my end.
Would you have any suggestions to remove trialing characters?
Using the scan functions should work: scan(email, 1, '@').
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.