BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Sdixon1
Calcite | Level 5

Hi all. 

Newish user to SAS here. 

I'm hoping I could please get some guidance with compged function or a function that will return what I am after. 

I have dataset where I want to do a vertically look of emails that are similar to one another. 

e.g.
Data:
John.Doe1@hotmail.com
johndoe1@hotmail.com
JohnDoe123@hotmail.com
Mary_Ann1234@hotmail.com
Mkj.Luke@hotmail.com
Johndoe@yahoo.com
Ann_Jane123@gmail.com
Luked123@outlook.com
lucky_star456@yahoo.com

How would I approach this to return a score of emails that 'similar'?  Happy to build upon possible solutions!

Thank you. 

S

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Basic strategy: Compare all raw email addresses

proc sql;
  select a.EMAIL, b.EMAIL, compged(a.EMAIL, b.EMAIL) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL;

This can produce massive volumes,

 

Smarter: Add some improvements as needed, depending on the data 

proc sql;
  select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL));

Here, we ensure the case is the same, and we reduce the size of the join by using an additional relevant criteria, such as the first letter being the same.

 

 

 

View solution in original post

5 REPLIES 5
ChrisNZ
Tourmaline | Level 20

Basic strategy: Compare all raw email addresses

proc sql;
  select a.EMAIL, b.EMAIL, compged(a.EMAIL, b.EMAIL) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL;

This can produce massive volumes,

 

Smarter: Add some improvements as needed, depending on the data 

proc sql;
  select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and lowcase(first(a.EMAIL))=lowcase(first(b.EMAIL));

Here, we ensure the case is the same, and we reduce the size of the join by using an additional relevant criteria, such as the first letter being the same.

 

 

 

ChrisNZ
Tourmaline | Level 20

You could have other criteria, such as similar length, or same domain.

ChrisNZ
Tourmaline | Level 20

You could also add a filter on the output, such as:

proc sql;
  select a.EMAIL, b.EMAIL, compged(lowcase(a.EMAIL), lowcase(b.EMAIL)) as SCORE
  from HAVE a, HAVE b
  where a.EMAIL ne b.EMAIL
    and first(a.EMAIL)=first(b.EMAIL)
  having SCORE < 900;
Sdixon1
Calcite | Level 5

Thank you for that solution @ChrisNZ

It has provided exactly what I am after.

I've noticed the score is being impacted due to the @domain, and is providing false positives as it reading the similarities in this as well.  

I created the 'TRIMS' function by Leonid Batkhan to remove the trailing characters after '@' to attempt a workaround, however doesn't appear to working on my end. 

Would you have any suggestions to remove trialing characters? 


andreas_lds
Jade | Level 19

Using the scan functions should work: scan(email, 1, '@').

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 778 views
  • 0 likes
  • 3 in conversation