Dear SAS Community
I would like to be able to link accounts with unique e-mails which are similar to each other, for example:
Account 1. vladimr241@gmail.com
Account 2. vladimr231@gmail.com
Account 3. vladim1245@gmail.com
Account 4. vladimra3333@gmail.com
The ultimate goal would be to create a summary table which would say that, based on the example above, we are dealing with:
- 1 account holder (1 person responsible for creating all accounts) linked with 4 similar e-mails.
- Or we can summarise it as 4 accounts linked with 1 e-mail (so we are still assuming 1 person responsible for creating all accounts, but this time we are saying that four accounts were created using the same, as in almost identical, e-mail address).
I came across a SAS paper (PDF) titled "Using Edit-Distance Functions to Identify “Similar” E-Mail Addresses", which discusses the SPEDIS, COMPLEV and COMPGED functions. Unfortunately, the discussion there is quite vague for me and I probably need a slightly more basic tutorial, so I was wondering whether there is any standard query that would meet my requirements (i.e. summarise the data in the way described above) if I applied it to tens of thousands of e-mail addresses.
This is a difficult question to answer. I tried the SOUNDEX function, shown below, but for tens of thousands of different names you will need a different approach. I will try to find one and post the answer, if any.
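/* Read the sample e-mail addresses and compute a SOUNDEX code for each */
data newdata;
   input Emailid $30.;
   Emailid1 = soundex(Emailid);
   datalines;
vladimr241@gmail.com
vladimr231@gmail.com
vladim1245@gmail.com
vladimra100000@gmail.com
Hello@gmail.com
Val@1234567890
;
run;

/* Print each address alongside its SOUNDEX code */
proc print data=newdata;
run;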
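As a possible alternative to SOUNDEX, here is a minimal sketch using COMPLEV, one of the edit-distance functions mentioned in the question. It compares only the part of each address before the @ sign and keeps the pairs whose Levenshtein distance falls at or below a cutoff. The dataset and variable names (emails, account, localpart, similar_pairs) and the cutoff of 4 are my own assumptions for illustration, not something taken from the paper, and the cutoff would need tuning on real data.

```sas
/* Sketch only: names and the cutoff of 4 are illustrative assumptions */
data emails;
   input account email :$40.;
   length localpart $40;
   /* keep the part before the @ sign, lower-cased */
   localpart = lowcase(scan(email, 1, '@'));
   datalines;
1 vladimr241@gmail.com
2 vladimr231@gmail.com
3 vladim1245@gmail.com
4 vladimra3333@gmail.com
5 Hello@gmail.com
;
run;

proc sql;
   /* compare every account with every later account exactly once */
   create table similar_pairs as
   select a.account as account_a,
          b.account as account_b,
          a.email   as email_a,
          b.email   as email_b,
          complev(a.localpart, b.localpart) as lev_dist
   from emails as a, emails as b
   where a.account < b.account
     and calculated lev_dist <= 4   /* assumed cutoff, tune as needed */
   order by account_a, account_b;
quit;

proc print data=similar_pairs noobs;
run;
```

Note that the self-join compares every pair of addresses, so with tens of thousands of e-mails it produces a very large number of comparisons. Restricting the join to addresses that share the same domain, or the same first few characters, is one way that might reduce the work. Grouping the resulting pairs into one cluster per presumed account holder is a separate step that this sketch does not cover.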