11-20-2015 11:06 AM
Dear SAS Community
I would like to be able to link accounts with unique e-mails which are similar to each other, for example:
Account 1. email@example.com
Account 2. firstname.lastname@example.org
Account 3. email@example.com
Account 4. firstname.lastname@example.org
The ultimate goal would be to create a summary table which would say that, based on the example above, we are dealing with:
- 1 account holder (1 person responsible for creating all accounts) linked with 4 similar emails.
- Or we can summarise it as 4 accounts linked with 1 e-mail (so we are still assuming 1 person is responsible for creating all accounts, but this time we are saying that four accounts were created using the same, as in almost identical, e-mail address).
I came across a SAS paper (PDF) titled "Using Edit-Distance Functions to Identify “Similar” E-Mail Addresses", which discusses the SPEDIS, COMPLEV and COMPGED functions. Unfortunately, the discussion there is quite vague and I probably need a slightly more basic tutorial, so I was wondering whether there is a standard query that would meet my requirements (i.e. summarise the data in the way described above) if I applied it to tens of thousands of e-mail addresses.
11-28-2015 09:39 AM
It is a difficult question to answer. I tried the SOUNDEX function, but for tens of thousands of different names you have to find a different approach. I will try to find one and post the answer, if any.
input Emailid$ 30.;
proc print data=newdata; run;
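As a starting point for the edit-distance approach mentioned in the question, here is a minimal sketch using COMPLEV (Levenshtein distance) in a PROC SQL self-join. The data set name accounts, the variable names, the sample addresses and the cutoff of 2 are all made up for illustration; they are not from the original post, and the cutoff in particular would need tuning on real data.

```sas
/* Hypothetical sample data: account IDs and e-mail addresses are invented */
data accounts;
    infile datalines dlm=',' truncover;
    input account_id email :$50.;
    datalines;
1,john.smith@example.com
2,john.smith1@example.com
3,jon.smith@example.com
4,mary.jones@example.org
;
run;

/* Compare each pair of accounts once (a.account_id < b.account_id)   */
/* and keep the pairs whose Levenshtein distance is at most 2.        */
/* The cutoff of 2 is an assumption, not a recommended value.         */
proc sql;
    create table similar_pairs as
    select a.account_id as account_a,
           b.account_id as account_b,
           a.email      as email_a,
           b.email      as email_b,
           complev(lowcase(strip(a.email)),
                   lowcase(strip(b.email))) as lev_dist
    from accounts as a, accounts as b
    where a.account_id < b.account_id
      and calculated lev_dist <= 2
    order by account_a, account_b;
quit;

proc print data=similar_pairs noobs;
run;
```

With this sample, accounts 1, 2 and 3 should pair up with each other and account 4 should not appear. Note that the self-join compares every address with every other one, so for tens of thousands of addresses it produces hundreds of millions of comparisons. It would probably be worth restricting the join first, for example to addresses that share the same domain. Turning the pairs into "1 account holder linked with N e-mails" is a further step (grouping linked pairs together) that this sketch does not cover.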