Re: SAS EG to Clean Email Addresses

tessa_h · Posted 10-14-2016 10:56 AM

Hi All,

Forgive me as I am a new SAS EG user (version 7.1); I have been assigned a data scrubbing task and it's been requested that I use SAS EG in order to do so.

I've been moving forward fairly well; however, I've encountered a problem that I cannot seem to intuitively resolve. I have several data sets which all have email addresses. Some of my email addresses contain junk characters which I have used the compress function to remove.

Unfortunately, I also have email addresses where I'd like to remove some characters based on where in the email address they are positioned.

Here are 2 examples:

Email Address: ..johndoe@gmail.com

Email Address: johndoe@gmail.com..

Obviously you can see what I need to do.

I'd like to use an advanced expression to

1. Remove all non-alphabetic characters before the first alphabetic character in the email string

2. Remove all characters after ".com"

Thoughts?

Thanks in advance,

Tessa

ballardw · Posted 10-14-2016 11:22 AM

Before blindly going after "everything after .com" you may want to verify that you have nothing like

bill@this.companyname.com in your data.

tessa_h · Posted 10-14-2016 11:25 AM

Fair statement- without combing through, manually, hundreds of thousands of rows of data, do you have a solution that can help me out?

I also realized that I will have other situations like .gov, .ca, .edu....so perhaps the question is more to remove non alphabetic characters at the end of the email string, regardless of email info.

Make sense?

tessa_h · Posted 10-14-2016 02:27 PM

I appreciate the perspective; it's always helpful to get some insight to a problem. However, I don't appreciate an implication that I'm "blindly" going into a dataset without a variety of considerations.

Thanks

ChrisHemedinger · Posted 10-14-2016 03:01 PM

I don't think @ballardw meant anything by that, other than to be helpful. As someone who also works with e-mail addresses, I'm always surprised by the number of variations caused by "subdomains" in a single organization. E-mail addresses have an @ symbol and end with a "dot something", but beyond that they seem to defy your typical expected formulas.

Become an Explorer! Join SAS Analytics Explorers to learn and complete challenges that earn rewards!

tessa_h · Posted 10-14-2016 03:11 PM

That's one of the downfalls of communicating via a forum like this- no tone, no facial expression, etc...regardless of the intention of the communication, it's critically important to consider how a message can be received. Not to mention that when I have used forums like this for help in the past, there seems to be a population of elitist types who enjoy pointing out the errors that others have made...just to enhance a grandiose sense of self. To avoid that type of issue, I included my level of knowledge using the program; I am the first to admit that I don't know what I don't know...I anticipate that moving forward with additional questions as they come up, I am sure to encounter these personality types. But to your point, it's pretty amazing the amount of considerations...overwhelming, in fact...and as I slowly remove layer after layer to this half a million row problem, I discover a new consideration...ie .com is one thing...but then...com, .edu, .gov, etc etc etc....there is no single easy solution. I appreciate help in this frustrating process that is not unique to me. 🙂

ChrisHemedinger · Posted 10-14-2016 03:21 PM

@tessa_h wrote:
... when I have used forums like this for help in the past, there seems to be a population of elitist types who enjoy pointing out the errors that others have made...just to enhance a grandiose sense of self.

Welcome to the SAS Support Communities -- most of the experts here answer with humility, and a genuine desire to help. Once in a while there is a language/culture barrier (it's a global community), but if you assume that people have the best intentions, you shouldn't be disappointed.

Regarding your challenge, this paper might help.

Become an Explorer! Join SAS Analytics Explorers to learn and complete challenges that earn rewards!

ballardw · Posted 10-14-2016 06:33 PM

I did not mean anything negative about the "blind" comment. We have new posters here with very wide differences in experience. It is not uncommon to see a "desired result" phrasing that may not do what is actually intended as specified. Often the statement is incomplete or misses a boundary condition such that the suggestions made by community members does exactly what was requested it is not what the poster actually intended.

Often a reminder to look at all of the data helps clear up rules specifications.

I do sympathize as I have spent lots of time scrubbing data in a fairly large range of topics. One of my favorite examples is before a survey we provided the data recorders instructions about how to enter expected names. One example International Buisness Machines or IBM in any form was to be entered as "IBM". At this point take a short break and think about how many ways that may have been actually entered into the data.

If your guess was over 10 you may be getting paranoid enough to be a good data scrubber. The actual count was 18 different forms with the most entertaining being I>B>M>. Which one will note immediately is the result of holding the shift key down and typing a period after each letter. Note that the original instructions specifically did NOT include a period. Other spellings involved random shifts of capitalization combined with one or more periods after letters (not always all 3 letters).

So when I saw a request similar to "trim everything after .com" I thought it may be appropriate to remind other valid values may exist. And your response about .gov and such showed that you picked up on the issue quickly.

tessa_h · Posted 10-17-2016 07:42 AM

Ohhh yes, I'm very aware of the billion different ways a person can enter in info...back in the day when all I DID was scrub data...how many times to you think the word "feline" can be messed up?

feline

feelin

feeline

feilin

filin

feillin

...

Yeah...that was always fun... 😐

cjinsf · Posted 10-14-2016 12:18 PM

using regular expressions probably better then below... what about numbers at beginning of email address like "123test@yahoo.com"? If numbers need to be kept replace 'anyalpha' with 'anyalnum'.

data email_1;
length email $40;
infile cards;
input email $;
cards;
..johndoe@gmail.com
johndoe@gmail.com..
;
run;

proc sql;
create table email_1a as select
t1.email,
reverse(substr(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1))),anyalpha(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1)))),length(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1)))) - anyalpha(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1))))+1)) as email_corr
from email_1 t1
;
quit;

BrunoMueller · Posted 10-14-2016 12:32 PM

As a start you could use the compress function, like below, this will remove all characters, except alphabetic, digits and punctuation anywhere in your email. Maybe one should also check what makes up a valid email address, see here https://en.wikipedia.org/wiki/Email_address

Maybe there are Regular expressions that check for valid email addresses as well. You could use those with the PRX... functions.

Also have a look at this discussion

https://communities.sas.com/t5/SAS-Procedures/validate-email-address/m-p/37459#U37459

data have;
  length email newEmail $ 256;
  email = cats("090a0d"x, "sugus-sugus_sugus.1234@sugus.com.edu", "01"x);

  newEmail = compress(email, "" , "adpk");
  putlog _all_;
run;

Bruno

tessa_h · Posted 10-14-2016 02:22 PM

Thanks, Bruno, for your advice...

Couple of things- I'm already using the compress function to remove some characters. This has been successful so far...

Second, the link to the discussion has an article in it- the link goes to an error page. Not too useful 😞

A colleague of mine pointed me to this article, however, as a new to SAS user, it will take some time to fully understand and implement how this ought to work for the project I am working on.

To be perfectly honest as well, I'm also new to SQL and sadly, I've been tossed into the fire. Data scrubbing isn't a normal job function of mine and it sort of landed on my plate. However, I'm always keen to add more skill sets...so please forgive my lack of knowledge as I get through this...I am well aware that I'm in over my head.

TomKari · Posted 10-14-2016 05:08 PM

You've definitely been tossed in. I just looked up the rules for valid versus invalid email address strings, and it's pretty complicated.

Here's a reference to the rules:

https://en.wikipedia.org/wiki/Email_address

I'll see what I can do over the next few days...right now, I'm under a tight deadline. Regular expressions are the way to go...note that if you can find a regular expression example in Perl, it should translate well to SAS.

Tom

Patrick · Posted 10-14-2016 09:33 PM

Cleansing email addresses is nasty business and you're starting here with a rather difficult task.

You don't have to use a lot of "SAS" though to get this job done. Perl Regular Expressions (RegEx) are great for pattern matchin and pattern replacement.

But first just to answer your initial question:

data have;
  input email_address :$40.;
  email_address_want=email_address;

  /* Remove all non-alphabetic characters before the first alphabetic character in the email string */
  email_address_want=prxchange('s/^\s*[[:^alpha:]]+//oi',1,email_address_want);

  /*  Remove all characters after ".com" */
  email_address_want=prxchange('s/(?<=(\.com)).+$//oi',1,email_address_want);

  /* Alternative: Remove all non-alphanumeric characters at end of the email address */
/*  email_address_want=prxchange('s/[[:^alnum:]]+\s*$//oi',1,email_address_want);*/
datalines;
..johndoe@gmail.com
johndoe@gmail.com..
johndoe@gmail.com.au&%$.
*&^%$5467johndoe@gmail.org&%$.
;
run;

SAS implemented RegEx and it's available via SAS functions and call routines starting with PRX...

https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p0w6napahk...

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003288497.htm

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

RegEx are widely used and email verification is a common application for RegEx. You will find heaps of examples on the Internet.

What I would do is to first implement a RegEx for email verification - so a RegEx which defines the pattern for a valid email address (search the Internet for such a RegEx as starting point).

Then create a "bad' file with all email addresses not conforming to this pattern and start developing and testing your RegEx for cleansing.

As others already pointed out the actual RFC standard for valid email addresses is astonishingly wide. Depending on where your data comes from (like anything from the Internet or just a set of company email addresses) you will want to narrow down the pattern for valid email addresses to suit your data.

If you don't know RegEx syntax: It's a bit tidious but very worth learning.

Good luck!

TomKari · Posted 10-15-2016 09:29 AM

Just to get you started, I pulled a simple Regex example off of the internet, from

http://www.regular-expressions.info/email.html

Here's a little piece of code that will use the example regex to divide the addresses into good or bad.

BTW, I can confirm what every has said about @ballardw. What was said was only in the spirit of being helpful; any other interpretation is due to the nature of the conversation.

Tom

data Good Bad;

if _n_ = 1

then do;

retain PRX1;

PRX1 = prxparse("/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\s*$/i");

drop PRX1;

end;

set Have;

if prxmatch(PRX1, TestString) then

output Good;

else output Bad;

run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away