Hi All,
Forgive me as I am a new SAS EG user (version 7.1); I have been assigned a data scrubbing task and it's been requested that I use SAS EG in order to do so.
I've been moving forward fairly well; however, I've encountered a problem that I cannot seem to intuitively resolve. I have several data sets which all have email addresses. Some of my email addresses contain junk characters which I have used the compress function to remove.
Unfortunately, I also have email addresses where I'd like to remove some characters based on where in the email address they are positioned.
Here are 2 examples:
Email Address: ..johndoe@gmail.com
Email Address: johndoe@gmail.com..
Obviously you can see what I need to do.
I'd like to use an advanced expression to
1. Remove all non-alphabetic characters before the first alphabetic character in the email string
2. Remove all characters after ".com"
Thoughts?
Thanks in advance,
Tessa
Before blindly going after "everything after .com" you may want to verify that you have nothing like
bill@this.companyname.com in your data.
Fair statement- without combing through, manually, hundreds of thousands of rows of data, do you have a solution that can help me out?
I also realized that I will have other situations like .gov, .ca, .edu....so perhaps the question is more to remove non alphabetic characters at the end of the email string, regardless of email info.
Make sense?
I appreciate the perspective; it's always helpful to get some insight to a problem. However, I don't appreciate an implication that I'm "blindly" going into a dataset without a variety of considerations.
Thanks
I don't think @ballardw meant anything by that, other than to be helpful. As someone who also works with e-mail addresses, I'm always surprised by the number of variations caused by "subdomains" in a single organization. E-mail addresses have an @ symbol and end with a "dot something", but beyond that they seem to defy your typical expected formulas.
@tessa_h wrote:
... when I have used forums like this for help in the past, there seems to be a population of elitist types who enjoy pointing out the errors that others have made...just to enhance a grandiose sense of self.
Welcome to the SAS Support Communities -- most of the experts here answer with humility, and a genuine desire to help. Once in a while there is a language/culture barrier (it's a global community), but if you assume that people have the best intentions, you shouldn't be disappointed.
Regarding your challenge, this paper might help.
I did not mean anything negative about the "blind" comment. We have new posters here with very wide differences in experience. It is not uncommon to see a "desired result" phrasing that may not do what is actually intended as specified. Often the statement is incomplete or misses a boundary condition such that the suggestions made by community members does exactly what was requested it is not what the poster actually intended.
Often a reminder to look at all of the data helps clear up rules specifications.
I do sympathize as I have spent lots of time scrubbing data in a fairly large range of topics. One of my favorite examples is before a survey we provided the data recorders instructions about how to enter expected names. One example International Buisness Machines or IBM in any form was to be entered as "IBM". At this point take a short break and think about how many ways that may have been actually entered into the data.
If your guess was over 10 you may be getting paranoid enough to be a good data scrubber. The actual count was 18 different forms with the most entertaining being I>B>M>. Which one will note immediately is the result of holding the shift key down and typing a period after each letter. Note that the original instructions specifically did NOT include a period. Other spellings involved random shifts of capitalization combined with one or more periods after letters (not always all 3 letters).
So when I saw a request similar to "trim everything after .com" I thought it may be appropriate to remind other valid values may exist. And your response about .gov and such showed that you picked up on the issue quickly.
Ohhh yes, I'm very aware of the billion different ways a person can enter in info...back in the day when all I DID was scrub data...how many times to you think the word "feline" can be messed up?
feline
feelin
feeline
feilin
filin
feillin
...
Yeah...that was always fun... 😐
using regular expressions probably better then below... what about numbers at beginning of email address like "123test@yahoo.com"? If numbers need to be kept replace 'anyalpha' with 'anyalnum'.
data email_1;
length email $40;
infile cards;
input email $;
cards;
..johndoe@gmail.com
johndoe@gmail.com..
;
run;
proc sql;
create table email_1a as select
t1.email,
reverse(substr(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1))),anyalpha(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1)))),length(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1)))) - anyalpha(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1))))+1)) as email_corr
from email_1 t1
;
quit;
As a start you could use the compress function, like below, this will remove all characters, except alphabetic, digits and punctuation anywhere in your email. Maybe one should also check what makes up a valid email address, see here https://en.wikipedia.org/wiki/Email_address
Maybe there are Regular expressions that check for valid email addresses as well. You could use those with the PRX... functions.
Also have a look at this discussion
https://communities.sas.com/t5/SAS-Procedures/validate-email-address/m-p/37459#U37459
data have;
length email newEmail $ 256;
email = cats("090a0d"x, "sugus-sugus_sugus.1234@sugus.com.edu", "01"x);
newEmail = compress(email, "" , "adpk");
putlog _all_;
run;
Bruno
Thanks, Bruno, for your advice...
Couple of things- I'm already using the compress function to remove some characters. This has been successful so far...
Second, the link to the discussion has an article in it- the link goes to an error page. Not too useful 😞
A colleague of mine pointed me to this article, however, as a new to SAS user, it will take some time to fully understand and implement how this ought to work for the project I am working on.
To be perfectly honest as well, I'm also new to SQL and sadly, I've been tossed into the fire. Data scrubbing isn't a normal job function of mine and it sort of landed on my plate. However, I'm always keen to add more skill sets...so please forgive my lack of knowledge as I get through this...I am well aware that I'm in over my head.
You've definitely been tossed in. I just looked up the rules for valid versus invalid email address strings, and it's pretty complicated.
Here's a reference to the rules:
https://en.wikipedia.org/wiki/Email_address
I'll see what I can do over the next few days...right now, I'm under a tight deadline. Regular expressions are the way to go...note that if you can find a regular expression example in Perl, it should translate well to SAS.
Tom
Cleansing email addresses is nasty business and you're starting here with a rather difficult task.
You don't have to use a lot of "SAS" though to get this job done. Perl Regular Expressions (RegEx) are great for pattern matchin and pattern replacement.
But first just to answer your initial question:
data have;
input email_address :$40.;
email_address_want=email_address;
/* Remove all non-alphabetic characters before the first alphabetic character in the email string */
email_address_want=prxchange('s/^\s*[[:^alpha:]]+//oi',1,email_address_want);
/* Remove all characters after ".com" */
email_address_want=prxchange('s/(?<=(\.com)).+$//oi',1,email_address_want);
/* Alternative: Remove all non-alphanumeric characters at end of the email address */
/* email_address_want=prxchange('s/[[:^alnum:]]+\s*$//oi',1,email_address_want);*/
datalines;
..johndoe@gmail.com
johndoe@gmail.com..
johndoe@gmail.com.au&%$.
*&^%$5467johndoe@gmail.org&%$.
;
run;
SAS implemented RegEx and it's available via SAS functions and call routines starting with PRX...
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003288497.htm
https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
RegEx are widely used and email verification is a common application for RegEx. You will find heaps of examples on the Internet.
What I would do is to first implement a RegEx for email verification - so a RegEx which defines the pattern for a valid email address (search the Internet for such a RegEx as starting point).
Then create a "bad' file with all email addresses not conforming to this pattern and start developing and testing your RegEx for cleansing.
As others already pointed out the actual RFC standard for valid email addresses is astonishingly wide. Depending on where your data comes from (like anything from the Internet or just a set of company email addresses) you will want to narrow down the pattern for valid email addresses to suit your data.
If you don't know RegEx syntax: It's a bit tidious but very worth learning.
Good luck!
Just to get you started, I pulled a simple Regex example off of the internet, from
http://www.regular-expressions.info/email.html
Here's a little piece of code that will use the example regex to divide the addresses into good or bad.
BTW, I can confirm what every has said about @ballardw. What was said was only in the spirit of being helpful; any other interpretation is due to the nature of the conversation.
Tom
data Good Bad;
if _n_ = 1
then do;
retain PRX1;
PRX1 = prxparse("/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\s*$/i");
drop PRX1;
end;
set Have;
if prxmatch(PRX1, TestString) then
output Good;
else output Bad;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.