- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
Forgive me as I am a new SAS EG user (version 7.1); I have been assigned a data scrubbing task and it's been requested that I use SAS EG in order to do so.
I've been moving forward fairly well; however, I've encountered a problem that I cannot seem to intuitively resolve. I have several data sets which all have email addresses. Some of my email addresses contain junk characters which I have used the compress function to remove.
Unfortunately, I also have email addresses where I'd like to remove some characters based on where in the email address they are positioned.
Here are 2 examples:
Email Address: ..johndoe@gmail.com
Email Address: johndoe@gmail.com..
Obviously you can see what I need to do.
I'd like to use an advanced expression to
1. Remove all non-alphabetic characters before the first alphabetic character in the email string
2. Remove all characters after ".com"
Thoughts?
Thanks in advance,
Tessa
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Before blindly going after "everything after .com" you may want to verify that you have nothing like
bill@this.companyname.com in your data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Fair statement- without combing through, manually, hundreds of thousands of rows of data, do you have a solution that can help me out?
I also realized that I will have other situations like .gov, .ca, .edu....so perhaps the question is more to remove non alphabetic characters at the end of the email string, regardless of email info.
Make sense?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I appreciate the perspective; it's always helpful to get some insight to a problem. However, I don't appreciate an implication that I'm "blindly" going into a dataset without a variety of considerations.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I don't think @ballardw meant anything by that, other than to be helpful. As someone who also works with e-mail addresses, I'm always surprised by the number of variations caused by "subdomains" in a single organization. E-mail addresses have an @ symbol and end with a "dot something", but beyond that they seem to defy your typical expected formulas.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@tessa_h wrote:
... when I have used forums like this for help in the past, there seems to be a population of elitist types who enjoy pointing out the errors that others have made...just to enhance a grandiose sense of self.
Welcome to the SAS Support Communities -- most of the experts here answer with humility, and a genuine desire to help. Once in a while there is a language/culture barrier (it's a global community), but if you assume that people have the best intentions, you shouldn't be disappointed.
Regarding your challenge, this paper might help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I did not mean anything negative about the "blind" comment. We have new posters here with very wide differences in experience. It is not uncommon to see a "desired result" phrasing that may not do what is actually intended as specified. Often the statement is incomplete or misses a boundary condition such that the suggestions made by community members does exactly what was requested it is not what the poster actually intended.
Often a reminder to look at all of the data helps clear up rules specifications.
I do sympathize as I have spent lots of time scrubbing data in a fairly large range of topics. One of my favorite examples is before a survey we provided the data recorders instructions about how to enter expected names. One example International Buisness Machines or IBM in any form was to be entered as "IBM". At this point take a short break and think about how many ways that may have been actually entered into the data.
If your guess was over 10 you may be getting paranoid enough to be a good data scrubber. The actual count was 18 different forms with the most entertaining being I>B>M>. Which one will note immediately is the result of holding the shift key down and typing a period after each letter. Note that the original instructions specifically did NOT include a period. Other spellings involved random shifts of capitalization combined with one or more periods after letters (not always all 3 letters).
So when I saw a request similar to "trim everything after .com" I thought it may be appropriate to remind other valid values may exist. And your response about .gov and such showed that you picked up on the issue quickly.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Ohhh yes, I'm very aware of the billion different ways a person can enter in info...back in the day when all I DID was scrub data...how many times to you think the word "feline" can be messed up?
feline
feelin
feeline
feilin
filin
feillin
...
Yeah...that was always fun... 😐
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
using regular expressions probably better then below... what about numbers at beginning of email address like "123test@yahoo.com"? If numbers need to be kept replace 'anyalpha' with 'anyalnum'.
data email_1;
length email $40;
infile cards;
input email $;
cards;
..johndoe@gmail.com
johndoe@gmail.com..
;
run;
proc sql;
create table email_1a as select
t1.email,
reverse(substr(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1))),anyalpha(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1)))),length(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1)))) - anyalpha(reverse(trim(substr(t1.email,anyalpha(t1.email),length(t1.email) - anyalpha(t1.email)+1))))+1)) as email_corr
from email_1 t1
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As a start you could use the compress function, like below, this will remove all characters, except alphabetic, digits and punctuation anywhere in your email. Maybe one should also check what makes up a valid email address, see here https://en.wikipedia.org/wiki/Email_address
Maybe there are Regular expressions that check for valid email addresses as well. You could use those with the PRX... functions.
Also have a look at this discussion
https://communities.sas.com/t5/SAS-Procedures/validate-email-address/m-p/37459#U37459
data have;
length email newEmail $ 256;
email = cats("090a0d"x, "sugus-sugus_sugus.1234@sugus.com.edu", "01"x);
newEmail = compress(email, "" , "adpk");
putlog _all_;
run;
Bruno
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, Bruno, for your advice...
Couple of things- I'm already using the compress function to remove some characters. This has been successful so far...
Second, the link to the discussion has an article in it- the link goes to an error page. Not too useful 😞
A colleague of mine pointed me to this article, however, as a new to SAS user, it will take some time to fully understand and implement how this ought to work for the project I am working on.
To be perfectly honest as well, I'm also new to SQL and sadly, I've been tossed into the fire. Data scrubbing isn't a normal job function of mine and it sort of landed on my plate. However, I'm always keen to add more skill sets...so please forgive my lack of knowledge as I get through this...I am well aware that I'm in over my head.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You've definitely been tossed in. I just looked up the rules for valid versus invalid email address strings, and it's pretty complicated.
Here's a reference to the rules:
https://en.wikipedia.org/wiki/Email_address
I'll see what I can do over the next few days...right now, I'm under a tight deadline. Regular expressions are the way to go...note that if you can find a regular expression example in Perl, it should translate well to SAS.
Tom
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Cleansing email addresses is nasty business and you're starting here with a rather difficult task.
You don't have to use a lot of "SAS" though to get this job done. Perl Regular Expressions (RegEx) are great for pattern matchin and pattern replacement.
But first just to answer your initial question:
data have;
input email_address :$40.;
email_address_want=email_address;
/* Remove all non-alphabetic characters before the first alphabetic character in the email string */
email_address_want=prxchange('s/^\s*[[:^alpha:]]+//oi',1,email_address_want);
/* Remove all characters after ".com" */
email_address_want=prxchange('s/(?<=(\.com)).+$//oi',1,email_address_want);
/* Alternative: Remove all non-alphanumeric characters at end of the email address */
/* email_address_want=prxchange('s/[[:^alnum:]]+\s*$//oi',1,email_address_want);*/
datalines;
..johndoe@gmail.com
johndoe@gmail.com..
johndoe@gmail.com.au&%$.
*&^%$5467johndoe@gmail.org&%$.
;
run;
SAS implemented RegEx and it's available via SAS functions and call routines starting with PRX...
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003288497.htm
https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
RegEx are widely used and email verification is a common application for RegEx. You will find heaps of examples on the Internet.
What I would do is to first implement a RegEx for email verification - so a RegEx which defines the pattern for a valid email address (search the Internet for such a RegEx as starting point).
Then create a "bad' file with all email addresses not conforming to this pattern and start developing and testing your RegEx for cleansing.
As others already pointed out the actual RFC standard for valid email addresses is astonishingly wide. Depending on where your data comes from (like anything from the Internet or just a set of company email addresses) you will want to narrow down the pattern for valid email addresses to suit your data.
If you don't know RegEx syntax: It's a bit tidious but very worth learning.
Good luck!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just to get you started, I pulled a simple Regex example off of the internet, from
http://www.regular-expressions.info/email.html
Here's a little piece of code that will use the example regex to divide the addresses into good or bad.
BTW, I can confirm what every has said about @ballardw. What was said was only in the spirit of being helpful; any other interpretation is due to the nature of the conversation.
Tom
data Good Bad;
if _n_ = 1
then do;
retain PRX1;
PRX1 = prxparse("/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\s*$/i");
drop PRX1;
end;
set Have;
if prxmatch(PRX1, TestString) then
output Good;
else output Bad;
run;