Finding bad email address with prxparse how do you stop multiple @

deleted_user · Posted 05-07-2008 12:05 PM

Hi,
I am trying to find bad email addresses in a file, and I am using the following code:
data lines
adsf@mekto.com adre@mekto.com
adsf@mekto.com adr
erese@mekel@sdfsd.com
my@my.com
23432@295.com
my~@dsf.ca
myse@sere
@com.ec
fe!@d.com
adfe'j@cfe.ca
adfsd@nl.jjle.ca

data work.testlht3;
set work.mytest;
if _n_=1 then do;
re= prxparse("/((\w|\.|\-)+@(\w|\.|\-))+/");
end;
retain re;
if ^prxmatch(re,email) then LHT=0;
else LHT=1;
run;

This only flags my~@dsf.ca, @com.ec and fe!@d.com.
My problem is that it does not pick up email addresses with multiple @ characters, or addresses that contain spaces.
These are the test addresses I want flagged as wrong, but I can't seem to get the correct Perl expression:
adsf@mekto.com adre@mekto.com (has basically 2 addresses in the field)
adsf@mekto.com adr (has space and then text)
erese@mekel@sdfsd.com (multiple @)
myse@sere (no .com)
adfe'j@cfe.ca (there is a ' in this address)

I have also tried the following statement but it doesn't work the way I need either.
We do have valid addresses like mh@nxe.ener.ds.com not just mm@mse.com

prxparse('/ \w[-.\w]*\@[-\w]+(\.[-\w]+)*\.(ca|com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z]) /i');

Any ideas??
Thanks

deleted_user · Posted 05-07-2008 12:23 PM

The way to eat an elephant is one bite at a time.
Don't run yourself into the ground trying to make one thing do everything.
Test the pattern.
Test for multiple '@' separately.
Test for embedded spaces separately.

As an example,
When I test a textually provided date, I have a sequence of steps
1) is it in a prescribed format, e.g. yyyy-mm-dd = '....-..-..'
In this case
2) are the fields numeric?
3) is 01 LE mm LE 12
4) is 01 LE dd LE 31
5) for a given month, is dd within that month's proper range -- jan LT 32, apr LT 31; I have already determined it is GT 0.

This simplifies the parsing, and improves my error responses to being more specific to what is wrong, as opposed to just "invalid date".
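
The staged date check described above might look like the following in a DATA step. This is only a sketch: the data set name work.datecheck, the variables datetxt and problem, and the sample values are made up for illustration, and it assumes years that MDY accepts.

```sas
/* Hypothetical sketch: validate a text date in yyyy-mm-dd form one step at a time. */
data work.datecheck;
   length datetxt $10 problem $30;
   input datetxt $10.;
   problem = ' ';

   /* 1) prescribed layout: 10 characters with dashes in positions 5 and 8 */
   if length(datetxt) ne 10
      or substr(datetxt, 5, 1) ne '-'
      or substr(datetxt, 8, 1) ne '-' then problem = 'layout is not yyyy-mm-dd';

   /* 2) each field must contain digits only (NOTDIGIT returns 0 when it does) */
   else if notdigit(substr(datetxt, 1, 4))
        or notdigit(substr(datetxt, 6, 2))
        or notdigit(substr(datetxt, 9, 2)) then problem = 'fields are not numeric';

   else do;
      yyyy = input(substr(datetxt, 1, 4), 4.);
      mm   = input(substr(datetxt, 6, 2), 2.);
      dd   = input(substr(datetxt, 9, 2), 2.);

      /* 3) and 4) coarse range checks */
      if not (1 <= mm <= 12) then problem = 'month out of range';
      else if not (1 <= dd <= 31) then problem = 'day out of range';

      /* 5) compare dd with the last day of that month */
      else if dd > day(intnx('month', mdy(mm, 1, yyyy), 0, 'end'))
         then problem = 'day too large for that month';
   end;
   datalines;
2008-05-07
2008-13-01
2008-04-31
2008/05/07
20o8-05-07
;
run;
```

Each failing row ends up with the first reason it failed, which is the point of testing in stages rather than with one all-or-nothing pattern.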

To count the number of '@' that exist in a string, use either the SAS count or countc functions.
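
As a small illustration of the counting approach, assuming a hypothetical character variable email and made-up sample values: COUNTC counts characters taken from a list, while COUNT counts occurrences of a whole substring. For a single '@' the two give the same answer.

```sas
/* Hypothetical sketch: count '@' characters in a few sample values. */
data _null_;
   length email $40;
   do email = 'erese@mekel@sdfsd.com', 'my@my.com', 'myse.sere';
      n_at = countc(email, '@');   /* number of '@' characters in the value */
      put email= n_at=;
   end;
run;
```

A value would then be treated as suspect whenever n_at is anything other than 1.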

You can use INDEX, INDEXC, COUNT, COUNTC or ANYSPACE to identify spaces. ANYSPACE identifies white space -- tab, space, carriage return.
Message was edited by: Chuck

deleted_user · Posted 05-07-2008 01:07 PM

Thanks Chuck. I was just trying to be as efficient as possible, as I will be doing this for millions of addresses. I took your advice and now have it doing what I need.

Can anyone make it more efficient than this?

data work.testlht3;
set work.mytest;
if _n_=1 then do;
re= prxparse("/((\w|\.|\-)+@(\w|\.|\-)+\.(\w))+/");
end;
retain re;
if ^prxmatch(re,email) then LHT=0;
else LHT=1;
multiple_at=countc(email,'@');
badspace=ANYSPACE(email);
elength=length(email);
extraspace=(elength-badspace);
quotescan=index(email,"'");
if (quotescan>0 or extraspace>0 or lht=0 or multiple_at ne 1) then bademail=1;
else bademail=0;
run;
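
For comparison, one possible alternative is a single pattern anchored at both ends, so that a second @, an embedded space or a quote anywhere in the value makes the match fail. This is a sketch, not a tested replacement: the pattern and the output data set name work.testlht4 are assumptions, it is far looser than the full email address rules (for example it accepts consecutive dots), and whether it is faster on millions of rows would need to be timed.

```sas
/* Hypothetical sketch: one anchored pattern instead of several separate tests. */
data work.testlht4;
   set work.mytest;
   retain re;
   if _n_ = 1 then
      /* ^ ... $ force the whole value to match;                   */
      /* \s*$ tolerates the trailing blanks that pad the variable. */
      re = prxparse('/^[\w.-]+\@[\w-]+(\.[\w-]+)+\s*$/');
   bademail = (prxmatch(re, email) = 0);   /* 1 = no match, 0 = match */
run;
```

Against the test addresses in this thread, the pattern should reject the two-address value, the value with a space followed by text, the multiple @ value, myse@sere and the value containing a quote, while accepting addresses such as mh@nxe.ener.ds.com. Leading blanks would also cause a rejection.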

deleted_user · Posted 05-07-2008 02:53 PM

Be careful with the space thing.

I would use
[pre]
extraspace = anyspace(trim(left(email)));
[/pre]
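
The reason this matters is that a character variable is padded with trailing blanks, and ANYSPACE counts those as white space. A minimal sketch, with a hypothetical variable length of 20 and two of the test values from this thread:

```sas
/* Hypothetical sketch: ANYSPACE with and without the trailing pad blanks. */
data _null_;
   length email $20;
   do email = 'my@my.com', 'adsf@mekto.com adr';
      raw_pos     = anyspace(email);               /* sees the pad blanks */
      trimmed_pos = anyspace(trim(left(email)));   /* embedded blanks only */
      put email= raw_pos= trimmed_pos=;
   end;
run;
/* my@my.com          -> raw_pos=10 trimmed_pos=0  */
/* adsf@mekto.com adr -> raw_pos=15 trimmed_pos=15 */
```

With the trimmed form, any result greater than 0 means there is white space inside the address, so no length arithmetic is needed.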

You'll be surprised at how fast SAS can blow through a multi-million observation data set, even on a PC these days.

And, you are welcome.
