Finding bad email address with prxparse how do you stop multiple @

Hi,
I am trying to find bad email addresses in a file, using the data lines and code below.
Data lines:
adsf@mekto.com adre@mekto.com
adsf@mekto.com adr
erese@mekel@sdfsd.com
my@my.com
23432@295.com
my~@dsf.ca
myse@sere
@com.ec
fe!@d.com
adfe'j@cfe.ca
adfsd@nl.jjle.ca

data work.testlht3;
   set work.mytest;
   if _n_=1 then do;
      re = prxparse("/((\w|\.|\-)+@(\w|\.|\-))+/");
   end;
   retain re;
   if ^prxmatch(re,email) then LHT=0;
   else LHT=1;
run;

This only finds problems with my~@dsf.ca, @com.ec and fe!@d.com.
My problem is that it does not pick up email addresses with multiple @ signs, or addresses that contain spaces.
These are the test addresses I want flagged as wrong, but I can't seem to get the Perl expression right:
adsf@mekto.com adre@mekto.com (has basically 2 addresses in the field)
adsf@mekto.com adr (has space and then text)
erese@mekel@sdfsd.com (multiple @)
myse@sere (no .com)
adfe'j@cfe.ca (there is a ' in this address)

I have also tried the following statement, but it doesn't work the way I need either.
We do have valid addresses like mh@nxe.ener.ds.com, not just mm@mse.com.

prxparse('/ \w[-.\w]*\@[-\w]+(\.[-\w]+)*\.(ca|com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z]) /i');

Any ideas??
Thanks

Re: Finding bad email address with prxparse how do you stop multiple @

The way to eat an elephant is one bite at a time.
Don't run yourself into the ground trying to make one thing do everything.
Test the pattern.
Test for multiple '@' separately.
Test for embedded spaces separately.

As an example, when I test a textually provided date, I have a sequence of steps:
1) Is it in a prescribed format, e.g. yyyy-mm-dd = '....-..-..'?
For this format:
2) Are the fields numeric?
3) Is 01 LE mm LE 12?
4) Is 01 LE dd LE 31?
5) For a given month, is dd within that month's proper range (Jan LT 32, Apr LT 31)? I have already determined it is GT 0.

This simplifies the parsing and makes my error responses specific about what is wrong, as opposed to just "invalid date".
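
The stepwise checks above could be sketched roughly like this. This is only an illustration: the data set work.datecheck and the variables DATETXT and REASON are made-up names, and step 5 hands the month-length check to the YYMMDD10. informat instead of a hand-written table of month lengths.

```sas
/* Stepwise date check; DATETXT, REASON and work.datecheck are hypothetical names. */
data work.datecheck;
   length datetxt $10 reason $32;
   input datetxt $10.;

   /* 1) layout: ten characters with hyphens in positions 5 and 8 */
   if length(datetxt) ne 10
      or substr(datetxt, 5, 1) ne '-'
      or substr(datetxt, 8, 1) ne '-' then reason = 'not in yyyy-mm-dd layout';

   /* 2) every field must be all digits (NOTDIGIT returns 0 when it finds no non-digit) */
   else if notdigit(substr(datetxt, 1, 4))
        or notdigit(substr(datetxt, 6, 2))
        or notdigit(substr(datetxt, 9, 2)) then reason = 'non-numeric field';

   else do;
      mm = input(substr(datetxt, 6, 2), 2.);
      dd = input(substr(datetxt, 9, 2), 2.);
      /* 3) and 4) simple range checks */
      if not (1 le mm le 12) then reason = 'month out of range';
      else if not (1 le dd le 31) then reason = 'day out of range';
      /* 5) assumption: let the informat decide whether dd fits the month;
            ?? suppresses the log message and yields missing for a bad date */
      else if missing(input(datetxt, ?? yymmdd10.)) then reason = 'day not valid for month';
      else reason = 'ok';
   end;
   datalines;
2023-02-28
2023-02-30
2023-13-01
2023-1x-01
20230228
;
run;
```

Each bad row ends up with its own REASON instead of a generic failure. Note that a year outside the range SAS dates support would also be caught at step 5 in this sketch.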

To count the number of '@' characters in a string, use either the SAS COUNT or COUNTC function.
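
A minimal sketch of the '@' count, using one of the test addresses from the question. The variable N_AT is a made-up name. COUNTC counts characters from a list, while COUNT counts occurrences of a substring; for a single character such as '@' either should give the same answer.

```sas
data _null_;
   length email $40;
   email = 'erese@mekel@sdfsd.com';
   n_at = countc(email, '@');   /* number of @ characters in the value */
   put email= n_at=;
   if n_at ne 1 then put 'Rejected: expected exactly one @';
run;
```

For this address N_AT comes out as 2, so the rejection message is written to the log.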

You can use INDEX, INDEXC, COUNT, COUNTC or ANYSPACE to identify spaces. ANYSPACE identifies white space such as tab, space and carriage return.
Message was edited by: Chuck
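
To illustrate the difference between looking for a blank and looking for any white space, here is a small sketch. The variable names are made up, and the value is built so that it fills the whole variable, which keeps trailing padding out of the picture.

```sas
data _null_;
   length s $5;
   s = 'ab' || '09'x || 'cd';    /* a tab between ab and cd; fills all 5 bytes */
   blank_pos = index(s, ' ');    /* position of the first blank, 0 if none */
   ws_pos    = anyspace(s);      /* position of the first white-space character, 0 if none */
   put blank_pos= ws_pos=;
run;
```

Here INDEX finds nothing because the value contains a tab rather than a blank, while ANYSPACE reports position 3.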

Re: Finding bad email address with prxparse how do you stop multiple @

Thanks Chuck. I was just trying to be as efficient as possible, as I will be doing this for millions of addresses. I took your advice and now have it doing what I need.

Can anyone make it more efficient than this?

data work.testlht3;
   set work.mytest;
   if _n_=1 then do;
      re = prxparse("/((\w|\.|\-)+@(\w|\.|\-)+\.(\w))+/");
   end;
   retain re;
   if ^prxmatch(re,email) then LHT=0;
   else LHT=1;
   multiple_at=countc(email,'@');
   badspace=ANYSPACE(email);
   elength=length(email);
   extraspace=(elength-badspace);
   quotescan=index(email,"'");
   if (quotescan>0 or extraspace>0 or lht=0 or multiple_at ne 1) then bademail=1;
   else bademail=0;
run;

Re: Finding bad email address with prxparse how do you stop multiple @

Be careful with the space thing.

I would use
[pre]
extraspace = anyspace(trim(left(email)));
[/pre]
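
A small sketch of why the trimming matters. A character variable is padded with trailing blanks up to its declared length, and ANYSPACE counts that padding as white space. The $30 length and the variable names RAW_POS and TRIMMED_POS are assumptions for the example only.

```sas
data _null_;
   length email $30;
   do email = 'my@my.com', 'adsf@mekto.com adr', '  my@my.com';
      raw_pos     = anyspace(email);              /* padding or leading blanks are found too */
      trimmed_pos = anyspace(trim(left(email)));  /* 0 unless white space is embedded in the value */
      put email= raw_pos= trimmed_pos=;
   end;
run;
```

For 'my@my.com' the raw call points at the first padding blank (position 10 in a $30 variable), while the trimmed call returns 0. Only 'adsf@mekto.com adr' gives a non-zero TRIMMED_POS, so a simple test of trimmed_pos > 0 is enough to flag embedded spaces.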

You'll be surprised at how fast SAS can blow through a multi-million observation data set, even on a PC these days.

And, you are welcome.