deleted_user
Not applicable
Hi,
I am trying to find bad email addresses in a file, and I am using the following code:
data lines
adsf@mekto.com adre@mekto.com
adsf@mekto.com adr
erese@mekel@sdfsd.com
my@my.com
23432@295.com
my~@dsf.ca
myse@sere
@com.ec
fe!@d.com
adfe'j@cfe.ca
adfsd@nl.jjle.ca

data work.testlht3;
set work.mytest;
if _n_=1 then do;
re= prxparse("/((\w|\.|\-)+@(\w|\.|\-))+/");
end;
retain re;
if ^prxmatch(re,email) then LHT=0;
else LHT=1;
run;

This only flags my~@dsf.ca, @com.ec, and fe!@d.com.
My problem is that it does not pick up email addresses with multiple @ signs, or addresses that contain spaces.
These are the test addresses that I want flagged as wrong, but I can't seem to get the correct Perl expression:
adsf@mekto.com adre@mekto.com (has basically 2 addresses in the field)
adsf@mekto.com adr (has space and then text)
erese@mekel@sdfsd.com (multiple @)
myse@sere (no .com)
adfe'j@cfe.ca (there is a ' in this address)

I have also tried the following statement but it doesn't work the way I need either.
We do have valid addresses like mh@nxe.ener.ds.com not just mm@mse.com

prxparse('/ \w[-.\w]*\@[-\w]+(\.[-\w]+)*\.(ca|com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z]) /i');

Any ideas??
Thanks
deleted_user
Not applicable
The way to eat an elephant is one bite at a time.
Don't run yourself into the ground trying to make one thing do everything.
Test the pattern.
Test for multiple '@' separately.
Test for embedded spaces separately.

As an example,
When I test a textually provided date, I have a sequence of steps
1) is it in a prescribed format, e.g. yyyy-mm-dd = '....-..-..'
In this case
2) are the fields numeric?
3) is 01 LE mm LE 12
4) is 01 LE dd LE 31
5) for a given month, is dd within that month's proper range -- jan LT 32, apr LT 31; I have already determined it is GT 0.

This simplifies the parsing, and improves my error responses to being more specific to what is wrong, as opposed to just "invalid date".
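
The date sequence above might look something like this in a DATA step. This is only a sketch: the data set work.date_check, the variables datetxt and problem, the message texts, and the sample rows are all invented for illustration, and it assumes years that MDY accepts.

```sas
/* Sketch: one check per step, so each failure gets its own message. */
data work.date_check;
  length datetxt $10 yyyy $4 mm dd $2 problem $32;
  input datetxt $char10.;

  /* Pull the three fields out by position. */
  yyyy = substr(datetxt, 1, 4);
  mm   = substr(datetxt, 6, 2);
  dd   = substr(datetxt, 9, 2);

  /* ?? keeps the log quiet when a field is not a number. */
  year_n  = input(yyyy, ?? 4.);
  month_n = input(mm, ?? 2.);
  day_n   = input(dd, ?? 2.);

  /* 1) prescribed layout ....-..-.. */
  if substr(datetxt, 5, 1) ne '-' or substr(datetxt, 8, 1) ne '-' then
    problem = 'not in yyyy-mm-dd layout';
  /* 2) are the fields numeric? */
  else if notdigit(yyyy) or notdigit(mm) or notdigit(dd) then
    problem = 'non-numeric field';
  /* 3) 01 LE mm LE 12 */
  else if not (1 <= month_n <= 12) then
    problem = 'month out of range';
  /* 4) 01 LE dd LE 31 */
  else if not (1 <= day_n <= 31) then
    problem = 'day out of range';
  /* 5) dd within that month's length (last day of the month via INTNX) */
  else if day_n > day(intnx('month', mdy(month_n, 1, year_n), 0, 'end')) then
    problem = 'day too large for month';

  datalines;
2024-02-29
2023-02-29
2024-13-01
2024-04-31
20240401
;
run;
```

With these sample rows, only the first one should come through with a blank problem; each of the others trips a different step.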

To count the number of '@' that exist in a string, use either the SAS count or countc functions.
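
As a small illustration of the counting functions, using one of the test addresses from the question (the variable names at_chars, at_substr, and odd_chars are made up here):

```sas
data _null_;
  length email $40;
  email = 'erese@mekel@sdfsd.com';

  /* COUNTC counts characters from a list; COUNT counts a substring. */
  at_chars  = countc(email, '@');
  at_substr = count(email, '@');

  /* COUNTC also takes several characters at once, e.g. ones to reject. */
  odd_chars = countc(email, "'!~");

  put at_chars= at_substr= odd_chars=;
run;
```

For this address both counts of '@' come out as 2, so a test such as at_chars ne 1 would flag it.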

You can use INDEX, INDEXC, COUNT, COUNTC or ANYSPACE to identify spaces. ANYSPACE identifies white space such as tab, space, and carriage return.
Message was edited by: Chuck
deleted_user
Not applicable
Thanks Chuck. I was just trying to be as efficient as possible, as I will be doing this for millions of addresses. I took your advice and now have it doing what I need.

Can anyone make it more efficient than this?

data work.testlht3;
set work.mytest;
if _n_=1 then do;
re= prxparse("/((\w|\.|\-)+@(\w|\.|\-)+\.(\w))+/");
end;
retain re;
if ^prxmatch(re,email) then LHT=0;
else LHT=1;
multiple_at=countc(email,'@');
badspace=ANYSPACE(email);
elength=length(email);
extraspace=(elength-badspace);
quotescan=index(email,"'");
if (quotescan>0 or extraspace>0 or lht=0 or multiple_at ne 1) then bademail=1;
else bademail=0;
run;
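
One possible consolidation, offered as a sketch rather than a tested replacement: if the pattern is anchored with ^ and $, the whole field has to match, so a second address, an embedded space, an extra @, or a quote all fail the same test. The data set names work.email_samples and work.email_check are hypothetical, the sample rows are the test addresses from the question, and the character classes are an assumption about what counts as valid here (for example, \w also allows underscores).

```sas
/* Hypothetical sample data built from the addresses in the question. */
data work.email_samples;
  input email $char40.;
  datalines;
adsf@mekto.com adre@mekto.com
adsf@mekto.com adr
erese@mekel@sdfsd.com
my@my.com
23432@295.com
my~@dsf.ca
myse@sere
@com.ec
fe!@d.com
adfe'j@cfe.ca
adfsd@nl.jjle.ca
mh@nxe.ener.ds.com
;
run;

data work.email_check;
  set work.email_samples;
  retain re;
  if _n_ = 1 then
    /* local part, one @, domain labels with at least one dot;
       \s* before $ tolerates the trailing blanks that pad the variable */
    re = prxparse('/^[\w.\-]+@[\w\-]+(\.[\w\-]+)+\s*$/');
  bademail = (prxmatch(re, email) = 0);
  drop re;
run;
```

Leading blanks would also be treated as bad by this pattern. Keeping the separate COUNTC and ANYSPACE checks is still the better choice when you want to report which rule an address broke.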
deleted_user
Not applicable
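Be careful with the space thing. ANYSPACE also finds the trailing blanks that pad a character variable, so the result depends on how long the variable is.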

I would use
[pre]
extraspace = anyspace(trim(left(email)));
[/pre]
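
A quick way to see the difference, as a sketch with made-up variable names (exact is a second copy of the address stored in a variable with no room for padding):

```sas
data _null_;
  length email $20 exact $9;
  email = 'my@my.com';   /* 9 characters in a 20-byte variable */
  exact = 'my@my.com';   /* same value, variable completely filled */

  padded_hit  = anyspace(email);              /* 10: the first pad blank */
  exact_hit   = anyspace(exact);              /* 0: no blank anywhere */
  trimmed_hit = anyspace(trim(left(email)));  /* 0: padding removed first */

  email = 'a b@c.com';
  embedded_hit = anyspace(trim(left(email))); /* 2: a real embedded space */

  put padded_hit= exact_hit= trimmed_hit= embedded_hit=;
run;
```

With the trimmed form, any result greater than 0 means white space inside the address. The one exception is a completely blank value, which TRIM returns as a single blank.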

You'll be surprised at how fast SAS can blow through a multi-million observation data set, even on a PC these days.

And, you are welcome.
