- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data temp;
input comments $50.;
datalines;
abc123xyz@gmail.com36uyt
wexyz@gmail.com890
subhaxyz@gmail.comsa
xyz@gmail.comgou
;
output should be:
output:
xyz@gmail.com
xyz@gmail.com
xyz@gmail.com
xyz@gmail.com
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi: Your task is going to be nearly impossible if there is nothing separating the email from the text or either side in the comment. It seems unlikely to me that there would be a comment string without any spaces at all in the comment and with an embedded email such as you show.
Emails do not always end in .com -- there can be .net or .biz or domains that end with a country (like yahoo.co.uk or something.ca)?
Since there is no restriction on the length of the first part of an email address, with your example and no spaces or delimiters around the mail, you don't know whether the email should be:
abc123xyz@gmail.com
bc123xyz@gmail.com
c123xyz@gmail.com
123xyz@gmail.com
23xyz@gmail.com
3xyz@gmail.com
xyz@gmail.com
yz@gmail.com
z@gmail.com
Any of those strings could be valid email addresses.
And, another challenge is that email domains are not always 2 strings after the @, like @gmail.com, there could also be several segments to the domain, as in: @mycompany.rr.com or @yahoo.co.uk or anything that is valid.
How are you getting the data? Is it possible to get the data with some kind of delimiter, even a space around the email address? To me, it seems more likely that you might have a comment like this:
abc123 xyz@gmail.com 36uyt
or like this:
some text commissioner@yahoo.co.uk more
or like this:
my email is: john.jacob.smith@wombat.rr.net and I like chocolate
and for any of those comments, since we know that there are not any spaces in an email address, then finding the email in the string is much more do-able if the comments have spaces or some kind of delimiter around the email as shown here:
For example, if there were spaces in the COMMENTS variable, then it is just a matter of using the SCAN and CATT functions to make the final email address from each string:
data temp;
length comments front back mailname domainchunk finalmail $70;
infile datalines dlm=',' dsd;
input linenum comments $;
loc_at = index(comments,'@');
** if loc_at gt 0 that means there is an email in the comment;
** What we know about email addresses: ;
** 1) there are not any spaces in an email address;
** 2) the @ sign separates the front email name from the domain;
** 3) if we find an @, then ;
** characters between the @ and space to the right is the domain;
** characters between the @ and the space to the left is the first part of the mail address;
** This approach only works if the email is separated in the string;
** with spaces from other text;
if loc_at gt 0 then do;
front = scan(comments,1,'@');
back = scan(comments,2,'@');
mailname = scan(front,-1,' ');
domainchunk = scan(back,1,' ');
finalmail = catt(mailname,'@',domainchunk);
end;
else do;
finalmail = 'No Mail address in Comments';
end;
return;
datalines;
1,"abc123 xyz@gmail.com 36uyt"
2,"we xyz@gmail.com 890"
3,"subha xyz@gmail.com sa"
4,"xyz@gmail.com gou"
5,"some text commissioner@yahoo.co.uk more"
6,"at end of string some.person@hotmail.com"
7,"there is not an email address in this comment"
8,"my email is: john.jacob.smith@wombat.rr.net and I like chocolate"
;
run;
proc print data=temp;
var linenum comments loc_at front back mailname domainchunk finalmail;
run;
But without spaces in the comment, I don't understand how you'd approach the extraction of the email address.
Cynthia
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data want (drop=comments);
set temp;
email=substr(comments,find(comments,'xyz@gmail.com'),length('xyz@gmail.com'));
run;
Is the email always going to be the same? If not, you could do something similar looking for the @ and then the first instance of .com (or .edu or whatever after that.)
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi: Your task is going to be nearly impossible if there is nothing separating the email from the text or either side in the comment. It seems unlikely to me that there would be a comment string without any spaces at all in the comment and with an embedded email such as you show.
Emails do not always end in .com -- there can be .net or .biz or domains that end with a country (like yahoo.co.uk or something.ca)?
Since there is no restriction on the length of the first part of an email address, with your example and no spaces or delimiters around the mail, you don't know whether the email should be:
abc123xyz@gmail.com
bc123xyz@gmail.com
c123xyz@gmail.com
123xyz@gmail.com
23xyz@gmail.com
3xyz@gmail.com
xyz@gmail.com
yz@gmail.com
z@gmail.com
Any of those strings could be valid email addresses.
And, another challenge is that email domains are not always 2 strings after the @, like @gmail.com, there could also be several segments to the domain, as in: @mycompany.rr.com or @yahoo.co.uk or anything that is valid.
How are you getting the data? Is it possible to get the data with some kind of delimiter, even a space around the email address? To me, it seems more likely that you might have a comment like this:
abc123 xyz@gmail.com 36uyt
or like this:
some text commissioner@yahoo.co.uk more
or like this:
my email is: john.jacob.smith@wombat.rr.net and I like chocolate
and for any of those comments, since we know that there are not any spaces in an email address, then finding the email in the string is much more do-able if the comments have spaces or some kind of delimiter around the email as shown here:
For example, if there were spaces in the COMMENTS variable, then it is just a matter of using the SCAN and CATT functions to make the final email address from each string:
data temp;
length comments front back mailname domainchunk finalmail $70;
infile datalines dlm=',' dsd;
input linenum comments $;
loc_at = index(comments,'@');
** if loc_at gt 0 that means there is an email in the comment;
** What we know about email addresses: ;
** 1) there are not any spaces in an email address;
** 2) the @ sign separates the front email name from the domain;
** 3) if we find an @, then ;
** characters between the @ and space to the right is the domain;
** characters between the @ and the space to the left is the first part of the mail address;
** This approach only works if the email is separated in the string;
** with spaces from other text;
if loc_at gt 0 then do;
front = scan(comments,1,'@');
back = scan(comments,2,'@');
mailname = scan(front,-1,' ');
domainchunk = scan(back,1,' ');
finalmail = catt(mailname,'@',domainchunk);
end;
else do;
finalmail = 'No Mail address in Comments';
end;
return;
datalines;
1,"abc123 xyz@gmail.com 36uyt"
2,"we xyz@gmail.com 890"
3,"subha xyz@gmail.com sa"
4,"xyz@gmail.com gou"
5,"some text commissioner@yahoo.co.uk more"
6,"at end of string some.person@hotmail.com"
7,"there is not an email address in this comment"
8,"my email is: john.jacob.smith@wombat.rr.net and I like chocolate"
;
run;
proc print data=temp;
var linenum comments loc_at front back mailname domainchunk finalmail;
run;
But without spaces in the comment, I don't understand how you'd approach the extraction of the email address.
Cynthia
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Cynthia,
I'm with you. I was thinking the same thing when I approached the question and I only answered the question at hand. My only thought is they have a list of known emails addresses and will end up looping through comments to determine if it exists anywhere in the string and aren't trying to pull out an assortment of email addresses. Because yeah, that will be impossible if the comment is structured like that and they don't have anything to notate the start and end. Thanks for posting the code you did. I'm going to grab that - I'm sure it will come in handy some day!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Cynthia
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Cynthia.
I just wanted to thank you for this post. I know it a bit dated now but it's exactly what I was looking for. Solved my problem perfectly.
Thank You.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks! I'm glad the code was useful to you. That is the beauty of SAS character functions, they just keep working.
Cynthia