BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
subhani4
Obsidian | Level 7

data temp;
input comments $50.;
datalines;
abc123xyz@gmail.com36uyt
wexyz@gmail.com890
subhaxyz@gmail.comsa
xyz@gmail.comgou
;

 

output should be:

 

output:
xyz@gmail.com

xyz@gmail.com

xyz@gmail.com

xyz@gmail.com

 

1 ACCEPTED SOLUTION

Accepted Solutions
Cynthia_sas
SAS Super FREQ

Hi: Your task is going to be nearly impossible if there is nothing separating the email from the text or either side in the comment. It seems unlikely to me that there would be a comment string without any spaces at all in the comment and with an embedded email such as you show.

Emails do not always end in .com -- there can be .net or .biz or domains that end with a country (like yahoo.co.uk or something.ca)?

  Since there is no restriction on the length of the first part of an email address, with your example and no spaces or delimiters around the mail, you don't know whether the email should be:
abc123xyz@gmail.com
bc123xyz@gmail.com
c123xyz@gmail.com
123xyz@gmail.com
23xyz@gmail.com
3xyz@gmail.com
xyz@gmail.com
yz@gmail.com
z@gmail.com

  Any of those strings could be valid email addresses.
 
And, another challenge is that email domains are not always 2 strings after the @, like @gmail.com, there could also be several segments to the domain, as in: @mycompany.rr.com or @yahoo.co.uk or anything that is valid.

  How are you getting the data? Is it possible to get the data with some kind of delimiter, even a space around the email address? To me, it seems more likely that you might have a comment like this:
abc123 xyz@gmail.com 36uyt
or like this:
some text commissioner@yahoo.co.uk more
or like this:
my email is: john.jacob.smith@wombat.rr.net and I like chocolate

and for any of those comments, since we know that there are not any spaces in an email address, then finding the email in the string is much more do-able if the comments have spaces or some kind of delimiter around the email as shown here:

parse_if_spaces.png

 For example, if there were spaces in the COMMENTS variable, then it is just a matter of using the SCAN and CATT functions to make the final email address from each string:

 

data temp;
length comments front back mailname domainchunk finalmail $70;
  infile datalines dlm=',' dsd;
  input linenum comments $;
  loc_at = index(comments,'@');
** if loc_at gt 0 that means there is an email in the comment;
** What we know about email addresses: ;
** 1) there are not any spaces in an email address;
** 2) the @ sign separates the front email name from the domain;
** 3) if we find an @, then ;
**    characters between the @ and space to the right is the domain;
**    characters between the @ and the space to the left is the first part of the mail address;
** This approach only works if the email is separated in the string;
** with spaces from other text;
  if loc_at gt 0 then do;
    front = scan(comments,1,'@');
    back = scan(comments,2,'@');
    mailname = scan(front,-1,' ');
    domainchunk = scan(back,1,' ');
    finalmail = catt(mailname,'@',domainchunk);
  end;
  else do;
    finalmail = 'No Mail address in Comments';
  end;
return;
datalines;
1,"abc123 xyz@gmail.com 36uyt"
2,"we xyz@gmail.com 890"
3,"subha xyz@gmail.com sa"
4,"xyz@gmail.com gou"
5,"some text commissioner@yahoo.co.uk more"
6,"at end of string some.person@hotmail.com"
7,"there is not an email address in this comment"
8,"my email is: john.jacob.smith@wombat.rr.net and I like chocolate"
;
run;

proc print data=temp;
  var linenum comments loc_at front back mailname domainchunk finalmail;
run;

 

   But without spaces in the comment, I don't understand how you'd approach the extraction of the email address.

 
Cynthia

View solution in original post

6 REPLIES 6
bobpep212
Quartz | Level 8
data want (drop=comments);
set temp;
email=substr(comments,find(comments,'xyz@gmail.com'),length('xyz@gmail.com'));
run;

Is the email always going to be the same? If not, you could do something similar looking for the @ and then the first instance of .com (or .edu or whatever after that.)

Cynthia_sas
SAS Super FREQ

Hi: Your task is going to be nearly impossible if there is nothing separating the email from the text or either side in the comment. It seems unlikely to me that there would be a comment string without any spaces at all in the comment and with an embedded email such as you show.

Emails do not always end in .com -- there can be .net or .biz or domains that end with a country (like yahoo.co.uk or something.ca)?

  Since there is no restriction on the length of the first part of an email address, with your example and no spaces or delimiters around the mail, you don't know whether the email should be:
abc123xyz@gmail.com
bc123xyz@gmail.com
c123xyz@gmail.com
123xyz@gmail.com
23xyz@gmail.com
3xyz@gmail.com
xyz@gmail.com
yz@gmail.com
z@gmail.com

  Any of those strings could be valid email addresses.
 
And, another challenge is that email domains are not always 2 strings after the @, like @gmail.com, there could also be several segments to the domain, as in: @mycompany.rr.com or @yahoo.co.uk or anything that is valid.

  How are you getting the data? Is it possible to get the data with some kind of delimiter, even a space around the email address? To me, it seems more likely that you might have a comment like this:
abc123 xyz@gmail.com 36uyt
or like this:
some text commissioner@yahoo.co.uk more
or like this:
my email is: john.jacob.smith@wombat.rr.net and I like chocolate

and for any of those comments, since we know that there are not any spaces in an email address, then finding the email in the string is much more do-able if the comments have spaces or some kind of delimiter around the email as shown here:

parse_if_spaces.png

 For example, if there were spaces in the COMMENTS variable, then it is just a matter of using the SCAN and CATT functions to make the final email address from each string:

 

data temp;
length comments front back mailname domainchunk finalmail $70;
  infile datalines dlm=',' dsd;
  input linenum comments $;
  loc_at = index(comments,'@');
** if loc_at gt 0 that means there is an email in the comment;
** What we know about email addresses: ;
** 1) there are not any spaces in an email address;
** 2) the @ sign separates the front email name from the domain;
** 3) if we find an @, then ;
**    characters between the @ and space to the right is the domain;
**    characters between the @ and the space to the left is the first part of the mail address;
** This approach only works if the email is separated in the string;
** with spaces from other text;
  if loc_at gt 0 then do;
    front = scan(comments,1,'@');
    back = scan(comments,2,'@');
    mailname = scan(front,-1,' ');
    domainchunk = scan(back,1,' ');
    finalmail = catt(mailname,'@',domainchunk);
  end;
  else do;
    finalmail = 'No Mail address in Comments';
  end;
return;
datalines;
1,"abc123 xyz@gmail.com 36uyt"
2,"we xyz@gmail.com 890"
3,"subha xyz@gmail.com sa"
4,"xyz@gmail.com gou"
5,"some text commissioner@yahoo.co.uk more"
6,"at end of string some.person@hotmail.com"
7,"there is not an email address in this comment"
8,"my email is: john.jacob.smith@wombat.rr.net and I like chocolate"
;
run;

proc print data=temp;
  var linenum comments loc_at front back mailname domainchunk finalmail;
run;

 

   But without spaces in the comment, I don't understand how you'd approach the extraction of the email address.

 
Cynthia

bobpep212
Quartz | Level 8

Hi Cynthia, 

I'm with you. I was thinking the same thing when I approached the question and I only answered the question at hand. My only thought is they have a list of known emails addresses and will end up looping through comments to determine if it exists anywhere in the string and aren't trying to pull out an assortment of email addresses. Because yeah, that will be impossible if the comment is structured like that and they don't have anything to notate the start and end.  Thanks for posting the code you did. I'm going to grab that - I'm sure it will come in handy some day!

subhani4
Obsidian | Level 7

Thanks Cynthia

Jeff_DOC
Pyrite | Level 9

Hi Cynthia.

 

I just wanted to thank you for this post. I know it a bit dated now but it's exactly what I was looking for. Solved my problem perfectly.

 

Thank You.

Cynthia_sas
SAS Super FREQ
Hi:
Thanks! I'm glad the code was useful to you. That is the beauty of SAS character functions, they just keep working.
Cynthia

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 4734 views
  • 2 likes
  • 4 in conversation