Text mining and content categorization

extracting part of a string

Super Contributor
Posts: 425
Hi,

Suppose I have a data set whose datalines look like this:

the first site is www.abc.com

www.123.com is the second website.

I am trying to figure out how to extract the websites, that is, the part of the string between (and including) "www." and ".com".

Thank you!


Accepted Solutions
Solution
‎02-13-2016 07:07 PM
Super Contributor
Posts: 490

Re: extracting part of a string

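One way:

data have;
input x $50.;   /* wide enough that no sample line is truncated */
cards;
the first site is www.abc.com
www.187878723.com is the second website.
www.computer.com is the second website.
;
run;

data want;
set have;
www=index(x, "www.");                 /* position of "www." in x, 0 if absent */
com=index(substr(x,www+4), ".com");   /* position of ".com", counted from just after "www." */
/* length = 4 for "www." + (com-1) characters in between + 4 for ".com" = com+7 */
if www and com then website=substr(x,www,com+7);
drop www com;
run;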
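
To see where the com+7 length comes from, it can help to keep the intermediate values instead of dropping them. This is only a sketch that assumes the have data set from the post above; the data set name check and the variable len are made up for illustration.

```sas
data check;
  set have;
  www = index(x, "www.");                 /* start of "www." in x, 0 if absent */
  com = index(substr(x, www+4), ".com");  /* start of ".com", counted from the character after "www." */
  /* 4 characters for "www." + (com-1) characters of name + 4 for ".com" */
  len = 4 + (com - 1) + 4;                /* same value as com+7 */
  if www and com then website = substr(x, www, len);
run;

proc print data=check;
  var x www com len website;
run;
```

For the first line, www is 19 and com is 4, so len is 11, which is exactly the length of www.abc.com.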



All Replies
Super Contributor
Posts: 490

Re: extracting part of a string

Updated to handle more cases.

Super Contributor
Posts: 425

Re: extracting part of a string

Hi Mohamed,

Thank you for answering my question, everything works nicely!

As a side note, if I add the dataline " a pseudo site www. name .com", the code still selects "www. name .com" into want, but that isn't a real website because of the space after "www." and before ".com".

Is there a way to avoid this by requiring a non-blank character right after "www." and right before ".com"?

Thank you!

Super Contributor
Posts: 490

Re: extracting part of a string

if www and com then website=compress(substr(x,www,com+7));
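
As a small illustration of what this line does to the pseudo site: COMPRESS called with a single argument removes all blanks from the string. The variable names raw and fixed below are made up for this sketch.

```sas
data _null_;
  raw = "www. name .com";
  fixed = compress(raw);   /* no second argument: COMPRESS removes blanks */
  put raw= fixed=;         /* writes both values to the log */
run;
```

Here fixed ends up as www.name.com, so the string is repaired rather than rejected.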
Super Contributor
Posts: 490

Re: extracting part of a string

Do you still want to extract it correctly, or do you want to treat it as wrong and neglect it?

Super Contributor
Posts: 425

Re: extracting part of a string

I ran the code with your new input and see that it corrects the string.

Could you please also show me how to neglect such a case?

Thank you!

Super Contributor
Posts: 490

Re: extracting part of a string

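data want;
set have;
www=index(x, "www.");
com=index(substr(x,www+4), ".com");
if www and com;   /* keep only rows where both markers were found, before calling SUBSTR */
website=substr(x,www,com+7);
/* a blank inside the extracted text means it is not a real address */
if index(trim(website),' ')>0 then website="";
drop www com;
run;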
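
A different technique from the INDEX/SUBSTR approach above is a Perl regular expression, which can state directly that only non-blank characters may sit between "www." and ".com". The following is a rough sketch, assuming the have data set from earlier in the thread; the data set name want_prx and the variables re, pos and len are made up for illustration.

```sas
data want_prx;
  set have;
  length website $50;
  retain re;
  /* compile the pattern once: "www.", one or more non-blank characters, ".com" */
  if _n_ = 1 then re = prxparse('/www\.\S+?\.com/');
  /* pos is the start of the first match (0 if none), len is its length */
  call prxsubstr(re, x, pos, len);
  if pos > 0 then website = substr(x, pos, len);
  drop re pos len;
run;
```

Unlike the subsetting IF version, this sketch keeps every row and simply leaves website blank when nothing matches, so a line such as " a pseudo site www. name .com" stays in the output with an empty website.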
☑ This topic is solved.
