kaziumair
Quartz | Level 8

Hi everyone, I am trying to extract an article from a website. The web page has its paragraphs in the following structure:

<p>Our

universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction.

Over the last 100 years over 500 species have already gone extinct.

If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.</p>

How do I extract the content within the HTML tags?

Accepted Solutions
Kurt_Bremser
Super User

I modified the code. It now also handles lines where <p> is not at position 1, lines where </p> is not at the end of the line, and lines where both tags appear. It does not handle more than one paragraph on a single input line:

data want;
infile datalines truncover;
retain flag 0;                  /* 1 while we are inside a paragraph */
input line $200.;               /* longer lines are truncated at 200 characters */
/* opening tag: keep only the text that follows it */
pos = index(line,'<p>');
if pos
then do;
  flag = 1;
  line = substr(line,pos + 3);
end;
/* closing tag: keep only the text before it, then leave the paragraph */
pos = index(line,'</p>');
if pos
then do;
  /* substrn accepts a zero length, so a line starting with </p> is safe */
  line = substrn(line,1,pos - 1);
  output;
  flag = 0;
end;
else if flag then output;       /* line in the middle of a paragraph */
drop flag pos;
datalines;
some uninteresting text
<p>Our
universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction. Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.</p>
xxx<p>this is a test</p>yyy
more uninteresting text
xxx<p>another
test</p>zzz
even more uninteresting text
;
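
If several paragraphs can share one input line, the step above is not enough. One possible extension is to scan each line in a loop. This is only a sketch under assumptions: the variable names inside, start and text are made up for illustration, tags are assumed to be written exactly as <p> and </p> (lower case, no attributes), and lines are assumed to fit in 200 characters.

```sas
data want;
  infile datalines truncover;
  length line text $200;
  retain inside 0;                 /* 1 while inside a paragraph */
  input line $char200.;
  start = 1;                       /* next position to examine on this line */
  do while (start <= length(line));
    if inside then do;
      pos = find(line, '</p>', start);
      if pos then do;
        /* text up to the closing tag; substrn tolerates a zero length */
        text = substrn(line, start, pos - start);
        output;
        inside = 0;
        start = pos + 4;           /* continue after </p> */
      end;
      else do;
        /* paragraph continues on the next line */
        text = substr(line, start);
        output;
        start = length(line) + 1;
      end;
    end;
    else do;
      pos = find(line, '<p>', start);
      if pos then do;
        inside = 1;
        start = pos + 3;           /* continue after <p> */
      end;
      else start = length(line) + 1;   /* nothing of interest left */
    end;
  end;
  keep text;
datalines;
aaa<p>first</p>bbb<p>second</p>ccc
xxx<p>third
continues</p>yyy
;
```

Each output row holds one piece of paragraph text. A paragraph that spans several lines still produces one row per line, as in the step above.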

 

6 REPLIES
Kurt_Bremser
Super User

If each tag is always on a line of its own, it's easy:

data want;
infile datalines truncover;
retain flag 0;                  /* 1 while we are between <p> and </p> */
input line $200.;
if line = "<p>" then flag = 1;  /* the tag line itself is not output */
else do;
  if line = "</p>" then flag = 0;
  if flag then output;          /* keep only lines inside the paragraph */
end;
drop flag;
datalines;
<p>
Our universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction. Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.
</p>
;
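
The same flag logic can be pointed at the page itself instead of datalines. The following is a rough sketch, not a tested recipe: the fileref name page and the URL are placeholders, the 2000-character line length is an arbitrary assumption, and it still assumes each tag sits on a line of its own.

```sas
/* Placeholder URL: replace it with the page you actually want to read */
filename page url "https://example.com/article.html";

data want;
  infile page truncover lrecl=32767;   /* allow long HTML lines */
  length line $2000;                   /* assumed maximum; longer lines are truncated */
  retain flag 0;
  input line $char2000.;
  line = strip(line);                  /* ignore indentation around the tags */
  if line = "<p>" then flag = 1;
  else do;
    if line = "</p>" then flag = 0;
    if flag then output;
  end;
  drop flag;
run;

filename page clear;
```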
kaziumair
Quartz | Level 8
Hi, sorry, I made a mistake in the example. The tags are not on individual lines; the paragraphs look like the following:
<p>Our
universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction. Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.</p>
kaziumair
Quartz | Level 8
Thanks a lot, it worked
Ksharp
Super User

Then remove the HTML tags:

data want;
infile datalines truncover;
input line $200.;
/* the non-greedy pattern removes every <...> tag on the line, not only <p> and </p> */
want=prxchange('s/<.+?>//o',-1,line);
datalines;
<p>Our
universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction. Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.</p>
;

proc print;run;
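
If only the paragraph tags should go and other markup should stay, the pattern can be narrowed. This is a small sketch rather than a general HTML cleaner: the sample string and the variable name clean are invented for illustration, and the pattern assumes each tag is complete on one line.

```sas
data demo;
  length line clean $200;
  line = 'Intro <P class="lead">Keep <b>bold</b> text</p> outro';
  /* <\/?p\b[^>]*> matches <p>, </p> and <p ...attributes>; /i ignores case */
  clean = prxchange('s/<\/?p\b[^>]*>//i', -1, line);
  put clean=;   /* writes clean=Intro Keep <b>bold</b> text outro to the log */
run;
```

The \b word boundary keeps the pattern from touching tags that merely start with p, such as <pre>.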
kaziumair
Quartz | Level 8
Hi, thanks for taking the time to help.
