Hi everyone, I am trying to extract an article from a website. The web page has its paragraphs in the following structure:
<p>Our
universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction.
Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.</p>
How do I extract the content between the HTML tags?
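If each tag always sits on a line of its own, it's easy:
data want;
  infile datalines truncover;
  retain flag 0;                      /* 1 while inside a <p> ... </p> block */
  input line $200.;
  if line = "<p>" then flag = 1;      /* opening tag: start collecting */
  else do;
    if line = "</p>" then flag = 0;   /* closing tag: stop collecting */
    if flag then output;              /* keep only the lines between the tags */
  end;
  drop flag;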
datalines;
<p>
Our universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction. Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.
</p>
;
I modified the code. It now also handles lines where <p> is not at position 1 or </p> is not at the end of the line, as well as lines that contain both tags. It does not handle more than one paragraph on a single input line:
data want;
  infile datalines truncover;
  retain flag 0;                        /* 1 while inside a <p> ... </p> block */
  input line $200.;
  pos = index(line, '<p>');
  if pos then do;                       /* opening tag: keep only the text after it */
    flag = 1;
    line = substrn(line, pos + 3);
  end;
  pos = index(line, '</p');
  if pos then do;                       /* closing tag: keep only the text before it */
    line = substrn(line, 1, pos - 1);
    if flag and line ne ' ' then output;
    flag = 0;
  end;
  else if flag and line ne ' ' then output;
  drop flag pos;
datalines;
some uninteresting text
<p>Our
universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction. Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.</p>
xxx<p>this is a test</p>yyy
more uninteresting text
xxx<p>another
test</p>zzz
even more uninteresting text
;
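The limitation mentioned above (several paragraphs on one input line) could be lifted by scanning each line in a loop. The following is only a minimal sketch, not part of the original answer: the variable names `inpara` and `text` and the sample data lines are made up for illustration, and it still assumes lowercase `<p>` tags without attributes and lines of at most 200 characters.

```sas
data want;
  infile datalines truncover;
  length text $200;
  retain inpara 0;                       /* 1 while inside a <p> ... </p> block */
  input line $200.;
  /* consume the line piece by piece until nothing is left */
  do while (line ne ' ');
    if inpara then do;
      pos = index(line, '</p>');
      if pos = 0 then do;                /* paragraph continues on the next input line */
        text = line;
        output;
        line = ' ';
      end;
      else do;                           /* paragraph ends on this line */
        if pos > 1 then do;
          text = substrn(line, 1, pos - 1);
          output;
        end;
        line = substrn(line, pos + 4);   /* keep scanning what follows </p> */
        inpara = 0;
      end;
    end;
    else do;
      pos = index(line, '<p>');
      if pos = 0 then line = ' ';        /* no further paragraph starts on this line */
      else do;
        line = substrn(line, pos + 3);   /* keep scanning what follows <p> */
        inpara = 1;
      end;
    end;
  end;
  keep text;
datalines;
xxx<p>first</p>yyy<p>second</p>zzz
<p>third
continues</p><p>fourth</p>
;
```

With these sample lines, `want` should contain one row each for first, second, third, continues and fourth. SUBSTRN is used instead of SUBSTR so that a tag at the very end of the line does not trigger an invalid-argument note.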
Then remove these HTML tags:
data want;
  infile cards truncover;
  input line $200.;
  /* delete every <...> tag; -1 means replace all matches */
  want = prxchange('s/<.+?>//o', -1, line);
datalines;
<p>Our universal abuse of natural resources has created an imbalance in nature, contributing to the beginning of extinction.
Over the last 100 years over 500 species have already gone extinct.
If we do not act now many more animals face extinction over the next 30 years including Orangutans, Rhinos, Polar Bears, Gorillas, Gibbons, Chimpanzees to name a few.</p>
;
proc print; run;
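The `?` in `<.+?>` matters: it makes the match lazy, so each tag is removed on its own. A small sketch that compares it with the greedy form, using one of the test lines from earlier in the thread (the variable names `lazy` and `greedy` are just for illustration):

```sas
data _null_;
  line = 'xxx<p>this is a test</p>yyy';
  /* lazy: each match stops at the first closing > */
  lazy   = prxchange('s/<.+?>//', -1, line);
  /* greedy: one match runs from the first < to the last > */
  greedy = prxchange('s/<.+>//', -1, line);
  put lazy=;     /* lazy=xxxthis is a testyyy */
  put greedy=;   /* greedy=xxxyyy */
run;
```

Note that the regex approach only strips the tags themselves. Text outside the paragraph (`xxx` and `yyy` here) stays in the result, unlike with the flag-based data steps.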