My SAS table contains two columns: id and my_text. Each value of the my_text variable is a complete HTML string, something like this:
<html> . . .
<head> . . .
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;"
border="0" cellspacing="0" cellpadding="0" width="480">
<tr style="HEIGHT: 15.0pt;"> <td style="BORDER-BOTTOM: rgb(236,233,216);
BORDER-LEFT: rgb(236,233,216); BACKGROUND-COLOR: transparent;
WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">
So it contains <table>, <tr>, <td> . . .
and other complex HTML structures. How can I parse this kind of HTML
into plain text?
____________________________________________________________________________________________________________________
I need to get a SAS table like this
id | my_text | plain_text |
---|---|---|
1 | <html> . . . <head> . . . <meta blah ... blah ... blah ... | ONLY the "blah ... blah ... blah ... " part of my_text |
2 | ... | ... |
____________________________________________________________________________________________________________________
PS: I have looked everywhere online for good parsing code, but all the examples are very trivial. The following Perl regular expression works fine only for 5 bytes, so this approach is okay for very simple tags. In my case it is useless. Please help.
rx1=prxparse("s/<.*?>//");
call prxchange(rx1,99,my_text);
"PERL expression works fine only for 5 bytes."
What do you mean by it not working? Can't prxchange do that? Or did you try call prxnext()?
Thank you. I have a sample program. Here is what I tried:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
regex = prxparse('s/<\s+.*?>/ /');
call prxchange(regex,-1,text1);
put text1;
run;
But it did not work.
How about this:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
text2=prxchange('s/<[^<>]*>//',-1,text1);
run;
Xia Keshan
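For the original two-column table, the same pattern can be applied row by row. Below is a minimal sketch rather than a definitive solution: the data set names have and want, the length of 400, and the sample row are all assumptions made up for illustration. As for the earlier attempt, 's/<\s+.*?>/ /' matches nothing because \s+ demands at least one whitespace character right after <, which none of these tags have; [^<>]* has no such requirement.

```sas
/* Hypothetical input table; the names, length, and sample row are assumptions */
data have;
    length id 8 my_text $ 400;
    id = 1;
    my_text = '<html><head><title></title></head><body><p>Test</p><table border="0"><tr><td>blah</td></tr></table></body></html>';
    output;
run;

data want;
    set have;
    /* Declare the result length explicitly so long strings are not truncated */
    length plain_text $ 400;
    /* Replace every <...> tag with a blank; -1 means replace all matches */
    plain_text = prxchange('s/<[^<>]*>/ /', -1, my_text);
    /* Collapse runs of blanks and trim leading/trailing blanks */
    plain_text = strip(compbl(plain_text));
run;
```

Replacing each tag with a blank instead of an empty string keeps text from adjacent cells from running together, and COMPBL then collapses the extra blanks. Size the LENGTH of plain_text to the longest string you expect in my_text.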
Have you tried opening it with Excel and saving the output? Excel has a pretty decent XML/HTML parser, and you could then save the output in a delimited format to read into SAS. It's dependent on your HTML data, of course.