My SAS table contains two columns: id and my_text. Each observation of the my_text variable is a complete HTML string, something like this:
<html> . . .
<head> . . .
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;"
border="0" cellspacing="0" cellpadding="0" width="480">
<tr style="HEIGHT: 15.0pt;"> <td style="BORDER-BOTTOM: rgb(236,233,216);
BORDER-LEFT: rgb(236,233,216); BACKGROUND-COLOR: transparent;
WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">
So it contains <table>, <tr>, <td> . . .
and other complex HTML structures. How can I parse this kind of HTML
into plain text?
____________________________________________________________________________________________________________________
I need to get a SAS table like this
id | my_text | plain_text |
---|---|---|
1 | <html> . . . <head> . . . <meta blah ... blah ... blah ... | ONLY the "blah ... blah ... blah ... " part of my_text |
2 | ... | ... |
____________________________________________________________________________________________________________________
PS: I looked everywhere online for good parsing code, but all the examples are very trivial. The following Perl regular expression works only for 5 bytes, so this approach is okay for very simple tags but useless in my case. Please help.
rx1=prxparse("s/<.*?>//");
call prxchange(rx1,99,my_text);
"PERL expression works fine only for 5 bytes."
What do you mean by "can't work"? PRXCHANGE can't do that? Or did you try call prxnext()?
Thank you. I have a sample program. Here is what I tried:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
regex = prxparse('s/<\s+.*?>/ /');
call prxchange(regex,-1,text1);
put text1;
run;
But it did not work.
How about this:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
text2=prxchange('s/<[^<>]*>//',-1,text1);
run;
Xia Keshan
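For reference, the earlier pattern `<\s+.*?>` requires at least one whitespace character right after `<`, which never occurs in tags such as `<html>`, so it matches nothing. The character-class pattern `<[^<>]*>` does not have that restriction. A minimal sketch of applying it to the whole table might look like the following. The dataset names `have` and `want` are hypothetical, the length of 32767 is an assumption about how long the stripped text can get, and the whitespace cleanup and entity decoding steps are additions that the replies above do not include.

```sas
/* Sketch only: HAVE and WANT are hypothetical dataset names. */
data want;
    set have;                          /* assumed to contain id and my_text */

    /* Declare the result first; otherwise PRXCHANGE output may be truncated. */
    /* 32767 is an assumption (the maximum length of a character variable).   */
    length plain_text $32767;

    /* Replace every <...> tag with a blank; -1 means "all occurrences". */
    plain_text = prxchange('s/<[^<>]*>/ /', -1, my_text);

    /* Collapse the runs of whitespace left behind by the removed tags. */
    plain_text = strip(prxchange('s/\s+/ /', -1, plain_text));

    /* Optional: decode entities such as &amp; after the tags are gone. */
    plain_text = htmldecode(plain_text);
run;
```

This keeps id and my_text and adds plain_text, matching the layout asked for in the question. A regular expression like this does not understand HTML, so the contents of `<script>` or `<style>` blocks, or a literal `>` inside an attribute value, would still need separate handling.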
Have you tried opening it with Excel and saving the output? Excel has a pretty decent XML/HTML parser, and you could then save the output in delimited format to read into SAS. It depends on your HTML data, of course.