My SAS table contains two columns: id and my_text. Each value of the my_text variable is a complete HTML string, something like this:
<html> . . .
<head> . . .
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;"
border="0" cellspacing="0" cellpadding="0" width="480">
<tr style="HEIGHT: 15.0pt;"> <td style="BORDER-BOTTOM: rgb(236,233,216);
BORDER-LEFT: rgb(236,233,216); BACKGROUND-COLOR: transparent;
WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">
So it contains <table>, <tr>, <td> . . . and other complex HTML structures. How can I parse this kind of HTML into plain text?
____________________________________________________________________________________________________________________
I need to get a SAS table like this
| id | my_text | plain_text |
|---|---|---|
| 1 | <html> . . . <head> . . . <meta blah ... blah ... blah ... | ONLY the "blah ... blah ... blah ... " part of my_text |
| 2 | ... | ... |
____________________________________________________________________________________________________________________
PS: I looked everywhere online for good parsing code, but all the examples are very trivial. The following PERL expression works fine only for 5 bytes. So this approach is okay for very simple tags, but in my case it is useless. Please help.
rx1=prxparse("s/<.*?>//");
call prxchange(rx1,99,my_text);
"PERL expression works fine only for 5 bytes."
What do you mean, it can't work? Can't prxchange do that? Or did you try call prxnext()?
Thank you. I have a sample program. Here is what I tried:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
regex = prxparse('s/<\s+.*?>/ /');
call prxchange(regex,-1,text1);
put text1;
run;
But it did not work.
How about this:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
text2=prxchange('s/<[^<>]*>//',-1,text1);
run;
Xia Keshan
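Applied to the original table, the suggestion above could look like the sketch below. The earlier attempt with `<\s+.*?>` cannot match `<html>` or `<p>` because `\s+` demands at least one whitespace character right after the `<`. The dataset names `have` and `want`, the sample row, and the `$2000` lengths are assumptions made for illustration, not details from the question. The explicit LENGTH statement matters: a variable that first receives the result of PRXCHANGE in a DATA step is otherwise given a length of 200 bytes, which would truncate long text.

```sas
/* Hypothetical stand-in for the real table with columns id and my_text. */
data have;
  length id 8 my_text $2000;
  id = 1;
  my_text = '<html><head><title></title></head><body><p>Test</p>' ||
            '<table border="0" width="480"><tr><td>blah</td></tr></table>' ||
            '</body></html>';
  output;
run;

data want;
  set have;
  /* Declare the length before the assignment so the result is not
     limited to the 200-byte default. $2000 is an assumed size. */
  length plain_text $2000;
  /* Replace every <...> tag with a blank; -1 means replace all matches. */
  plain_text = prxchange('s/<[^<>]*>/ /', -1, my_text);
  /* Trim the ends and collapse runs of blanks into a single blank. */
  plain_text = compbl(strip(plain_text));
run;
```

This only strips tags. It does not decode entities such as `&nbsp;`, and it will misbehave if an attribute value itself contains a `>` character, so it is a starting point rather than a full HTML parser.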
Have you tried opening it with Excel and saving the output? Excel has a pretty decent XML/HTML parser, and you could then save the output in a delimited format to read into SAS. It depends on your HTML data, of course.