My SAS table contains two columns: id and my_text. Each value of the my_text variable is a complete HTML string, something like this:
<html> . . .
<head> . . .
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;"
border="0" cellspacing="0" cellpadding="0" width="480">
<tr style="HEIGHT: 15.0pt;"> <td style="BORDER-BOTTOM: rgb(236,233,216);
BORDER-LEFT: rgb(236,233,216); BACKGROUND-COLOR: transparent;
WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">
So it contains <table>, <tr>, <td>, . . .
and other complex HTML structures. How can I parse this kind of HTML
into plain text?
____________________________________________________________________________________________________________________
I need to get a SAS table like this
id | my_text | plain_text |
---|---|---|
1 | <html> . . . <head> . . . <meta blah ... blah ... blah ... | ONLY the "blah ... blah ... blah ... " part of my_text |
2 | ... | ... |
____________________________________________________________________________________________________________________
PS: I looked everywhere online for good parsing code, but all the examples are very trivial. The following Perl regular expression works fine only for 5 bytes, so this approach is okay only for very simple tags. In my case it is useless. Please help.
rx1=prxparse("s/<.*?>//");
call prxchange(rx1,99,my_text);
"PERL expression works fine only for 5 bytes."
What do you mean by it can't work? Can't prxchange do that? Or did you try call prxnext()?
Thank you. I have a sample program. Here is what I tried:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
regex = prxparse('s/<\s+.*?>/ /');
call prxchange(regex,-1,text1);
put text1;
run;
But it did not work.
How about this:
data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
text2=prxchange('s/<[^<>]*>//',-1,text1);
run;
Xia Keshan
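The same pattern can be applied to the whole table from the original question. The following is only a minimal sketch under stated assumptions: the dataset names have and want, the sample rows, and the $2000 lengths are hypothetical, and the tag is replaced with a blank (rather than nothing) so that words from adjacent cells do not run together. The length of plain_text is declared up front because a variable first created from the PRXCHANGE function otherwise gets a default length of 200, which would truncate long strings.

```sas
/* Hypothetical input table with the two columns from the question. */
data have;
  length id 8 my_text $ 2000;
  id = 1;
  my_text = '<html><head><title></title></head><body><p>Test</p><table border="0"><tr><td>blah</td></tr></table></body></html>';
  output;
  id = 2;
  my_text = '<p>Second</p><p>row</p>';
  output;
run;

data want;
  set have;
  /* Declare the length first; otherwise the PRXCHANGE result defaults to $200. */
  length plain_text $ 2000;
  /* -1 means "replace every match": each <...> tag becomes a single blank. */
  plain_text = prxchange('s/<[^<>]*>/ /', -1, my_text);
  /* Collapse runs of blanks and drop leading/trailing blanks. */
  plain_text = strip(compbl(plain_text));
run;

proc print data=want;
  var id plain_text;
run;
```

This only strips tags. It does not decode entities such as &amp;, and it does not drop the contents of <script> or <style> elements, so real-world HTML may need extra steps.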
Have you tried opening it in Excel and saving the output? Excel has a pretty decent XML/HTML parser, and you could then save the output in delimited format to read into SAS. It depends on your HTML data, of course.