DATA Step, Macro, Functions and more

How to parse html?

New Contributor
Posts: 2

My SAS table contains two columns: id and my_text. Each value of the my_text variable is a complete HTML string, something like this:

<html> . . .

  <head> . . .

  <meta name="generator" content="HTML Tidy, see www.w3.org" />

  <table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;"

         border="0" cellspacing="0" cellpadding="0" width="480">

  <tr style="HEIGHT: 15.0pt;"> <td style="BORDER-BOTTOM: rgb(236,233,216);

         BORDER-LEFT: rgb(236,233,216); BACKGROUND-COLOR: transparent;

         WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">

So it contains <table>, <tr>, <td> . . . and other complex HTML structures. How can I parse this kind of HTML into plain text?

____________________________________________________________________________________________________________________

I need to get a SAS table like this:

id | my_text | plain_text
1 | <html> . . . <head> . . . <meta blah ... blah ... blah ... | ONLY the "blah ... blah ... blah ..." part of my_text
2 | ... | ...

____________________________________________________________________________________________________________________

PS: I looked everywhere online for good parsing code, but all the examples are very trivial. The following PERL expression works fine only for 5 bytes. So this approach is okay for very simple tags. In my case it is useless. Please help.

rx1=prxparse("s/<.*?>//");

call prxchange(rx1,99,my_text);

Super User
Posts: 9,682

Re: How to parse html?

"PERL expression works fine only for 5 bytes."

What do you mean it can't work? prxchange can't do that? Or did you try call prxnext()?

New Contributor
Posts: 2

Re: How to parse html?

Thank you. I have a sample program. Here is what I tried:

data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>';
regex = prxparse('s/<\s+.*?>/ /');
call prxchange(regex,-1,text1);
put text1;
run;

But it did not work.

Super User
Posts: 9,682

Re: How to parse html?

How about this:

data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>';
text2=prxchange('s/<[^<>]*>//',-1,text1);
run;

Xia Keshan
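
The same pattern could be applied to the whole table from the original question to build the plain_text column. This is only a minimal sketch: the dataset names have and want are hypothetical, the 32767 length is an assumption, and the whitespace cleanup with compbl and strip is an addition that is not part of the reply above. Note that the earlier attempt, `<\s+.*?>`, requires at least one whitespace character immediately after `<`, which ordinary tags such as `<p>` do not have, so it matches nothing in the sample string.

```sas
/* Sketch only: "have" and "want" are hypothetical dataset names.  */
/* id and my_text are the columns described in the question.       */
data want;
  set have;
  /* Assumed length; declare it explicitly so the result does not  */
  /* depend on the default length given to the function's result.  */
  length plain_text $32767;
  /* Remove every <...> tag; -1 replaces all matches in the value. */
  plain_text = prxchange('s/<[^<>]*>//', -1, my_text);
  /* Collapse the runs of blanks left behind by the removed tags.  */
  plain_text = strip(compbl(plain_text));
run;
```

A tag-stripping regular expression like this does not decode entities such as `&nbsp;`, and it leaves the contents of `<script>` and `<style>` elements in place, so it may need extra steps depending on the HTML.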

Super User
Posts: 7,408

Re: How to parse html?

Have you tried opening it with Excel and saving the output? Excel has a pretty decent XML/HTML parser, and you could then save the output in delimited format to read into SAS. It's dependent on your HTML data, of course.
