piton
Calcite | Level 5

My SAS table contains two columns: id and my_text. Each observation of the my_text variable is a complete HTML string, something like this:

<html> . . .

  <head> . . .

  <meta name="generator" content="HTML Tidy, see www.w3.org" />

  <table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;"

         border="0" cellspacing="0" cellpadding="0" width="480">

  <tr style="HEIGHT: 15.0pt;"> <td style="BORDER-BOTTOM: rgb(236,233,216);

         BORDER-LEFT: rgb(236,233,216); BACKGROUND-COLOR: transparent;

         WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">

So it contains <table>, <tr>, <td> . . . and other complex HTML structures. How can I parse this kind of HTML into plain text?

____________________________________________________________________________________________________________________

I need to get a SAS table like this:

id | my_text                                                   | plain_text
1  | <html> . . . <head> . . . <meta blah ... blah ... blah ... | ONLY the "blah ... blah ... blah ... " part of my_text
2  | ...                                                       | ...

____________________________________________________________________________________________________________________

PS: I have looked everywhere online for good parsing code, but all the examples are very trivial. The following Perl expression works fine only for 5 bytes, so this approach is okay for very simple tags. In my case it is useless. Please help.

rx1=prxparse("s/<.*?>//");

call prxchange(rx1,99,my_text);

4 REPLIES
Ksharp
Super User

"PERL expression works fine only for 5 bytes."

What do you mean, it can't work? Can't prxchange do that? Or did you try call prxnext()?

piton
Calcite | Level 5

Thank you. I have a sample program. Here is what I tried:

data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>';
regex=prxparse('s/<\s+.*?>/ /');
call prxchange(regex,-1,text1);
put text1;
run;

But it did not work.

Ksharp
Super User

How about this:

data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>';
text2=prxchange('s/<[^<>]*>//',-1,text1);
run;
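
A note on why the earlier attempt matched nothing: the pattern `<\s+.*?>` requires at least one whitespace character right after `<`, so tags such as `<html>` or `<p>` never match. The character class `[^<>]*` has no such requirement. The sketch below applies the same idea to a whole table rather than a single literal. The data set names `have` and `want`, the sample row, and the `$500` lengths are assumptions for illustration only. As far as I know, a variable created by the PRXCHANGE function without a prior LENGTH statement defaults to 200 bytes, which may truncate long HTML, so the length is declared explicitly here.

```sas
/* Hypothetical input table: the name "have" and the sample row are assumptions. */
data have;
  length id 8 my_text $500;
  id = 1;
  my_text = '<html> <head> <title></title> </head> <body> <p>Test</p> <table border=''0''> <tr> <td>blah</td> </tr> </table> </body> </html>';
run;

data want;
  set have;
  /* Declare the result length up front; size it to fit the real data. */
  length plain_text $500;
  /* Replace every <...> tag with a blank; -1 means "replace all matches". */
  plain_text = prxchange('s/<[^<>]*>/ /', -1, my_text);
  /* Collapse the runs of blanks left behind and trim both ends. */
  plain_text = strip(compbl(plain_text));
run;

proc print data=want;
  var id plain_text;
run;
```

Replacing each tag with a blank instead of an empty string keeps words in adjacent cells from running together. This is only a regex-level cleanup: it does not decode entities such as `&nbsp;`, it keeps the text inside `<script>` or `<style>` elements, and it will misfire on a `>` that appears inside an attribute value.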

Xia Keshan

RW9
Diamond | Level 26

Have you tried opening it with Excel and saving the output? Excel has a pretty decent XML/HTML parser, and you could then save the output in delimited format to read into SAS. It depends on your HTML data, of course.
