piton
Calcite | Level 5

My SAS table contains two columns: id and my_text. Each observation of the my_text variable is a complete HTML string, something like this:

<html> . . .

  <head> . . .

  <meta name="generator" content="HTML Tidy, see www.w3.org" />

  <table style="WIDTH: 360.0pt;BORDER-COLLAPSE: collapse;"

         border="0" cellspacing="0" cellpadding="0" width="480">

  <tr style="HEIGHT: 15.0pt;"> <td style="BORDER-BOTTOM: rgb(236,233,216);

         BORDER-LEFT: rgb(236,233,216); BACKGROUND-COLOR: transparent;

         WIDTH: 360.0pt;HEIGHT: 15.0pt; " width="480">

So it contains <table>, <tr>, <td> . . . and other complex HTML structures. How can I parse this kind of HTML into plain text?

____________________________________________________________________________________________________________________

I need to get a SAS table like this

id | my_text | plain_text
1 | <html> . . . <head> . . . <meta blah ... blah ... blah ... | ONLY the "blah ... blah ... blah ... " part of my_text
2 | ... | ...

____________________________________________________________________________________________________________________

PS: I have looked everywhere online for good parsing code, but all the examples are very trivial. The following PERL expression works fine only for 5 bytes. So this approach is okay for very simple tags. In my case it is useless. Please help.

rx1=prxparse("s/<.*?>//");

call prxchange(rx1,99,my_text);

Ksharp
Super User

"PERL expression works fine only for 5 bytes."

What do you mean by that? Can't prxchange do that? Or did you try call prxnext()?

piton
Calcite | Level 5

Thank you. I have a sample program. Here is what I tried:

data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
regex = prxparse('s/<\s+.*?>/ /');
call prxchange(regex, -1, text1);
put text1;
run;

But it did not work.

Ksharp
Super User

How about this:

data t;
text1='<html> <head> <meta name=''generator'' content=''HTML Tidy, see www.w3.org'' />
<title></title> </head> <body> <p>Test</p> <p></p> <table style=''WIDTH:
360.0pt;BORDER-COLLAPSE: collapse;'' border=''0'' cellspacing=''0'' cellpadding=''0'' width=''480''>'
;
/* Remove every tag: "<", any run of characters other than "<" or ">", then ">" */
text2=prxchange('s/<[^<>]*>//',-1,text1);
run;
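
For the two-column table described in the question, the same substitution could be applied row by row. This is only a minimal sketch under some assumptions: the data set names have and want, the $2000 lengths, and the sample row are made up for illustration, and tags are replaced with a blank instead of nothing so that neighbouring words do not run together. Note also that the earlier pattern '<\s+.*?>' requires at least one whitespace character right after "<", so it matches none of the tags in the sample string.

```sas
/* Hypothetical input table; column names follow the question (id, my_text) */
data have;
  length id 8 my_text $2000;
  id = 1;
  my_text = '<html><head><title></title></head><body><p>Test</p>'
         || '<table border=''0''><tr><td>blah</td></tr></table></body></html>';
  output;
run;

data want;
  set have;
  /* Declare the length explicitly (an assumed size) rather than relying on a default */
  length plain_text $2000;
  /* Step 1: replace every tag with a blank */
  plain_text = prxchange('s/<[^<>]*>/ /', -1, my_text);
  /* Step 2: collapse runs of whitespace into one blank and trim both ends */
  plain_text = strip(prxchange('s/\s+/ /', -1, plain_text));
  put id= plain_text=;
run;
```

A pattern like this only strips tags. It does not decode entities such as &nbsp;, and it leaves the contents of <script> and <style> elements in place, so it is best suited to HTML like the sample above.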

Xia Keshan

RW9
Diamond | Level 26

Have you tried opening the file with Excel and saving the output? Excel has a pretty decent XML/HTML parser, so you could then save the output in delimited format to read into SAS. It depends on your HTML data, of course.
