We have a requirement to display lots of formatted text in pdf. The data is stored in a database text field, which contains html tags.
Feeling very brave, I have started coding a data step that "parses" the html and turns it into inline style definition statements. It's pretty rudimentary, and a bit inexact (obviously it would be impossible to exactly replicate html in the cell of a table in a pdf document), but so far it works okay. Until there are nested <ul>s...
The data step has a bunch of tranwrd calls that turn, for example, '</p>'s into '^n's.
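To make the replacement idea concrete, here is a minimal sketch of a tag-to-escape lookup. It is written in Python purely to illustrate the logic, not as the actual data step. Only the '</p>' to '^n' mapping comes from the post; the other entries, the names `TAG_MAP` and `tags_to_inline`, and the use of '^' as the ODS escape character with the `^{style [...]text}` inline form are assumptions.

```python
import re

# Hypothetical tag-to-escape table. Only '</p>' -> '^n' is taken from the post;
# the remaining entries are illustrative assumptions.
TAG_MAP = {
    "<p>": "",                            # opening paragraph tag: drop it
    "</p>": "^n",                         # closing paragraph tag: line break
    "<br>": "^n",                         # explicit line break
    "<b>": "^{style [font_weight=bold]",  # open an inline bold style
    "</b>": "}",                          # close the inline style
}

# One case-insensitive pattern that matches any tag in the table.
TAG_PATTERN = re.compile("|".join(re.escape(t) for t in TAG_MAP), re.IGNORECASE)

def tags_to_inline(text):
    """Replace each known tag with its inline-formatting counterpart."""
    return TAG_PATTERN.sub(lambda m: TAG_MAP[m.group(0).lower()], text)

print(tags_to_inline("<P>Sales rose <b>10%</b>.</P>"))
```

The lookup is case-insensitive on purpose. tranwrd is case-sensitive, so a data step version would need to handle '<P>' and '<p>' separately or normalise the tags first. Tags that carry attributes are not matched by this sketch.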
Has anyone else done anything like this? Any thoughts from the wizened and sage Cynthia, or others?
Does there need to be an error or problem? I want to know whether anyone else has had to do something similar to what I've described, and how they've gone about it, what tips they may have, what to watch out for.
That line goes on to say (semantically) "Until there are nested ul tags." I've actually put the tag (following the directions in the post you linked to) in but perhaps it is not displaying properly at your end.
In any case, I am still tweaking code and figuring out how to represent the formats indicated by the tags, and it was at that point that it occurred to me that there may be some participants in this discussion forum who may have done similar things.
I appreciate your enthusiasm to help me with errors/problems/things that aren't working. But this post is at the discussion rather than the troubleshooting end of the spectrum.
Scott posted a very good link that explains why < and > cannot always be posted without being changed into &lt; and &gt; symbols (the forum message window seems to accept a -few- bracket symbols, but has particular difficulty with those). In addition, if you were going to post code, you need to use the [pre] and [/pre] tags around your code.
Since you have <P> tags in your HTML, it sounds like you have a document model, rather than a data table model of HTML. I would be very tempted to not parse the HTML with SAS, but instead to open the HTML file with Word (which has read HTML ever since Office 97) and then either convert to PDF with Word (possible in Word 2007) or directly distill HTML to PDF using Adobe or some 3rd party converters.
If there are tables in the HTML page that you need to convert to SAS data sets, then that might be a slightly stronger reason for reading the HTML file with SAS. I don't actually understand why or how you would convert a <P> tag into an inline style (unless you were dealing with HTML 2.0 or HTML 3.2 style of tags that contained embedded font information.)
If you had a LOT of HTML files that needed to be converted to PDF, I would be very tempted to talk to your web folks. If they don't already have an HTML to PDF converter in their arsenal, there are many free or very low cost HTML-to-PDF converter applications that can be invoked in "batch" mode via a Perl script or a command file. If you Google "HTML to PDF converter batch" or just "HTML to PDF converter", there are lots and lots of hits on programs that will convert HTML to PDF.
I have a hard time imagining nested <ul> tags INSIDE a table cell. An outline of some kind??? I can't imagine much use for nested, bulleted, unordered lists inside a table cell. And even when I do imagine some kind of bulleted list, it still sounds more like a document to me than data that you need to parse with SAS.
On the other hand, while I can imagine that you -could- parse the nested tags, it will take a LOT of counter variables and a LOT of trial and error. HTML can be very messy -- people can have <P> tags without closing </P> tags. There might not be any white space to give you a hint about where things stop and start. Tables don't always end -- early HTML didn't always conform to the specs and browsers were very forgiving of mismatched tags and missing closing tags. You will have to program for all of that (end one tag logically, when you find the beginning of the next tag).
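As a rough sketch of the counter-variable approach described above (Python, to illustrate the logic only): one counter tracks <ul> nesting depth, and a flag ends an unclosed <P> logically when the next <P> begins. The class name `InlineStyleConverter`, the function `convert`, the two-space indent, and the "- " bullet marker are all hypothetical choices. How an indent or bullet should actually be rendered in a PDF cell is left open.

```python
from html.parser import HTMLParser

class InlineStyleConverter(HTMLParser):
    """Convert a small HTML subset to text with '^n' breaks and indented bullets."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # counter: current <ul> nesting level
        self.in_para = False  # True while a <p> is logically open
        self.out = []         # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.in_para:          # previous <p> was never closed: end it here
                self.out.append("^n")
            self.in_para = True
        elif tag == "ul":
            self.depth += 1
        elif tag == "li":
            indent = "  " * max(self.depth - 1, 0)   # hypothetical two-space indent
            self.out.append("^n" + indent + "- ")    # hypothetical bullet marker

    def handle_endtag(self, tag):
        if tag == "p" and self.in_para:
            self.out.append("^n")
            self.in_para = False
        elif tag == "ul" and self.depth > 0:  # ignore a stray closing </ul>
            self.depth -= 1

    def handle_data(self, data):
        if data.strip():              # skip whitespace-only runs between tags
            self.out.append(data)

def convert(html_text):
    parser = InlineStyleConverter()
    parser.feed(html_text)
    parser.close()                    # flush any buffered trailing text
    return "".join(parser.out)

print(convert("<ul><li>One<ul><li>Nested</li></ul></li><li>Two</ul>"))
```

Closing </li> tags are deliberately ignored, so a missing one makes no difference. Each new <li> starts its own line at the current depth.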
I'll admit to a certain amount of curiosity about what's in the original HTML files and how it needs to be translated to an in-line style (and why). That's my .02 on what you need/want to do. Without more information or some sample of the original "data" (meaning what the HTML looks like), it's only speculation about whether what you're trying will work or is coded in the best way possible.
I'm flattered that you consider me to be "sage", and, while I may have seen the first Captain Kangaroo TV show...I'm not wizened!
I am inspired to provide a bit more context about what has created this scenario.
First, the content that we are dealing with here is narrative commentary that belongs with performance results. People provide this commentary by typing into a text box on a web form, and while doing so they may make use of a range of formatting options such as paragraphs, numbering, bullets, styling text etc. Once they provide this information and click save, their narrative is saved to a database field with html tags to preserve the formatting that they provided. Hence, it is "data" rather than web pages that I am parsing.
Once the data is stored, it gets presented in any of a number of ways. First, it may be displayed as part of a web page that presents information for a particular KPI. In this case, it's as simple as retrieving the content of that field into the right place in a web page, or in a frame on a web page. Similarly, it may be dumped out to be part of an email message, in which case again, the email is sent as an html document, and it slides in nicely. Finally, there are a lot of pdf reports that present back this information, and that's where I'm up to now. The pdfs are generated on the fly using stored processes, and may be filtered based on user preferences. In these pdf documents, usually a table of KPIs is presented, and the commentary is included as the contents of a cell in the KPI row to which it belongs.
Obviously the html tags can't be shown in the cell, nor can we just strip out the tags and leave the unformatted text. So, I am working on creating an approximation of the html formatting using inline style statements. There are a limited number of formatting tags that can be used, so I'm not trying to reproduce whole pages of html for example with positioned images and so forth, just those relating to text content. As for the P tags, perhaps the text input box conversion is putting them in when it ought not to. I'll look into that.
I tried including some demo code with datalines containing text with tags, but of course it all gets mangled by the markup for this forum. Is there some way I can send demo code directly to you?
An interesting exercise -- allowing folks to type and use formatting cosmetics in a text box and then keep those HTML-based formatting instructions intact or converting them to PDF formatting.
And you're using the BI Platform and Stored Processes. No small task, there!
All I can recommend is that you work with Tech Support. When you work on the Platform and with stored processes, depending on your client application, some clients use SAS style templates, some use CSS, some use style definitions that are defined in the XML that's used by Tomcat. So before you go too far down this road, I would make sure that what you want to do can even be done in the context of a stored process and, if so, what client applications will support what you want to do with the conversion of this text box to other forms (email, PDF, etc).
As for including demo code: Scott very kindly posted a link that explains how to post code -and- HTML -and- have the indentation and "code" maintained. If you plan to post HTML, you should convert all the < characters to &lt; and all the > characters to &gt;, as explained HERE (forum link posted here for your convenience): http://support.sas.com/forums/thread.jspa?messageID=27609&#27609
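For anyone who would rather script that conversion than do it by hand, a minimal sketch in Python follows. The function name `escape_for_forum` is hypothetical. Note that the standard library's `html.escape` also converts & to &amp;, which goes slightly beyond the two replacements described above.

```python
import html

def escape_for_forum(snippet):
    """Escape &, < and > so HTML can be pasted into a forum message as text."""
    # quote=False leaves single and double quotes untouched.
    return html.escape(snippet, quote=False)

print(escape_for_forum("<ul><li>a & b</li></ul>"))
```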
However, I think that you really should work with Tech Support for some guidance on whether what you want to do is even do-able in the context of the BI Platform and stored processes, rather than post a hunk o'HTML here or send it to me directly. I'm about to go out of town to teach, so will not have much time to respond to emails.