Hi all
I am working on SAS 9.4 M3 (SYSVLONG4 = 9.04.01M3P06242015) and have encountered input data that contains RTF formatting.
The data can arrive through several different channels, so its RTF structure varies quite a bit. As far as I can see, there is no quick and dirty fix.
My data can look something like this:
{\rtf1\fbidis\ansi\deff0{\fonttbl{\f0\fswiss\fprq2\fcharset0 Arial;}{\f1\fswiss\fprq2\fcharset0 Calibri;}}
{\colortbl ;\red0\green0\blue0;}
\viewkind4\uc1 d\ltrpar\cf1\lang1030\f0\fs22 *REAL TEXT*
d\ltrpar\sa200\sl276\slmult1\cf0\f1 *REAL TEXT*
d\ltrpar\cf1\f0
}
Where *REAL TEXT* indicates what I am really interested in.
Are any of you familiar with a SAS function (e.g. user-written) or SAS macro that can strip this RTF formatting?
Best,
Sander Ehmsen, Denmark.
Accepted Solutions
I am afraid you are not really going to get anywhere with this. There is no simple method for parsing an RTF file into something usable. Many have tried, myself included, and all of us got some way and then gave up.
The file itself is a markup language, and there are loads of tags:
https://www.microsoft.com/en-us/download/details.aspx?id=10725
That is the latest spec. You could write a parser that takes each tag and finds its closing tag (if there is one), and Perl may help somewhat, but it is a big undertaking. Even RTF generated by SAS, which is fairly basic as RTF goes, can differ a lot between systems.
I have also looked at what @Kurt_Bremser mentioned: using another program to convert to another file format. There are ways to get HTML or text output. However, unless it is a very simple file, even that really isn't much help. Tabular output, for instance, which a lot of SAS output is, doesn't carry any indication of position. RTF is literally one page at a time, cell by cell. So you first need to parse the header blocks, then go page by page, extract the information, and then put it all together.
I would go back to the source data. It is the best method and, with limited time, the only feasible one.
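To give a feel for what "write a parser" means for data like the sample in the question, here is a minimal sketch in Python (not SAS). It walks the RTF token by token, tracks brace groups, and skips groups such as the font and colour tables. The sample string, the `rtf_to_text` name, the list of skipped destinations, and the cp1252 default are all assumptions for illustration; it ignores `\uN` Unicode escapes, tables, and much else in the spec.

```python
import re

# Destinations whose content is metadata, not document text (a partial list).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional delimiter space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped byte such as \'f8
    r"|\\([^a-zA-Z])"            # control symbol such as \{ \} \\ \*
    r"|([{}])"                   # group open / close
    r"|[\r\n]+"                  # raw line breaks carry no meaning in RTF
    r"|(.)"                      # anything else is text
)

def rtf_to_text(rtf, encoding="cp1252"):
    """Return the visible text of a simple RTF string (illustrative sketch only)."""
    out = []
    stack = []        # "skipping" state saved at each open brace
    skipping = False  # True while inside a destination we want to drop
    for m in TOKEN.finditer(rtf):
        word, _param, hexbyte, symbol, brace, char = m.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif word:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif word in ("par", "line") and not skipping:
                out.append("\n")
            elif word == "tab" and not skipping:
                out.append("\t")
        elif symbol:
            if symbol == "*":              # \* marks an ignorable destination
                skipping = True
            elif symbol in "{}\\" and not skipping:
                out.append(symbol)         # escaped literal brace or backslash
        elif hexbyte and not skipping:
            out.append(bytes.fromhex(hexbyte).decode(encoding, errors="replace"))
        elif char and not skipping:
            out.append(char)
    return "".join(out)

# Hypothetical sample, modeled on the structure shown in the question.
SAMPLE = (
    r"{\rtf1\ansi\deff0{\fonttbl{\f0\fswiss\fcharset0 Arial;}}" "\n"
    r"{\colortbl ;\red0\green0\blue0;}" "\n"
    r"\viewkind4\uc1\pard\cf1\lang1030\f0\fs22 R\'f8d gr\'f8d\par" "\n"
    r"\pard\sa200\cf0\f1 Second line\par" "\n"
    r"}"
)

print(rtf_to_text(SAMPLE))
```

Even this small sketch needs a group stack and special cases for escapes, which is a fair indication of why a complete parser is a big undertaking.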
RTF is an output format; I would refuse to write a program that parses RTF.
It might be possible to remove the formatting with some regular expressions, but I don't know enough about RTF to suggest something that actually does the job.
Does the interesting part always start after the first blank?
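For what it is worth, the regular-expression idea can be sketched roughly as follows. This is Python rather than SAS, purely to show the patterns; the sample string and the `strip_rtf_regex` name are made up for illustration, and the approach only copes with simple input like the sample in the question. It does not handle hex escapes such as `\'f8`, `\uN` Unicode escapes, or escaped braces. The same patterns could presumably be carried over to the SAS PRX functions, but that is untested here.

```python
import re

# Hypothetical sample, modeled on the structure shown in the question.
SAMPLE = (
    r"{\rtf1\ansi\deff0{\fonttbl{\f0\fswiss\fcharset0 Arial;}}" "\n"
    r"{\colortbl ;\red0\green0\blue0;}" "\n"
    r"\viewkind4\uc1\pard\cf1\lang1030\f0\fs22 First line\par" "\n"
    r"\pard\sa200\cf0\f1 Second line\par" "\n"
    r"}"
)

def strip_rtf_regex(rtf):
    """Remove RTF markup from simple input with a few regular expressions."""
    # 1. Drop the font and colour tables (allows one level of nested braces).
    text = re.sub(r"\{\\(?:fonttbl|colortbl)(?:[^{}]|\{[^{}]*\})*\}", "", rtf)
    # 2. Turn paragraph marks into newlines; the lookahead keeps \pard intact.
    text = re.sub(r"\\par(?![a-z]) ?", "\n", text)
    # 3. Remove remaining control words: backslash, letters, optional
    #    number, and the single optional space that ends the control word.
    text = re.sub(r"\\[a-z]+-?\d* ?", "", text)
    # 4. Remove group braces and drop empty lines.
    text = re.sub(r"[{}]", "", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

print(strip_rtf_regex(SAMPLE))
```

The order of the steps matters: the tables have to go before the generic control-word rule runs, otherwise font names such as "Arial;" are left behind as if they were text.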
Depending on your environment, use VBA (Windows with MS Office) or shell scripting with OpenOffice (all open platforms) to load the rtf and save it as .txt.
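As a rough sketch of the scripted conversion route: the snippet below assumes LibreOffice (not classic OpenOffice) is installed and that its `soffice` binary is on the PATH, and it uses LibreOffice's headless `--convert-to` option. The function names and paths are hypothetical, and whether the resulting text is good enough depends on the input.

```python
import subprocess
from pathlib import Path

def build_convert_command(rtf_path, out_dir):
    """Build the LibreOffice command line that converts one RTF file to plain text."""
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(rtf_path),
    ]

def convert_rtf_to_txt(rtf_path, out_dir):
    """Run the conversion and return the expected path of the .txt file."""
    subprocess.run(build_convert_command(rtf_path, out_dir), check=True)
    return Path(out_dir) / (Path(rtf_path).stem + ".txt")
```

The resulting .txt files could then be read into SAS with an ordinary INFILE statement.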
Thank you for your suggestion.
Our SAS will soon run on a Linux platform, and my Data Custodians have refused to implement an RTF parser on that platform.
So according to them it is not feasible.
Best,
Sander.
@SanderEhmsen wrote:
Thank you for your suggestion.
Our SAS will soon run on a Linux platform, and my Data Custodians have refused to implement an RTF parser on that platform.
So according to them it is not feasible.
Best,
Sander.
Tell them to look up "Mordac the Preventer".
I have contacted my data provider, and maybe they can strip it on their end.
I might use tranwrd() to remove the most common RTF codes. It will not get all the way, but it might be better for my end users.
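To illustrate what that literal-replacement approach amounts to, here is a small Python sketch that mirrors a chain of tranwrd() calls. The list of codes is an assumption taken from the sample data in my question, and the function name is made up. One thing it shows: longer codes need to be removed first, otherwise removing `\par` clips `\pard` and leaves a stray `d` behind.

```python
# Codes seen in the sample data; a real list would need to be much longer.
COMMON_CODES = [
    r"\pard", r"\ltrpar", r"\par", r"\cf1", r"\cf0",
    r"\f0", r"\f1", r"\fs22", r"\lang1030",
]

def strip_common_codes(text):
    """Remove known RTF codes by literal replacement, longest code first."""
    for code in sorted(COMMON_CODES, key=len, reverse=True):
        text = text.replace(code, "")
    # Collapse the whitespace left behind by the removed codes.
    return " ".join(text.split())

print(strip_common_codes(r"\pard\ltrpar\cf1\lang1030\f0\fs22 REAL TEXT\par"))
print(strip_common_codes(r"\pard\sa200\f1 REAL TEXT"))  # \sa200 is not in the list and survives
```

Any code that is not in the list stays in the output, which is why this will only ever be a partial clean-up.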
Yes, they must have some raw data they used to generate the RTF, so that is the best method.