Hi all
I am working on SAS 9.4 M3 (SYSVLONG4 = 9.04.01M3P06242015) and have encountered input data that contains RTF markup.
The structure varies quite a bit, because the data can originate from several different channels. So as far as I can see there are no quick and dirty fixes.
My data can look something like this:
{\rtf1\fbidis\ansi\deff0{\fonttbl{\f0\fswiss\fprq2\fcharset0 Arial;}{\f1\fswiss\fprq2\fcharset0 Calibri;}}
{\colortbl ;\red0\green0\blue0;}
\viewkind4\uc1 d\ltrpar\cf1\lang1030\f0\fs22 *REAL TEXT*
d\ltrpar\sa200\sl276\slmult1\cf0\f1 *REAL TEXT*
d\ltrpar\cf1\f0
}
Where *REAL TEXT* indicates what I am really interested in.
Is anyone familiar with a SAS function (e.g. user-written) or SAS macro that can actually strip this RTF formatting?
Best,
Sander Ehmsen, Denmark.
You really are not going to get anywhere with this, I am afraid. There is no simple method for parsing an RTF file into something usable. Many have tried, I have tried, and all have got some way and then given up.
RTF is itself a markup language, and there are loads of tags:
https://www.microsoft.com/en-us/download/details.aspx?id=10725
That is the latest spec. Now you could write a parser that takes each tag and finds its closing tag (if there is one), and Perl may help somewhat. But it is a big undertaking. Even RTF generated from SAS, which is pretty basic as RTF goes, can differ a lot between systems.
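To give a sense of what even a bare-bones parser involves, here is a minimal sketch in Python (not SAS, and not something anyone in this thread wrote). It tokenises the input into control words, control symbols, braces and plain text, tracks group nesting, and skips groups that open with a non-text destination. The function name `rtf_to_text`, the list in `SKIP_DESTINATIONS`, the cp1252 code page for `\'hh` escapes and the sample string are all assumptions for illustration. Tables, `\uN` Unicode escapes and embedded binary data are not handled.

```python
import re

# Assumed (incomplete) set of destination groups that never hold body text.
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

# One token per match: hex escape, control word (optional numeric parameter,
# optional single trailing space), control symbol, brace, or plain text.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})|\\([a-zA-Z]+)(-?\d+)? ?|\\(.)|([{}])|([^\\{}]+)",
    re.DOTALL,
)

def rtf_to_text(rtf):
    out = []
    stack = []        # "skipping" state saved at each opening brace
    skipping = False  # True while inside a group we want to ignore
    for m in TOKEN.finditer(rtf):
        hexcode, word, _param, symbol, brace, text = m.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif hexcode is not None:
            if not skipping:
                # Assumption: the data uses the Windows-1252 code page.
                out.append(bytes.fromhex(hexcode).decode("cp1252", errors="replace"))
        elif word is not None:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif word in ("par", "line") and not skipping:
                out.append("\n")
        elif symbol is not None:
            if symbol == "*":
                skipping = True          # \* marks an ignorable destination
            elif symbol in "\\{}" and not skipping:
                out.append(symbol)       # escaped literal \ { }
        elif text is not None and not skipping:
            # Raw line breaks in the RTF source carry no meaning.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out).strip()

# Hypothetical sample, loosely modelled on the data in the question.
sample = (
    r"{\rtf1\ansi\deff0{\fonttbl{\f0\fswiss Arial;}}"
    r"{\colortbl ;\red0\green0\blue0;}"
    r"\viewkind4\uc1\pard\cf1\f0\fs22 First line\par "
    r"\pard\f0 Second l\'e6ngde\par }"
)
print(rtf_to_text(sample))
```

Even this ignores most of the spec, which is the point being made above: it may be enough for short free-text fields, but not for tabular output.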
I have also looked at what @Kurt_Bremser mentioned, using another program to convert to another file format. And there are ways to get HTML or text output. However even that, unless it's a very simple file, really isn't much help. Tabular output for instance - which a lot of SAS output is - doesn't have any indication of position. RTF is literally one page at a time, cell by cell. So you first need to parse the header blocks, then go page by page, extract the information, then put it all together.
I would go back to the source data. It is the best method and, with limited time, the only feasible one.
RTF is an output format; I would refuse to write a program that parses RTF.
It could be possible to remove the formatting by using some regular expressions, but I don't know enough about RTF to suggest something that actually does that job.
Does the interesting part always start after the first blank?
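As a rough illustration of the regular-expression idea, here is a sketch in Python. It is an assumption-laden shortcut, not a tested solution for the data in the question: it drops a few known header groups (only one level of nested braces is handled), turns `\par` into a line break, deletes every remaining control word and then removes leftover braces. The function name `strip_rtf_quick` and the sample string are made up for the example. SAS's PRX functions such as PRXCHANGE take Perl-style patterns, so similar expressions could probably be carried over, but that is untested here.

```python
import re

# Header groups to drop; allows at most one level of nested braces inside.
HEADER_GROUPS = re.compile(
    r"\{\\(?:fonttbl|colortbl|stylesheet|info)\b(?:[^{}]|\{[^{}]*\})*\}"
)
PARAGRAPH = re.compile(r"\\par(?![a-zA-Z]) ?")      # \par but not \pard
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")    # e.g. \fs22 or \ltrpar

def strip_rtf_quick(rtf):
    s = rtf.replace("\r", "").replace("\n", "")     # source line breaks are noise
    s = HEADER_GROUPS.sub("", s)
    s = PARAGRAPH.sub("\n", s)
    s = CONTROL_WORD.sub("", s)
    s = s.replace("{", "").replace("}", "")         # ignores escaped \{ and \}
    return s.strip()

# Hypothetical sample input.
sample = (
    r"{\rtf1\ansi{\fonttbl{\f0\fswiss Arial;}}{\colortbl ;\red0\green0\blue0;}"
    r"\pard\f0\fs22 Hello world\par \pard\f0 Second line\par }"
)
print(strip_rtf_quick(sample))
```

This will break on escaped braces, `\'hh` character escapes and any destination group not in the list, so it only suits input whose structure is known to be simple.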
Depending on your environment, use VBA (Windows with MS Office) or shell scripting with OpenOffice (all open platforms) to load the RTF file and save it as .txt.
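A minimal sketch of the scripted conversion, written in Python and assuming LibreOffice (the OpenOffice fork) is installed with its `soffice` binary on the PATH. The options used are LibreOffice's headless command-line conversion; the helper names `build_convert_command` and `convert_rtf_to_txt` and the file paths are hypothetical.

```python
import subprocess
from pathlib import Path

def build_convert_command(rtf_path, out_dir, office_binary="soffice"):
    # Assumption: LibreOffice's command-line converter is available.
    return [
        office_binary,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(rtf_path),
    ]

def convert_rtf_to_txt(rtf_path, out_dir):
    # Runs the conversion and returns the path where the .txt is expected.
    subprocess.run(build_convert_command(rtf_path, out_dir), check=True)
    return Path(out_dir) / (Path(rtf_path).stem + ".txt")
```

The resulting text file could then be read into SAS with an ordinary INFILE statement. Whether the plain-text output keeps enough structure depends on the document.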
Thank you for your suggestion.
Our SAS will soon run on a Linux platform, and my Data Custodians have refused to implement an RTF parser on that platform.
So according to them it is not feasible.
Best,
Sander.
Tell them to look up "Mordac the Preventer".
Yes, they must have some raw data they used to generate the RTF, so that is the best method.