
How to remove RTF-formatting

Contributor
Posts: 39
Hi all

 

I am working on SAS 9.4 M3 (SYSVLONG4 = 9.04.01M3P06242015) and have encountered input data that contains RTF formatting.

The RTF comes in quite a few variants, because the data can originate from several different channels. So as far as I can see, there is no quick and dirty fix.

 

My data can look something like this:

{\rtf1\fbidis\ansi\deff0{\fonttbl{\f0\fswiss\fprq2\fcharset0 Arial;}{\f1\fswiss\fprq2\fcharset0 Calibri;}}
{\colortbl ;\red0\green0\blue0;}
\viewkind4\uc1 d\ltrpar\cf1\lang1030\f0\fs22 *REAL TEXT*
d\ltrpar\sa200\sl276\slmult1\cf0\f1 *REAL TEXT*
d\ltrpar\cf1\f0
}

 

Where *REAL TEXT* indicates what I am really interested in. 

Are any of you familiar with a SAS function (e.g. user-written) or SAS macro that can strip RTF formatting like this?

 

Best,

Sander Ehmsen, Denmark.


Accepted Solutions
Solution
3 weeks ago
Super User
Posts: 9,599

Re: How to remove RTF-formatting

Posted in reply to SanderEhmsen

I am afraid you are not really going to get anywhere with this. There is no simple method for parsing an RTF file into something usable. Many have tried, I have tried, and all got some way and then gave up.

RTF is itself a markup language, and there are loads of tags (control words):

https://www.microsoft.com/en-us/download/details.aspx?id=10725

 

That is the latest spec. You could write a parser that takes each tag and finds its closing tag (if there is one), and Perl may help somewhat, but it is a big undertaking. Even RTF generated by SAS, which is fairly basic as RTF goes, can differ a lot between systems.
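
To give a feel for what "write a parser" involves, here is a minimal sketch in Python (not SAS) of a group-aware stripper. It is only an illustration under assumptions: the list of skipped destinations is incomplete, the `cp1252` code page is a guess, the sample string is a made-up simplification of the data in the question, and tables, `\u` Unicode escapes and embedded objects are not handled.

```python
import re

# Destinations whose whole group holds no body text (assumed, incomplete list).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped byte such as \'f8
    r"|\\([^a-zA-Z])"            # control symbol such as \{ \} \\ \*
    r"|([{}])"                   # group open / close
    r"|([^\\{}]+)"               # plain text run
)


def rtf_to_text(rtf, encoding="cp1252"):
    """Return the body text of a simple RTF string (hypothetical helper)."""
    out = []
    stack = []        # saved "skipping" flag for every open group
    skipping = False  # True while inside a group that holds no body text
    for m in TOKEN.finditer(rtf):
        word, _num, hexbyte, symbol, brace, text = m.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif skipping:
            continue
        elif word:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif word in ("par", "line"):
                out.append("\n")
            elif word == "tab":
                out.append("\t")
        elif hexbyte:
            out.append(bytes.fromhex(hexbyte).decode(encoding, errors="replace"))
        elif symbol:
            if symbol in "{}\\":
                out.append(symbol)   # escaped literal brace or backslash
            elif symbol == "*":
                skipping = True      # \* marks an ignorable destination
        elif text:
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out).strip()


# Made-up sample, loosely modelled on the structure shown in the question.
SAMPLE = (
    r"{\rtf1\ansi{\fonttbl{\f0\fswiss Arial;}}"
    r"{\colortbl ;\red0\green0\blue0;}"
    r"\pard\f0\fs22 R\'f8d gr\'f8d\par"
    r"\pard Line two\par}"
)

print(rtf_to_text(SAMPLE))
```

Even this small version needs a group stack, a skip list and a code page, which is why the approach grows quickly once real-world variants show up.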

 

I have also looked at what @KurtBremser mentioned: using another program to convert to another file format. There are ways to get HTML or text output. However, unless it's a very simple file, even that really isn't much help. Tabular output, for instance, which a lot of SAS output is, doesn't carry any indication of position. RTF is literally one page at a time, cell by cell. So you first need to parse the header blocks, then go page by page, extract the information, and then put it all together.

 

I would go back to the source data. It is the best method and, with limited time, the only feasible one.



All Replies
Valued Guide
Posts: 574

Re: How to remove RTF-formatting

Posted in reply to SanderEhmsen

RTF is an output format; I would refuse to write a program that parses RTF.

It might be possible to remove the formatting with some regular expressions, but I don't know enough about RTF to suggest something that actually does the job.

 

Does the interesting part always start after the first blank?
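
As a rough illustration of the regular-expression idea, here is a Python sketch (not SAS). The sample string is a made-up simplification of the data in the question, and the patterns are assumptions that only cover simple cases: escaped characters such as `\'f8`, escaped braces, and deeply nested groups are not handled.

```python
import re

# Made-up sample, loosely modelled on the structure shown in the question.
SAMPLE = (
    r"{\rtf1\ansi\deff0{\fonttbl{\f0\fswiss Arial;}}"
    "\n"
    r"{\colortbl ;\red0\green0\blue0;}"
    "\n"
    r"\viewkind4\uc1\pard\cf1\f0\fs22 First line\par"
    "\n"
    r"\pard\cf0\f1 Second line\par"
    "\n"
    "}"
)


def strip_rtf_regex(rtf):
    """Strip simple RTF markup with regular expressions (hypothetical helper)."""
    # 1. Drop the font and colour tables, allowing one level of nested braces.
    text = re.sub(r"\{\\(?:fonttbl|colortbl)(?:[^{}]|\{[^{}]*\})*\}", "", rtf)
    # 2. Turn paragraph marks into newlines (\par, but not \pard).
    text = re.sub(r"\\par(?![a-z])\s?", "\n", text)
    # 3. Remove remaining control words: backslash, letters, optional number,
    #    and the single space that may terminate the control word.
    text = re.sub(r"\\[a-zA-Z]+-?\d*[ ]?", "", text)
    # 4. Remove the group braces that are left over.
    text = re.sub(r"[{}]", "", text)
    # 5. Tidy up: trim each line and drop empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


print(strip_rtf_regex(SAMPLE))
```

The order of the substitutions matters: the header tables have to go first, otherwise the font names inside them survive as if they were body text.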

Contributor
Posts: 39

Re: How to remove RTF-formatting

Posted in reply to andreas_lds
Refusing is probably not an acceptable solution here. I have tried to find patterns in the RTF code manually, such as locating the first blank. I can get something like 90% right with this method, but the last 10% fails miserably :-).
Super User
Posts: 10,259

Re: How to remove RTF-formatting

Posted in reply to SanderEhmsen

Depending on your environment, use VBA (Windows with MS Office) or shell scripting with OpenOffice (all open platforms) to load the RTF file and save it as .txt.
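
On an open platform, that conversion could be scripted roughly as below. This is a sketch, assuming LibreOffice or OpenOffice is installed and its `soffice` binary is on the PATH. `--headless`, `--convert-to` and `--outdir` are LibreOffice command-line switches; the `txt:Text` filter name and both function names are assumptions for illustration and should be checked against the installed version.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(rtf_path, out_dir, binary="soffice"):
    """Build the command line for a headless RTF-to-text conversion."""
    return [
        binary,
        "--headless",                # run without a GUI
        "--convert-to", "txt:Text",  # target extension and (assumed) filter name
        "--outdir", str(out_dir),
        str(rtf_path),
    ]


def convert_rtf_to_txt(rtf_path, out_dir):
    """Run the conversion and return the expected path of the .txt file."""
    if shutil.which("soffice") is None:
        raise FileNotFoundError("soffice was not found on the PATH")
    subprocess.run(build_convert_command(rtf_path, out_dir), check=True)
    return Path(out_dir) / (Path(rtf_path).stem + ".txt")
```

The resulting text file can then be read into SAS like any other flat file.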

Contributor
Posts: 39

Re: How to remove RTF-formatting

Posted in reply to KurtBremser

Thank you for your suggestion.

 

Our SAS will soon run on a Linux platform, and my data custodians have refused to implement an RTF parser on that platform.

 

So according to them it is not feasible.

 

Best, 

Sander.

Super User
Posts: 10,259

Re: How to remove RTF-formatting

Posted in reply to SanderEhmsen


Tell them to look up "Mordac the Preventer".

Contributor
Posts: 39

Re: How to remove RTF-formatting

Thank you very much for your reply.

I have contacted my data provider, and maybe they can strip it at their end.

I might use some tranwrd() calls to remove the most common RTF codes. It will not get me all the way, but it might be better for my end users.
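
The same idea, sketched in Python rather than SAS for illustration: each `str.replace` call plays the role of one `tranwrd()` call. The list of codes is an assumption taken from the sample in the question and would need extending for real data.

```python
# Codes taken from the sample in the question (an assumed, incomplete list).
COMMON_RTF_CODES = [
    r"\pard", r"\ltrpar", r"\par", r"\cf0", r"\cf1",
    r"\lang1030", r"\f0", r"\f1", r"\fs22",
]


def strip_common_codes(value, codes=COMMON_RTF_CODES):
    """Remove known RTF codes by literal replacement (hypothetical helper)."""
    # Longest codes first, so that \pard is removed before \par can clip it.
    for code in sorted(codes, key=len, reverse=True):
        value = value.replace(code, "")
    # Collapse the whitespace left behind by the removed codes.
    return " ".join(value.split())


print(strip_common_codes(r"\pard\ltrpar\cf1\lang1030\f0\fs22 Real text\par"))
```

Codes that are not in the list stay in the output, and a short code can clip a longer one (for example `\f1` inside `\f10`), which matches the expectation that this will not get all the way.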
Super User
Posts: 9,599

Re: How to remove RTF-formatting

Posted in reply to SanderEhmsen

Yes, they must have some raw data they used to generate the RTF, so that is the best method.

☑ This topic is solved.
