09-05-2013 04:42 AM
I can't find any built-in functions that allow compression of text values. What I'm trying to do is reduce a long string to a shorter length for a fixed-width field on an output file.
Is this possible?
09-05-2013 05:02 AM
On an output file there are no data types. Without seeing an example, I would suggest converting your field to char and then compressing it before exporting to the output file.
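If "compress" in this reply means the COMPRESS function, it is worth seeing what it does: it removes characters (blanks by default) rather than encoding the value. A minimal sketch, using the sentence from later in this thread as assumed sample data:

```sas
data _null_ ;
  length long_text short_text $44 ;
  long_text  = 'the quick brown fox jumped over the lazy dog' ;
  /* with a single argument, COMPRESS removes all blanks */
  short_text = compress(long_text) ;
  len = length(short_text) ;
  /* writes the shortened value and its length to the log */
  put short_text= len= ;
run ;
```

Removing the 8 blanks takes the 44-character string down to 36 characters, so this alone will not fit a much narrower field.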
09-05-2013 05:07 AM
An example: convert the string 'the quick brown fox jumped over the lazy dog', length 44, to a number such as 12345678, length 8. I'm not sure whether a compression algorithm exists that can do this, or whether SAS has a way to do it...
09-05-2013 07:30 AM
This sounds like you want to encode your long text strings as shorter coded values. Assuming your longer expressions are not all unique, you could build a list of all the unique values, ready to pop into proc format:
Proc SQL ;
  Create table list1 as
  Select distinct long_expression as start
       , 'xxxxxxxx' as label
       , 'exprfmt' as fmtname
       , 'C' as type
  from have   /* assumes the input data set is HAVE, as in the last step */
  order by 1 ;
quit ;
/* update the labels with numerical values */
data list2 ;
  set list1 ;
  label = put(_N_, z8.) ;
run ;
/* create the format */
Proc format cntlin = list2 ;
run ;
/* use the format to encode your data */
Data want ;
  set have ;
  encoded_value = put(long_expression, $exprfmt.) ;
  drop long_expression ;
run ;
/* now output your data */
/* also output start and label values to give you a lookup table for the original expression */
09-05-2013 07:40 AM
Maybe you could play a little with the md5() function. An MD5 digest is 16 bytes, so writing it out with the $hex32. format always gives a string of 32 positions; whether you benefit from this depends on your input data...?
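A minimal sketch of the md5() idea follows. The MD5 digest is 16 bytes, which $hex32. renders as 32 hexadecimal characters. Cutting the digest down to 8 characters is an assumption added here to match the 8-position field from the question, and the variable names are hypothetical:

```sas
data _null_ ;
  length long_text $44 digest $32 short_code $8 ;
  long_text  = 'the quick brown fox jumped over the lazy dog' ;
  /* MD5 returns 16 bytes; $hex32. shows them as 32 hex characters */
  digest     = put(md5(trim(long_text)), $hex32.) ;
  /* assumption: keep only the first 8 hex characters; this makes it
     more likely that two different strings end up with the same code */
  short_code = substr(digest, 1, 8) ;
  put digest= short_code= ;
run ;
```

A hash is one-way, so the original text cannot be recovered from the code; a lookup table of codes and original strings would still be needed for that.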
09-09-2013 05:40 AM
You should take into consideration that each character (or string) on your computer is merely a representation of numbers at the data level.
High Level: Character/String
Data Level: Numbers with different lengths according to the Character encoding
Deep down in your system: lots of 0s and 1s, which correspond to a magnetic or non-magnetic state on your hard drive (the same logic, but different physics, applies to flash drives, etc.)
Let's say the letter 'a' is represented by '1', 'b' is represented by '2' and so on (you may want to read about Unicode or ASCII to find out the actual representations).
So the string that you see as 'abcde fgh' in fact is saved as '123450678'. (if we assume that '0' stands for blank)
I do not understand how it would be possible to reduce a number to an even shorter number without losing any information.
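The counting argument behind this can be made concrete. Assuming, purely for illustration, an alphabet of 27 symbols (26 lowercase letters plus the blank), a 44-character string and an 8-digit field as in the example earlier in the thread:

```latex
\[
\underbrace{27^{44}}_{\text{possible 44-character strings}} \approx 10^{63}
\qquad \gg \qquad
\underbrace{10^{8}}_{\text{possible 8-digit codes}}
\]
```

So no reversible mapping can cover every possible string. At best, a lookup table can cover the strings that actually occur in the data, provided there are no more than about 10^8 distinct ones.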
What you could try to do is reduce the length of the string by cutting off some parts at the end or you could produce your output and zip it.
Also - as suggested by RichardinOz - you might use formats if your strings are repetitive.
The choice of method depends on the task at hand.
Please respond and provide more details, if you need further assistance.
09-10-2013 05:45 AM
Thanks all, but I think what I need is some kind of lossy compression algorithm, and nothing like this seems to exist in SAS at the variable level. Probably one for my spare time, if I ever get any...
09-10-2013 07:50 AM
Well, you could still cut the string and thereby reduce it in a very lossy fashion.
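A minimal sketch of plain truncation, assuming an input data set HAVE with a character variable long_expression (the names used earlier in this thread); short_expression and the width of 8 are hypothetical:

```sas
data want ;
  set have ;
  length short_expression $8 ;
  /* assigning to a shorter character variable keeps only the
     first 8 characters; everything after them is lost */
  short_expression = long_expression ;
  drop long_expression ;
run ;
```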
On the other hand, if you are using SPDS for data storage, there is an option you could use to compress data sets.
One would not think that disk space would be such an issue these days. :-)
09-10-2013 09:11 AM
Yes, of course truncation is an option!
But I'm looking for a more systematic loss of the redundancies in the string, rather than just losing whatever comes after the length limit; hence a compression algorithm suited to strings.
It's not about disk space actually, nor about compression at the data set level.
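One crude way to lose redundancy systematically, rather than only at the end of the string, is to strip characters that carry little information. The sketch below drops vowels and blanks with the COMPRESS function. The choice of characters is an assumption for illustration, not an established algorithm, and different strings can collapse to the same result:

```sas
data _null_ ;
  length long_text short_text $44 ;
  long_text  = 'the quick brown fox jumped over the lazy dog' ;
  /* second argument lists the characters to remove (vowels and blank);
     the 'i' modifier makes the match case-insensitive */
  short_text = compress(long_text, 'aeiou ', 'i') ;
  len = length(short_text) ;
  put short_text= len= ;
run ;
```

For the sample sentence this leaves 24 of the original 44 characters, so it shortens the value but is still far from an 8-position field.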
09-10-2013 09:27 AM
I see. Well, you could still give Huffman coding a try, but I fear you would need to implement it yourself:
On the other hand, the COMPRESS option in SAS seems to work along similar lines, but then again, I am not too familiar with the details.
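For reference, the COMPRESS= data set option looks like this in use. COMPRESS=CHAR (or YES) applies run-length encoding to each observation, while COMPRESS=BINARY uses Ross Data Compression. The data set names are the ones assumed earlier in the thread:

```sas
/* compress the stored observations of WANT; the variable values
   themselves are unchanged when the data set is read back */
data want (compress=char) ;
  set have ;
run ;
```

This reduces the size of the SAS data set on disk only. Values written to a fixed-width output file are exactly as long as before.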
Let us know, if you have found a suitable solution :-)