Compressing a long string to a numeric value

Occasional Contributor
Posts: 5

I can't find any built-in functions that allow compression of text values. What I'm trying to do is reduce a long string to a shorter length for a fixed-width field on an output file.

Is this possible?

Super User
Posts: 5,256

Re: Compressing a long string to a numeric value

On an output file there are no data types. Without seeing an example: convert your field to char, and then compress it before exporting to the output file.

Data never sleeps
Occasional Contributor
Posts: 5

Re: Compressing a long string to a numeric value

An example: convert the string 'the quick brown fox jumped over the lazy dog', with a length of 44, to a number such as 12345678, with a length of 8. Not sure if a compression algorithm exists which can do this, or if SAS has a way to do it...

Super Contributor
Posts: 644

Re: Compressing a long string to a numeric value

This sounds like you want to encode your long text strings as shorter coded values. Assuming your longer expressions are not all unique, you could get a list of all unique values ready to pop into PROC FORMAT:

Proc SQL ;
     /* one row per distinct long expression, laid out as a CNTLIN data set */
     Create table list1 as
          Select distinct long_expression as start
               ,     'xxxxxxxx' as label
               ,     'exprfmt' as fmtname
               ,     'C' as type
          from have
          order by 1
     ;
Quit ;

/* update the labels with numerical values */
data list2 ;
     set list1 ;
     label = put(_N_, z8.) ;
run ;

/* create the format */
Proc format cntlin = list2 ;
run ;

/* use the format to encode your data */
Data want ;
     set have ;
     encoded_value = put(long_expression, $exprfmt.) ;
     drop long_expression ;
run ;

/* now output your data */
/* also output start and label values to give you a lookup table for the original expression */

Richard
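
The final "output your data" step could look roughly like the sketch below. It is only an illustration: the data set demo_want, the variable amount, the fileref outfile and the column positions are all made-up stand-ins, and the temporary fileref would be replaced by a real path.

```sas
/* Hypothetical stand-in for the encoded data (WANT in the code above) */
data work.demo_want;
    input encoded_value $8. amount;
    datalines;
00000001 10
00000002 25
;
run;

/* Assumption: a temporary file; point this at a real path in practice */
filename outfile temp;

/* Write one fixed-width record per observation */
data _null_;
    set work.demo_want;
    file outfile;
    /* columns 1-8: encoded key, columns 10-15: zero-padded amount */
    put @1 encoded_value $8. @10 amount z6.;
run;
```

The lookup table (start and label from list2) could be written out the same way, so that whoever receives the file can map each code back to the original expression.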

Super User
Posts: 5,256

Re: Compressing a long string to a numeric value

Maybe you could play a little with the md5() function. Rendering the result with the $hex32. format creates a string that is always 32 positions (the MD5 digest is 16 bytes), so it depends on your input data whether you will benefit from this...?
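
A minimal sketch of that idea, assuming a one-way key is acceptable (a digest cannot be turned back into the original text). The variable names are made up for illustration, and cutting the digest down to an 8-digit number is an extra assumption that makes collisions between different strings much more likely than with the full 32 hex characters.

```sas
data _null_;
    length long_expression $ 200 digest_hex $ 32 short_hex $ 7;
    long_expression = 'the quick brown fox jumped over the lazy dog';

    /* md5() returns a 16-byte binary digest; $hex32. renders it as 32 hex characters */
    digest_hex = put(md5(strip(long_expression)), $hex32.);

    /* keep only the first 7 hex characters (28 bits) ... */
    short_hex = substr(digest_hex, 1, 7);

    /* ... read them as an integer and reduce it to at most 8 decimal digits */
    numeric_key = mod(input(short_hex, hex7.), 1e8);

    put digest_hex= / short_hex= / numeric_key= z8.;
run;
```

If the key has to be traceable back to the text, a lookup table like the one built with PROC FORMAT above is still needed alongside it.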

Data never sleeps
Frequent Contributor
Posts: 114

Re: Compressing a long string to a numeric value

You should take into consideration that each character (or string) on your computer is merely a representation of numbers at the data level.

High Level: Character/String

Data Level: Numbers with different lengths according to the Character encoding

Deep down in your system: lots of 0s and 1s, which correspond to a magnetic or non-magnetic state on your hard drive (the same logic, but different physics, applies to flash drives, etc.).

Let's say the letter 'a' is represented by '1', 'b' is represented by '2' and so on (you may want to read about Unicode or ASCII to find out the actual representations).

So the string that you see as 'abcde fgh' in fact is saved as '123450678'. (if we assume that '0' stands for blank)
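
To look at the actual codes rather than the simplified 1, 2, 3 above, a small sketch like the following could be used. It assumes an ASCII-based SAS session, where 'a' has code 97 and a blank has code 32; the variable names are only illustrative.

```sas
data _null_;
    text = 'abcde fgh';

    /* rank() returns the position of a character in the session's collating sequence */
    do i = 1 to length(text);
        ch = substr(text, i, 1);
        code = rank(ch);
        put ch= code=;
    end;

    /* the same 9 bytes shown as 18 hexadecimal digits */
    hex_view = put(text, $hex18.);
    put hex_view=;
run;
```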

I do not understand how it would be possible to reduce a number to an even shorter number without losing any information.

What you could try to do is reduce the length of the string by cutting off some parts at the end or you could produce your output and zip it.

Also, as suggested earlier in the thread, you might use formats if your strings are repetitive.

The choice of method depends on the task at hand.

Please respond and provide more details, if you need further assistance.

Cheers,

Michael

Occasional Contributor
Posts: 5

Re: Compressing a long string to a numeric value

Thanks all, but I think what I need is some kind of lossy compression algorithm, and nothing like this seems to exist in SAS at the variable level. Probably one for my spare time, if I ever get any...

Frequent Contributor
Posts: 114

Re: Compressing a long string to a numeric value

Well, you could still cut the string and thereby reduce it in a very lossy fashion.

On the other hand, if you are using SPDS for data storage, there is an option you could use to compress data sets.

One might not think that disk space would be such an issue these days. :-)

Occasional Contributor
Posts: 5

Re: Compressing a long string to a numeric value

Yes, of course truncation is an option!

But I'm looking for a more systematic loss of redundancies in the string, rather than just dropping whatever comes after the length limit, hence a compression algorithm suited to strings.

It's not about disk space actually, nor compression at the ds level.
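
One crude way to drop "redundant" characters before truncating, offered only as a sketch of the idea and not as a real compression algorithm: the COMPRESS function can strip chosen characters (here vowels and blanks, an arbitrary choice) so that more of the string survives the length limit. Different inputs can still collapse to the same short value.

```sas
data _null_;
    length text $ 44 abbreviated $ 8;
    text = 'the quick brown fox jumped over the lazy dog';

    /* remove vowels and blanks; the 'i' modifier ignores case */
    squeezed = compress(text, 'aeiou ', 'i');

    /* assigning to an 8-byte variable keeps the first 8 remaining characters */
    abbreviated = squeezed;

    put squeezed= / abbreviated=;
run;
```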

Frequent Contributor
Posts: 114

Re: Compressing a long string to a numeric value

I see. Well, you could still give Huffman coding a try, but I fear you would need to implement it yourself:

What is Huffman coding? | Tektronix

http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html

On the other hand, the COMPRESS= option in SAS seems to work along similar lines, but then again, I am not too familiar with the details.

SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition
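
For reference, a minimal sketch of that option. As far as the documentation goes, COMPRESS=CHAR (or YES) uses run-length encoding and COMPRESS=BINARY uses Ross Data Compression, and both act on how observations are stored in a SAS data set, not on the values written to an external file. The output data set name is made up; sashelp.class is the sample table shipped with SAS.

```sas
/* Store a copy of a sample table with run-length compression of observations */
data work.class_compressed(compress=char);
    set sashelp.class;
run;
```

The log reports how much the data set shrank (or grew, which can happen for small tables), while the variable values read back from it are unchanged.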

Let us know, if you have found a suitable solution :-)
