DATA Step, Macro, Functions and more

Shorten long string, but ensure it remains unique

Reply
Contributor
Posts: 39

Shorten long string, but ensure it remains unique

Hello,

this one is a bit tricky to explain. Here is the original problem though:

I have a list of 10,000 observations. Each observation has a unique observation ID which is between 100 and 400 characters long:

OBSERVATION_IDVALUE
xxxxxxxCHARACTERS100100
xxxxxxxxxxCHARACTERS30050

xxxxxxxxxxxCHARACTERS400

25
(more observation IDs...)...

This data set now gets transposed and the observation IDs become the column names. The problem is that the observation_id's get truncated to 32 characters, because that is the maximum length for column names supported by SAS 9.4:

xxxxxxxCHARACTER100

xxxxxxxxxxCHARACTER3

xxxxxxxxxxxCHARACTER

1005025

One problem is that they may no longer be unique when truncated. They also can no longer be referenced by people querying the data. For instance, a person wants to be able to run a program that divides Observation 1 by Observation 2 and reference those two observations by their observation ID. This is no longer possible, because they would still enter the whole/original observation ID.

Possible Solution:

I thought about creating an md5 hash for each observation ID. It shortens them to 32 bytes. But is that a safe thing to do considering the source string is much longer than 32 bytes (up to 400 characters long)? Even if it is safe because of its very low collision probability, I wouldn't be able to work with those special characters MD5 generates. So I would have convert the hashed string using a hexadecimal format for instance. But wouldn't that tremendously increase the collision probability due to its low character range (I thinik 1-16 characters only)?

Thanks for your help!

Regards

Phil

Contributor
Posts: 53

Re: Shorten long string, but ensure it remains unique

Why do you need to convert this dataset from a long to a wide set?

Contributor
Posts: 39

Re: Shorten long string, but ensure it remains unique

Because the end users want to create simple files that hold formulae. For example: Observation 1 + Observation 2. I must load these formulae and execute them. This only works if the underlying observations are all in one row. Otherwise I have to use procedures and complex logic. Let's assume the design is a given.

Super User
Super User
Posts: 6,498

Re: Shorten long string, but ensure it remains unique

Just number them.

data middle ;

  set have ;

  by observation_id ;

  if first.observation_id then observation_number+1;

run;

proc transpose data=middle out=want prefix=obs ;

  id obervation_number ;

  var value ;

run;

Super User
Posts: 10,474

Re: Shorten long string, but ensure it remains unique

And assign the long text to the LABEL.

Contributor
Posts: 39

Re: Shorten long string, but ensure it remains unique

Tom, how is the end user going to be able to reference the observation id if I converted it to a number? He only has the actual observation id (the name), not the number which I would be assigning dynamically according to your code. Ballardw, can I reference the label like I would reference the column name when aggregating variables? I think the maximum label length is 256 characters, right? That'd still be too short.

Super User
Super User
Posts: 6,498

Re: Shorten long string, but ensure it remains unique

How were they going to reference the long strings?

Why exactly are you transposing the data in the first place?

Super User
Posts: 17,771

Re: Shorten long string, but ensure it remains unique

You're expecting a customer to reference a 256+ long string in formula's and not make mistakes?

Build/supply a lookup table to them.

Ask a Question
Discussion stats
  • 7 replies
  • 449 views
  • 1 like
  • 5 in conversation