BookmarkSubscribeRSS Feed
PhilfromGermany
Fluorite | Level 6

Hello,

this one is a bit tricky to explain. Here is the original problem though:

I have a list of 10,000 observations. Each observation has a unique observation ID which is between 100 and 400 characters long:

OBSERVATION_IDVALUE
xxxxxxxCHARACTERS100100
xxxxxxxxxxCHARACTERS30050

xxxxxxxxxxxCHARACTERS400

25
(more observation IDs...)...

This data set now gets transposed and the observation IDs become the column names. The problem is that the observation_id's get truncated to 32 characters, because that is the maximum length for column names supported by SAS 9.4:

xxxxxxxCHARACTER100

xxxxxxxxxxCHARACTER3

xxxxxxxxxxxCHARACTER

1005025

One problem is that they may no longer be unique when truncated. They also can no longer be referenced by people querying the data. For instance, a person wants to be able to run a program that divides Observation 1 by Observation 2 and reference those two observations by their observation ID. This is no longer possible, because they would still enter the whole/original observation ID.

Possible Solution:

I thought about creating an md5 hash for each observation ID. It shortens them to 32 bytes. But is that a safe thing to do considering the source string is much longer than 32 bytes (up to 400 characters long)? Even if it is safe because of its very low collision probability, I wouldn't be able to work with those special characters MD5 generates. So I would have convert the hashed string using a hexadecimal format for instance. But wouldn't that tremendously increase the collision probability due to its low character range (I thinik 1-16 characters only)?

Thanks for your help!

Regards

Phil

7 REPLIES 7
dsbihill
Obsidian | Level 7

Why do you need to convert this dataset from a long to a wide set?

PhilfromGermany
Fluorite | Level 6

Because the end users want to create simple files that hold formulae. For example: Observation 1 + Observation 2. I must load these formulae and execute them. This only works if the underlying observations are all in one row. Otherwise I have to use procedures and complex logic. Let's assume the design is a given.

Tom
Super User Tom
Super User

Just number them.

data middle ;

  set have ;

  by observation_id ;

  if first.observation_id then observation_number+1;

run;

proc transpose data=middle out=want prefix=obs ;

  id obervation_number ;

  var value ;

run;

ballardw
Super User

And assign the long text to the LABEL.

PhilfromGermany
Fluorite | Level 6

Tom, how is the end user going to be able to reference the observation id if I converted it to a number? He only has the actual observation id (the name), not the number which I would be assigning dynamically according to your code. Ballardw, can I reference the label like I would reference the column name when aggregating variables? I think the maximum label length is 256 characters, right? That'd still be too short.

Tom
Super User Tom
Super User

How were they going to reference the long strings?

Why exactly are you transposing the data in the first place?

Reeza
Super User

You're expecting a customer to reference a 256+ long string in formula's and not make mistakes?

Build/supply a lookup table to them.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3068 views
  • 1 like
  • 5 in conversation