08-06-2015 01:15 PM
this one is a bit tricky to explain. Here is the original problem though:
I have a list of 10,000 observations. Each observation has a unique observation ID which is between 100 and 400 characters long:
|(more observation IDs...)||...|
This data set now gets transposed and the observation IDs become the column names. The problem is that the observation_id's get truncated to 32 characters, because that is the maximum length for column names supported by SAS 9.4:
One problem is that they may no longer be unique when truncated. They also can no longer be referenced by people querying the data. For instance, a person wants to be able to run a program that divides Observation 1 by Observation 2 and reference those two observations by their observation ID. This is no longer possible, because they would still enter the whole/original observation ID.
I thought about creating an md5 hash for each observation ID. It shortens them to 32 bytes. But is that a safe thing to do considering the source string is much longer than 32 bytes (up to 400 characters long)? Even if it is safe because of its very low collision probability, I wouldn't be able to work with those special characters MD5 generates. So I would have convert the hashed string using a hexadecimal format for instance. But wouldn't that tremendously increase the collision probability due to its low character range (I thinik 1-16 characters only)?
Thanks for your help!
08-06-2015 01:24 PM
Because the end users want to create simple files that hold formulae. For example: Observation 1 + Observation 2. I must load these formulae and execute them. This only works if the underlying observations are all in one row. Otherwise I have to use procedures and complex logic. Let's assume the design is a given.
08-06-2015 01:44 PM
Just number them.
data middle ;
set have ;
by observation_id ;
if first.observation_id then observation_number+1;
proc transpose data=middle out=want prefix=obs ;
id obervation_number ;
var value ;
08-06-2015 03:55 PM
Tom, how is the end user going to be able to reference the observation id if I converted it to a number? He only has the actual observation id (the name), not the number which I would be assigning dynamically according to your code. Ballardw, can I reference the label like I would reference the column name when aggregating variables? I think the maximum label length is 256 characters, right? That'd still be too short.