Hello,
this one is a bit tricky to explain. Here is the original problem though:
I have a list of 10,000 observations. Each observation has a unique observation ID which is between 100 and 400 characters long:
OBSERVATION_ID | VALUE |
---|---|
xxxxxxxCHARACTERS100 | 100 |
xxxxxxxxxxCHARACTERS300 | 50 |
xxxxxxxxxxxCHARACTERS400 | 25 |
(more observation IDs...) | ... |
This data set now gets transposed and the observation IDs become the column names. The problem is that the observation_id's get truncated to 32 characters, because that is the maximum length for column names supported by SAS 9.4:
xxxxxxxCHARACTER100 | xxxxxxxxxxCHARACTER3 | xxxxxxxxxxxCHARACTER |
---|---|---|
100 | 50 | 25 |
One problem is that they may no longer be unique when truncated. They also can no longer be referenced by people querying the data. For instance, a person wants to be able to run a program that divides Observation 1 by Observation 2 and reference those two observations by their observation ID. This is no longer possible, because they would still enter the whole/original observation ID.
Possible Solution:
I thought about creating an md5 hash for each observation ID. It shortens them to 32 bytes. But is that a safe thing to do considering the source string is much longer than 32 bytes (up to 400 characters long)? Even if it is safe because of its very low collision probability, I wouldn't be able to work with those special characters MD5 generates. So I would have convert the hashed string using a hexadecimal format for instance. But wouldn't that tremendously increase the collision probability due to its low character range (I thinik 1-16 characters only)?
Thanks for your help!
Regards
Phil
Why do you need to convert this dataset from a long to a wide set?
Because the end users want to create simple files that hold formulae. For example: Observation 1 + Observation 2. I must load these formulae and execute them. This only works if the underlying observations are all in one row. Otherwise I have to use procedures and complex logic. Let's assume the design is a given.
Just number them.
data middle ;
set have ;
by observation_id ;
if first.observation_id then observation_number+1;
run;
proc transpose data=middle out=want prefix=obs ;
id obervation_number ;
var value ;
run;
And assign the long text to the LABEL.
Tom, how is the end user going to be able to reference the observation id if I converted it to a number? He only has the actual observation id (the name), not the number which I would be assigning dynamically according to your code. Ballardw, can I reference the label like I would reference the column name when aggregating variables? I think the maximum label length is 256 characters, right? That'd still be too short.
How were they going to reference the long strings?
Why exactly are you transposing the data in the first place?
You're expecting a customer to reference a 256+ long string in formula's and not make mistakes?
Build/supply a lookup table to them.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.