BookmarkSubscribeRSS Feed
Wolverine
Pyrite | Level 9

I have a variable called encrypted_id that has the following characteristics:

 

Type: Char

Length: 200

Format: $HEX64.

 

I had to export a file into Excel for manual review/editing, and then I imported back into SAS.  Now I need to match it with the original files.  But the encrypted_id  variable was changed during the export/import process.  It now has these characteristics and does not match the original variable:

 

Type: Char

Length: 66

Format: $66.

 

I tried this, but it didn't work and the log says "Invalid argument to function INPUT":

 

DATA temp.SCA_Atb_NPI_manual_review; SET temp.SCA_Atb_NPI_manual_review;

encrypted_id_hex = input(encrypted_id,$HEX64.); RUN;

8 REPLIES 8
Kurt_Bremser
Super User

If you have to deal with such issues, step #1 is to get rid of Excel files as means for data transfer between applications. Excel files are CRAP for that.

(Sorry if the word is offensive, but it's an apt description)

By not using xls, but a textual format, you open yourself up to use data steps for export and import, where you can take complete control over how such variables are written/read.

 

Let's get into details:

Type: Char
Length: 200
Format: $HEX64.

That's nonsense. Either you have up to 168 bytes of data that are never displayed, or you waste 168 bytes in every observation. The $hex64. format is meant for variables with a length of 32, so the variable should be defined with that length in the first place.

Type: Char
Length: 66
Format: $66.

That's what happens when proc import makes guesses; it sees something that is not a number, so it must be character, and then it takes the maximum encountered length to define the variable. Since your original data only contained 64 hex characters, there's a fair chance that Excel already mangled some of the data into unusable junk. Or some manipulation added data.

 

 

Wolverine
Pyrite | Level 9

Believe me, you're preaching to the choir.  I waste an inordinate amount of time trying to get SAS and Excel to work together.  But I'm basically in the middle of the flowchart -- the data files are produced upstream by someone else (who technically doesn't even work on this project but has access to the data warehouse where the info comes from).  I process the files, pulling out the records we're interested in, and then pass them along to the principal investigator.  To review, she needs to be able to easily display and sort, add comments and additional variables, etc.  Excel does that and it's something that we already have on our computers. 

 

The upstream person also created the encrypted_ID variable.  I don't know why it has a length of 200.  It displays with a length of 64 characters with no extra spaces.  Here is an example: 20850C1CF365857E237F9D83A9F5DCCD7525F99AB62987BB025C2430ED8B1E90

 

I also discovered that Excel inserts a line break for some reason.  So that same ID in Excel is displayed as

20850C1CF365857E237F9D83A9F5DCCD7525F99

AB62987BB025C2430ED8B1E90

even though the entire number is contained in the same cell.  But I suspect the line break may explain the extra 2 characters that give it a length of 66 instead of 64.

 

When I figured this out, it gave me an idea.  I simply created truncated versions of the IDs with a length of 40, and then used those to match.  I had to manually compare the full IDs to make sure there weren't 2 separate IDs that truncated to the same ID (there weren't).  It's ugly, but it worked.

 

Tom
Super User Tom
Super User

Again HOW is the data getting moved to Excel.  If you use ODS EXCEL and print the result then it is probably what is "wrapping" the 64 characters into two lines in the Excel cell.

 

You can just remove the linebreaks after you have the value back into SAS.

code=compress(code,'0D0A'x)

Then you can either convert it back into the 32 byte string.

text=input(code,$hex64.);
format text $hex64.;

Or convert the original value from the original 32 byte string into the 64 byte hex encoded string.

code=put(text,$hex64.);
Kurt_Bremser
Super User

The most probable reason for the variable length being 200 is that this somebody never heard of Maxim 47, and many string functions will create new variables with a default length of 200 if no length was specified. They also do not know that data modeling is the core of all programming (Maxim 33), so they do not pay attention. It is our job to set things straight and hammer the data into usable shape.

 

If you need to hand over and retrieve data to/from Excel, make sure that it is stored in the most simple form. Keys (most often UUIDs these days) need to be expanded to readable form, so that they are handled as strings by Excel. Upon return, you can then read the with a $HEX format to return them to the more efficient binary storage.

 

Whatever you do, always insist that spreadsheet data is stored in csv files when it needs to be imported into SAS. If you get xls or xlsx files, open them with Excel and store as csv yourself. Reading directly from Excel files is uncontrollable and unsustainable in the long run, as both proc import and libname excel have to rely on guesses.

Kurt_Bremser
Super User

PS I do have a similar issue here. The keys in our database are UUIDs, so they are stored in SAS as 16-byte character, with a $hex32. format attached.

I originally built two macros, one that converts the 36-character human-readable original string we get from DB/2 to this format, and another that recreates the 36-character string from the compressed 16-byte value. Since SAS has introduced the $UUID formats and informats in the meantime (the informat seems to be available from 9.4M6 on), these can be greatly simplified.

When I have to send data to non-SAS users, I always do the conversion to human-readable form and send the longer strings.

andreas_lds
Jade | Level 19

Difficult to say something useful without seeing the data. But this is the normal nightmare when Excel is used as a gui replacement. Excel does not enforce one data-type for each variable/column, so you will always have more or less interesting issues to solve.

 

The format could be changed by using the format-statement, but if the data was modified by one the wonderful automatisms in Excel, changing the format is useless. So please post data before and after it was processed by excel. For both cases, please post the data as data-step.

Tom
Super User Tom
Super User

Why do you have variable with a length of $200 but a format of $HEX64?  That doesn't make sense. Most likely the variable was created with a length of $200 because it first appeared in an a statement as the result of a function call.  

 

How did you transfer the data to an Excel file? Was the original (up to 200) character data transferred? Or was the 64 character hex string generated by the $HEX64 format transferred?

 

Where did the extra two characters come from on the re-import?  Did someone modify the data in Excel?  Why was the length determined to be 66 when the max length string that the $HEX64. format would generate would be 64?  Perhaps the raw values were transferred and returned?  In which case the original maximum length string in the data was 66 characters instead of the 32 characters that the $HEX64 format would display.  Again why is there a $HEX64. format attached to a variable that can hold more than 32 characters?

 

So assume that the data actually made the round trip into Excel with being modified you have one of two possibilities.

1) The original values are back and you just need to attach the $HEX format to see them the same we you saw them before. But since the maximum length seems to be 66 and not 32 you should use $HEX134. instead of $HEX64.

2) The formatted values got returned.  In which case you might want to use the $HEX INFORMAT to convert the values from 2 bytes strings back into a single byte characters.  So the hex code of '20' would be converting into a space and '4A' would be converted into 'J'.

 

ChrisNZ
Tourmaline | Level 20

I'll add that formatting in hex does not encrypt anything.

It is trivial to retrieve the original value.

Use functions MD5() or SHA256() to encrypt. 

To merge back to the original data, either merge by encrypted value, or keep a mapping table.

 

MD5 is 32-byte long when formatted in hex, so should avoid the Excel-inserts-garbage issue.

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 6507 views
  • 1 like
  • 5 in conversation