Hi all I'm a SAS newbie and need some help with the following issue. I'm experimenting with the md5 hash function and testing it on the sashelp.shoes dataset.
In a datastep, for each input observation, I'm creating a new variable that is a concatenation of each field using catx with comma as the field separator. I then use that variable as input to the md5 function and sure enough I get a hash value back which I print out to the log using PUT. I repeated this for each input line. However I then exported the shoes dataset to a CSV text file. The export enclosed the fields in double quotes and included $ and commas for the currency fields which I removed manually from the CSV file. So visually the the lines of the CSV looked exactly like the lines output from SAS. Next I wrote a little python script to read each line of the CSV text file and calculate a md5 hash for that. Unfortunately none of the hashes for the CSV file matched the hashes from SAS. Has anybody done something similar and if so can you tell me where I'm going wrong. I know the python code is correct as I checked the results using the built-in md5 checker in windows.
When I get into work again tomorrow I'll post some of the code I'm using if that helps. Meanwhile if anyone can help that would be appreciated
I'm using enterprise guide V4.2 I think
Simplify and ensure that SAS and Python MD5 functions return the same results (AFAIK they should but you seem to be wanting to verify that):
data test;
set sashelp.class (keep=name);
hash1=md5(name);
hash2=md5(strip(name));
hash1x=put(hash1,hex32.);
hash2x=put(hash2,hex32.);
run;
Also note that SAS variables are padded with spaces to the length of the variable. Note the difference between md5(var) vs. md5(strip(var)).
Get SAS and Python to generate the same MD5 result for "Alfred", "Alice", "Bob", etc, then add complexity (concatenated columns) to the mix.
I don't know if this will help with your export issues? https://github.com/scottbass/SAS/blob/master/Macro/export.sas
Simplify and ensure that SAS and Python MD5 functions return the same results (AFAIK they should but you seem to be wanting to verify that):
data test;
set sashelp.class (keep=name);
hash1=md5(name);
hash2=md5(strip(name));
hash1x=put(hash1,hex32.);
hash2x=put(hash2,hex32.);
run;
Also note that SAS variables are padded with spaces to the length of the variable. Note the difference between md5(var) vs. md5(strip(var)).
Get SAS and Python to generate the same MD5 result for "Alfred", "Alice", "Bob", etc, then add complexity (concatenated columns) to the mix.
I don't know if this will help with your export issues? https://github.com/scottbass/SAS/blob/master/Macro/export.sas
As promised , here is the SAS code I'm using and the first few output records I'm getting.
/* concatenate all fields of a dataset and compute a checksum */
proc sql;
select name into :varstr2 separated by ','
from dictionary.columns
where libname = "SASHELP" and
memname = "SHOES";
quit;
data stuff(drop=check all);
format check $hex32.;
set sashelp.shoes end=end1;
newvar2 = catx(',',&varstr2);
all = catx(',',&varstr2);
check = md5(all);
put all;
put check;
run;
Africa,Boot,Addis Ababa,12,29761,191821,769
0F7503F59119E8248D89ED645F886871
Africa,Men's Casual,Addis Ababa,4,67242,118036,2284
8066D31E7C2A254EAB127C121B526DF7
Africa,Men's Dress,Addis Ababa,7,76793,136273,2433
653E4A1DF8B5708DF9C8B97587A1E981
Africa,Sandal,Addis Ababa,10,62819,204284,1861
D59E63E5319B4E3018F28D46A4CED9F9
Africa,Slipper,Addis Ababa,14,68641,279795,1771
1612FC1FE23B55078B7693ECE1E6D028
Now here is the python code and the same output records I'm getting for that:
import hashlib
filename = "f:/test/shoes.csv"
md5_hash = hashlib.md5()
with open(filename,"r") as f:
for x in f:
result=hashlib.md5(x.encode('utf-8'))
print (x)
print(result.hexdigest())
Africa,Boot,Addis Ababa,12,29761,191821,769
7001aaebd146b10aaed951cb692c6c4b
Africa,Men's Casual,Addis Ababa,4,67242,118036,2284
916a0c39554b70d691d03c71e8daa763
Africa,Men's Dress,Addis Ababa,7,76793,136273,2433
ea9e85e9843d3bb02206bc0ba7c3d5d4
Africa,Sandal,Addis Ababa,10,62819,204284,1861
5865cfc5d443b5a2e0038c573b5b6fb9
Africa,Slipper,Addis Ababa,14,68641,279795,1771
0226115fb928f326044ca43e186ae23a
Update. I was thinking it might be something to do with newlines/linefeeds on the python side so changed my code to just look at the first input string in isolation.
import hashlib
x="Africa,Boot,Addis Ababa,12,29761,191821,769"
md5_hash = hashlib.md5()
result=hashlib.md5(x.encode('utf-8'))
print (x)
print(result.hexdigest())
Africa,Boot,Addis Ababa,12,29761,191821,769
65d38fa13c098fc3959b1eb0c19b0427
Hmmm, still doesn't match with the SAS version
Just out of interest, does anyone have code that would calculate the md5 hash for a whole - potentially very large , dataset ?
Since I can calculate the hash for each record I wondered if I could just hash the hashes as it were ?
@taupirho wrote:
Yes, even though I was using catx which is supposed to strip spaces from beginning/end of strings, you still need that "strip" in the call to md5 function. Many thanks for your input.
catx remove leading and trailing blanks of the arguments you pass to function, but it will not strip trailing blanks from the value returned by the function - simply because all variables are padded with blanks.
From the docs:
In a DATA step, if the CATX function returns a value to a variable that has not previously been assigned a length, that variable is given a length of 200 bytes.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.