jbodart
Obsidian | Level 7

The SAS 9.4 (TS1M7) help for the hashing_file function provides the example below, where the fileref is assigned with specific options (recfm=f lrecl=5) that are not explained.

Are these options mandatory, and if so, why?

I tried changing to recfm = n (because I'm dealing with binary files) and omitting lrecl (because I don't know if it's needed and if it is, which value it should take), but then the data step seemed to run forever until I killed SAS.

In my experience, some versions of SAS (e.g. on Linux) have problems with long filenames, and using filerefs to access those files has generally been a good workaround. That's why I want to be able to use hashing_file with a fileref.

 

filename abc temp recfm=f lrecl=5; 
data _null_; 
   file abc; 
   put '3334353637'x    /*34567 in ASCII*/; 
run;

data _null_; 
   x = hashing_file('sha256','abc',4);
   put x=; 
run;
11 REPLIES
Patrick
Opal | Level 21

The sample code from the documentation that you shared does three things:

1. It defines a fileref for a temporary file (the physical file will have a generated name and be stored under WORK).

2. A data step writes some data to this temporary file.

3. A data step uses the hashing_file() function to create a hash value from this file.

 

The sample code references the file already via fileref:

Patrick_2-1673002253296.png

 

Patrick_1-1673002201205.png

 

 

jbodart
Obsidian | Level 7

Thanks for the reply, but my question is specifically about the recfm and lrecl values that should be used to use the function with any file (text or binary, of any size).  In this example, very specific values have been specified for these options, are these important (it seems recfm should not be set to 'n') and if yes how should they be set?

yabwon
Onyx | Level 15

Hi, 

 

In this particular example the code writes 5 bytes into a file:

put '3334353637'x

Here "recfm=f" means that the file you write to has a fixed record length, and "lrecl=5" sets that length to 5 bytes.
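To make the byte counts concrete outside SAS, here is a minimal Python sketch. It decodes the same hex literal and splits a byte string into fixed-length records. The helper name `fixed_records` is hypothetical, and this is only a simplified model of what "fixed record length" means, not a description of how SAS implements it.

```python
from typing import List


def fixed_records(data: bytes, lrecl: int) -> List[bytes]:
    """Split a byte string into consecutive records of lrecl bytes each."""
    return [data[i:i + lrecl] for i in range(0, len(data), lrecl)]


# The hex literal from the PUT statement decodes to five ASCII bytes.
payload = bytes.fromhex("3334353637")
print(payload, len(payload))

# With a record length of 5, those five bytes form exactly one record.
records = fixed_records(payload, 5)
print(records)
```

In this model a 5-byte payload with a record length of 5 is a single record with no terminator and no padding.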

 

 

For example, if you want to do a binary copy of a file then recfm=n and lrecl=1 is ok, because you want to copy it byte by byte:

filename org "/some/directory/binary.file" recfm=n lrecl=1;
filename copy "/other/directory/binary.file" recfm=n lrecl=1;

data _null_;
  rc = fcopy("org", "copy");
run;

 

In the example below, the same 5 bytes are written to three files defined with different options:

filename abc temp recfm=f lrecl=5; /*fixed length record*/ 
data _null_; 
   file abc; 
   put '3334353637'x    ; 
run;

filename efg temp; /* default are: RECFM=V,LRECL=32767 */
data _null_; 
   file efg; 
   put '3334353637'x; 
run;

filename hij temp recfm=f lrecl=10; /* bigger record length */
data _null_; 
   file hij; 
   put '3334353637'x; 
run;

data _null_; 
   x = hashing_file('sha256','abc',4);
   put x=;
   y = hashing_file('sha256','efg',4);
   put y=;
   z = hashing_file('sha256','hij',4);
   put z=; 

   t1=(z=x);
   t2=(x=y);

   put (t:) (=);
run;

the log says:

 

1
2    filename abc temp recfm=f lrecl=5; /*fixed length record*/
3    data _null_;
4       file abc;
5       put '3334353637'x    ;
6    run;

NOTE: The file ABC is:
      Filename=R:\_TD14472_YABWONL5P_\#LN00024,
      RECFM=F,LRECL=5,File Size (bytes)=0,
      Last Modified=06Jan2023:13:26:00,
      Create Time=06Jan2023:13:26:00

NOTE: 1 record was written to the file ABC.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


7
8    filename efg temp;
9    data _null_;
10      file efg;
11      put '3334353637'x;
12   run;

NOTE: The file EFG is:
      Filename=R:\_TD14472_YABWONL5P_\#LN00025,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=06Jan2023:13:26:00,
      Create Time=06Jan2023:13:26:00

NOTE: 1 record was written to the file EFG.
      The minimum record length was 5.
      The maximum record length was 5.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


13
14   filename hij temp recfm=f lrecl=10; /* bigger record length */
15   data _null_;
16      file hij;
17      put '3334353637'x;
18   run;

NOTE: The file HIJ is:
      Filename=R:\_TD14472_YABWONL5P_\#LN00026,
      RECFM=F,LRECL=10,File Size (bytes)=0,
      Last Modified=06Jan2023:13:26:00,
      Create Time=06Jan2023:13:26:00

NOTE: 1 record was written to the file HIJ.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


19
20   data _null_;
21      x = hashing_file('sha256','abc',4);
22      put x=;
23      y = hashing_file('sha256','efg',4);
24      put y=;
25      z = hashing_file('sha256','hij',4);
26      put z=;
27
28      t1=(z=x);
29      t2=(x=y);
30
31      put (t:) (=);
32   run;

x=831D606C1FF4CD3B74522885194FC0DED5F5BE1E043A6A46E59B08896F452657
y=831D606C1FF4CD3B74522885194FC0DED5F5BE1E043A6A46E59B08896F452657
z=FB405A7E053B65FC22B0F127F0374FD0C4CF1DDFB188DC7C6F02F56C13B3B246
t1=0 t2=1
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

So changing the record length causes SAS to "fill" the rest of the record with extra spaces in the HIJ file, while the content of ABC and EFG is the same and gives the same hash digest. Right? It is the same, right!? Let's see.

 

There is one "funny/scary" thing in the log above when you compare it to the following screenshot:

yabwon_0-1673011533431.png

It shows content and size of each file, ABC, EFG, and HIJ.

So it is 5, 7, and 10 bytes! But hey, wtf?!?!?!

We have 2 files of different size with two "different" content (EFG has "\r\n" characters at the end) and exactly the same digests:

x=831D606C1FF4CD3B74522885194FC0DED5F5BE1E043A6A46E59B08896F452657
y=831D606C1FF4CD3B74522885194FC0DED5F5BE1E043A6A46E59B08896F452657

?!?!?

 

Is there an error? Or maybe a bug in SAS? Well... no. It is all OK, though surprising at first sight.

 

The digests are the same because SAS is hashing the content of the referenced file, record by record, rather than its raw bytes.

For ABC it takes 5 bytes of the file and digests them with the hashing function. Since the file is declared as fixed record length, it takes only 5 bytes for each record.

For EFG it takes the content of the first line, up to the line end, which is 5 bytes in this case, and digests it.

That is why the results are the same.
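As an independent check outside SAS, the Python sketch below hashes the three on-disk byte layouts directly. The layouts are assumptions taken from this thread: 5 bytes for ABC, 5 bytes plus "\r\n" for EFG on Windows, and 5 bytes padded with spaces to 10 for HIJ.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the uppercase hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest().upper()


payload = bytes.fromhex("3334353637")   # the 5 bytes written by PUT

# Assumed on-disk layouts, as described in this thread.
abc_bytes = payload                      # RECFM=F LRECL=5: 5 bytes
efg_bytes = payload + b"\r\n"            # RECFM=V on Windows: 7 bytes
hij_bytes = payload.ljust(10, b" ")      # RECFM=F LRECL=10: 10 bytes

digests = {
    "abc": sha256_hex(abc_bytes),
    "efg": sha256_hex(efg_bytes),
    "hij": sha256_hex(hij_bytes),
}
for name, digest in digests.items():
    print(name, digest)
```

At the raw byte level all three digests differ. The identical x and y values in the log are therefore only consistent with the fileref-based call leaving the line terminator of EFG out of the hashed data.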

 

Now the next question is how to get a digest of the file itself, not just of its content.

Let's create the same file, but instead of TEMP let's use a regular file path, to make it easier to access:

filename efg "%sysfunc(pathname(work))/efg"; 
data _null_; 
   file efg; 
   put '3334353637'x; 
run;

So we have:

yabwon_1-1673013054357.png

 

First idea would be similar to the idea from the top of this post (the one about FCOPY()):

filename efg "%sysfunc(pathname(work))/efg" recfm=n lrecl=1;
data _null_; 
   y = hashing_file('sha256',"efg",4);
   put y=;
run;

But it is a BAD idea. It will only "hang" your SAS session, since it causes SAS to digest the content of an "infinite file" (a file without end).

 

When you want the hash digest of the whole file, you shouldn't use a fileref but a direct file path instead. "But I already have a very convenient fileref, why do I have to get the file path?", you say. That's not a problem: the "%sysfunc(pathname(<...>,F))" pair will do the job for you ("F" is for File). So all you need to do is:

data _null_; 
   y = hashing_file('sha256', "%sysfunc(pathname(efg,F))",0);
   put y=;
run;

 

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



jbodart
Obsidian | Level 7

Thanks Bart for the detailed answer and examples.

 

Can you explain why using recfm=n lrecl=1 results in digesting an "infinite file" (a file without end)?

 

Also, if I want to calculate the digest of a binary file that happens to contain line ending characters, those characters will be excluded from the checksum, correct?  So the checksum returned by hashing_file() won't match the checksum calculated by an external tool such as: 

certutil -hashfile /path/to/file sha256

?

yabwon
Onyx | Level 15

Hi 🙂

 

For the first question, frankly, I don't know, especially since the same options work fine with FCOPY():

1    filename a "%sysfunc(pathname(work,L))/a";
2    data _null_;
3      file a;
4      put "abc";
5    run;

NOTE: The file A is:
      Filename=R:\_TD22020_YABWONL5P_\a,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=06Jan2023:16:46:47,
      Create Time=06Jan2023:16:23:12

NOTE: 1 record was written to the file A.
      The minimum record length was 3.
      The maximum record length was 3.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


6
7    filename a "%sysfunc(pathname(work,L))/a" recfm=n lrecl=1;
8    filename b "%sysfunc(pathname(work,L))/b_is_copy_of_a" recfm=n lrecl=1;
9    data _null_;
10     rc = fcopy("A","B");
11     rctxt=sysmsg();
12   run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

it works perfectly.

Maybe, for the hashing_file with fileref, it has something to do with this error:

13   filename a "%sysfunc(pathname(work,L))/a" recfm=n lrecl=1;
14   data _null_;
15     infile a;
16     input x $1.;
17     put x hex2.;
18   run;

ERROR: The LRECL specified using the RECFM=N FILE/INFILE option must be greater than or equal to 256.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

? But that is just my guess.

 

About the second question, from my observations:

If you are using a fileref:

hashing_file('sha256','abc',4);

it digests the content of the file (so it may drop end-of-line characters). But when you use a direct file path:

hashing_file('sha256','/direct/path/to/the/file/abc.txt',0);

it will digest the file itself.
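To compare against an external reference, a whole-file digest can also be computed outside SAS. The sketch below is a minimal Python example: the function name `file_sha256` and the demo file are hypothetical, and the file content mirrors the 7-byte EFG file discussed above.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hash every byte of a file, reading it in binary mode in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty read that marks EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


# Demo on a throwaway file holding the payload plus a Windows line ending.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "efg")
    with open(demo_path, "wb") as handle:
        handle.write(b"34567\r\n")
    whole_file_digest = file_sha256(demo_path)
    print(whole_file_digest)
```

Because the file is opened in binary mode, line-ending bytes are part of the digest. The result should therefore be the one to compare, ignoring letter case, with what a tool such as certutil reports and with what the direct-path call returns.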

 

 

Bart

 

 

 




Tom
Super User

Why would you use %sysfunc() in a data step?

   y = hashing_file('sha256', pathname('efg'),0);

 

yabwon
Onyx | Level 15

You are 100% right, there is no point, just my bad habit... 😉

 

Bart




Patrick
Opal | Level 21

@jbodart wrote:

Thanks for the reply, but my question is specifically about the recfm and lrecl values that should be used to use the function with any file (text or binary, of any size).  In this example, very specific values have been specified for these options, are these important (it seems recfm should not be set to 'n') and if yes how should they be set?


Hi @jbodart 

I strongly assume that, for the sample code in the documentation, the only reason to define the record length and format explicitly is to ensure that the script always creates exactly the same file, byte for byte, so the result of the hashing_file() function is the same in any environment where it is run.

 

The hashing_file() function works for any type of file. I strongly assume the function reads the source as a binary data stream and always works on a binary level, which means any "character" in the file, line feeds included, will contribute to the result.

 

Example 1: SAS Table

libname out "c:\temp";
data out.class; 
   set sashelp.class;
run;
libname out clear;

data _null_; 
   x = hashing_file('sha256','c:\temp\class.sas7bdat',0);
   put x=; 
run;

 

Patrick_5-1673052501081.png

 

Example 2: Text files only differing in number of line feeds

Patrick_3-1673052096509.png    Patrick_2-1673052072117.png

data _null_; 
   x = hashing_file('sha256','c:\temp\test_file1.txt',0);
   put 'test_file1.txt:' x=; 
   x = hashing_file('sha256','c:\temp\test_file2.txt',0);
   put 'test_file2.txt:' x=; 
run;

Patrick_4-1673052157266.png

 

 

 

 

 

Tom
Super User

They used the RECFM=F and LRECL=5 options to make sure they created a binary file exactly 5 bytes long.

 

I doubt it has anything to do with using the HASHING_FILE() function.

 

If you want to have a discussion on the meaning of the RECFM and LRECL options of the FILE and INFILE statements then start a new thread.

jbodart
Obsidian | Level 7
Thanks Tom. Any idea why a fileref with RECFM=N seems to make HASHING_FILE() enter an infinite loop?
Tom
Super User

@jbodart wrote:
Thanks Tom. Any idea why a fileref with RECFM=N seems to make HASHING_FILE() enter an infinite loop?

Perhaps it cannot detect the end of the file?

 

If you actually have a physical file, you are probably better off passing the file name instead of the fileref.

 

You might also want to experiment more by trying filerefs defined with other filename engines, like ZIP or URL, and see whether it causes trouble for that hashing function.
