DATA Step, Macro, Functions and more

Hashes, Checksums and all the rest...

Accepted Solution Solved
Reply
Contributor
Posts: 38
Accepted Solution

Hashes, Checksums and all the rest...

Dear all,

I am trying to find a quick way to check if a dataset (with all its columns and observations) is still looking the same as after a change to the code.

Let's say a dataset with 1mio observations and 1 row contains just an "a" in each observation.

Now I change something to my code (which in my case is some PRXPARSE action) and run the job again...I would expect that the result is the same.

How do I check that efficiently? How do I e.g. build a hash over everything a dataset contains...or a checksum?

Thanks a lot,

Thomas


Accepted Solutions
Solution
‎11-10-2011 04:46 PM
New Contributor
Posts: 4

Re: Hashes, Checksums and all the rest...

Since you are running the code anyways after your code change, something like this should work:

* first a little example data set with two variables;

data stuff;

infile cards;

input;

var1 = scan(_INFILE_,1);

var2 = scan(_INFILE_,2);

cards;

a b

a b

a c

a d

And the subsequent step with the checksum code;

data stuff(drop=check all);
  format check $hex32.;
  retain check;


  set stuff end=eof;


*  if _n_ = 3 then var1 = 'x';


  all = cats(var1,var2,check);
  check = md5(all);
  if eof then put check=;
run;

First you create a variable called check with a format of hex 32.  Then you retain it so it's value is available with each iteration of the data step.  Then set your data set with the end= option so we know when we've reached the end.  The commented out if statement represents your current data step code (and uncommenting it should change the checksum).  At the bottom is the actual checksum code.  First concatenate all the variables you are interested in checking into a variable called all.  Notice we are including the check variable.  Then use the md5() function to hash it.  Finally if it's the end of the data set then write the checksum to the log.  Run that once and note the checksum.  Then uncomment the if statement and you will see the checksum change.  The checksum is for all of the data set since each observation's checksum is built partly from the previous checksum.

Hope that helps! -stephen

View solution in original post


All Replies
Contributor
Posts: 38

Hashes, Checksums and all the rest...

Update: I found this, which is quite helpful:

http://www2.sas.com/proceedings/sugi27/p088-27.pdf

Nonetheless, is there a PROC CHECKSUM or something like it, that would do the trick?

PROC Star
Posts: 7,363

Hashes, Checksums and all the rest...

If you are only interested in whether the files are the same, your fastest check might be to pipe your operating systems file compare routine.  In windows that could be the fc command for which you can include the /LB1 option which (I think) will only allow one consecutive missmatch.  Thus, if you don't receive the "no differences encountered" message, the files contain at least one difference.

Super User
Super User
Posts: 6,502

Re: Hashes, Checksums and all the rest...

The problem with operating system file compare on SAS datasets is that datasets include metadata such as when it was created, operating system etc that is not really important in testing if the information contained in the two files are the same.

Super User
Super User
Posts: 6,502

Hashes, Checksums and all the rest...

What is changing in code? Do you expect that the new code will produce the exact same results?

Do you have access to the old data?  then use PROC COMPARE.

Checksum concept is useful to testing when you do not have access to the original data.  Such as when receiving a file or communication from someone.

Solution
‎11-10-2011 04:46 PM
New Contributor
Posts: 4

Re: Hashes, Checksums and all the rest...

Since you are running the code anyways after your code change, something like this should work:

* first a little example data set with two variables;

data stuff;

infile cards;

input;

var1 = scan(_INFILE_,1);

var2 = scan(_INFILE_,2);

cards;

a b

a b

a c

a d

And the subsequent step with the checksum code;

data stuff(drop=check all);
  format check $hex32.;
  retain check;


  set stuff end=eof;


*  if _n_ = 3 then var1 = 'x';


  all = cats(var1,var2,check);
  check = md5(all);
  if eof then put check=;
run;

First you create a variable called check with a format of hex 32.  Then you retain it so it's value is available with each iteration of the data step.  Then set your data set with the end= option so we know when we've reached the end.  The commented out if statement represents your current data step code (and uncommenting it should change the checksum).  At the bottom is the actual checksum code.  First concatenate all the variables you are interested in checking into a variable called all.  Notice we are including the check variable.  Then use the md5() function to hash it.  Finally if it's the end of the data set then write the checksum to the log.  Run that once and note the checksum.  Then uncomment the if statement and you will see the checksum change.  The checksum is for all of the data set since each observation's checksum is built partly from the previous checksum.

Hope that helps! -stephen

Trusted Advisor
Posts: 1,300

Hashes, Checksums and all the rest...

Tom speaks the truth on this one:

Tom wrote:

The problem with operating system file compare on SAS datasets is that datasets include metadata such as when it was created, operating system etc that is not really important in testing if the information contained in the two files are the same.

and I agree with his recommendation to use proc compare.  I like sascoders concept, I would not recommend using this approach for a table with many different data types though for a simple table it is interesting:

data way1;

informat v1-v2 $1.;

input v1-v2;

hash=put(md5(cats(of v1 v2)),$hex32.);

cards;

a b

c d

e f

g h

;

run;

data way2;

input v1 $ v2 $;

hash=put(md5(cats(v1,v2)),$hex32.);

cards;

a b

c d

e f

g h

;

run;

data _null_;

if 0 then set way1;

declare hash w1(dataset:'way1');

  w1.definekey('hash');

  w1.definedone();

declare hiter w1iter('w1');

declare hash w2(dataset:'way2');

  w2.definekey('hash');

  w2.definedone();

do while(w1iter.next()=0);

  if w2.find() ne 0 then

   do;

    putlog 'ERROR: These datasets are different';

          bad+1;

   end;

end;

if bad=0 then put 'NOTE: These datasets are the same';

stop;

run;

NOTE: These datasets are the same

New Contributor
Posts: 4

Re: Hashes, Checksums and all the rest...

I'm not sure why you would say it is not good for a table with many different data types? (Only two data types in SAS).  And why the subsequent step creating hashes to compare?  There is no need.  In my code, the hash (hash in the sense that it is a unique fingerprint, not a hash like associative array) at the end of the data step is unique to the entire data step, not just per observation.  Again, notice that the result of the hash of the previous observation is fed into the hashing of the current observation in my code.  Resulting in a unique hash fingerprint for the entire data step.  In one pass.  Comparing this final value to the original on subsequent runs of the data step would tell you if _any_ of the observations changed.  I believe this is what the original poster asked for.

Contributor
Posts: 38

Re: Hashes, Checksums and all the rest...

Hi all,

I just browsed through my old requests and I totally forgot to thank you for your help! So here it is: Thank you!

I have used this method for quite a while now - works like a charm. Is there a way to cats() all variables of a dataset without listing each column? Gets quite heavy with >200 columns.

Thomas

Super Contributor
Posts: 1,636

Re: Hashes, Checksums and all the rest...

you can put all your variables in a macro variable.

Contributor
Posts: 38

Re: Hashes, Checksums and all the rest...

thanks, but how does that work?

Super Contributor
Posts: 1,636

Re: Hashes, Checksums and all the rest...

RED parts using upcase

example:

data have;
input (a b c d)(:$2.);
cards;
aa bb cc dd
ab cd ef gr
;
proc sql noprint;
  select name into :vars separated by ',' from
    dictionary.columns
   where libname='WORK' and memname='HAVE';
quit;
%put &vars;
data want;
length newvar $ 20;
set have;
newvar=cats(&vars);
proc print;run;
                             Obs     newvar     a     b     c     d

                              1     aabbccdd    aa    bb    cc    dd
                              2     abcdefgr    ab    cd    ef    gr

Valued Guide
Posts: 2,175

Re: Hashes, Checksums and all the rest...

to cat all variables

try cats( of _all_ )

Here is a saslog snippet demonstrating

219  data demo ;

220  input abc $ num dum $ datex : date11. ;

221  format datex yymmdd10. ;

222  list;datalines;

RULE:    
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----

223        qwer  3456 dum 12DEc2012

224        mnbvc  9876543 dum 21dec12345

NOTE: The data set WORK.DEMO has 2 observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

225  ;

226  data catstest( compress= yes);

227     set demo ;

228     length all $10000 ;

229     all = cats( of _all_ ) ;

230     put _all_ ;

231  run ;

abc=qwer num=3456 dum=dum datex=2012-12-12 all=qwer3456dum19339 _ERROR_=0 _N_=1

abc=mnbvc num=9876543 dum=dum datex=****-12-21 all=mnbvc9876543dum3793395 _ERROR_=0 _N_=2

NOTE: There were 2 observations read

Super Contributor
Posts: 1,636

Re: Hashes, Checksums and all the rest...

Thank you Peter! - LInlin

New User
Posts: 1

Re: Hashes, Checksums and all the rest...

if i run  all = cats( of _all_ ) ; on a subset of a dataset the code works perfectly fine, however, when I try to run it on an entire dataset (20 000+ records) the code fails with the error "An error occured when executing the workspace job program.sas.  Server is disconnected"

 

Why would this happen and how do I get to run this on a large dataset?

 

I want to build MD5 checksums on large datasets without having to specify all the columns/attributes everytime.  I am looking for something generic.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 3381 views
  • 8 likes
  • 9 in conversation