BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
thomash123
Calcite | Level 5

Dear all,

I am trying to find a quick way to check if a dataset (with all its columns and observations) is still looking the same as after a change to the code.

Let's say a dataset with 1mio observations and 1 row contains just an "a" in each observation.

Now I change something to my code (which in my case is some PRXPARSE action) and run the job again...I would expect that the result is the same.

How do I check that efficiently? How do I e.g. build a hash over everything a dataset contains...or a checksum?

Thanks a lot,

Thomas

1 ACCEPTED SOLUTION

Accepted Solutions
sasCoders_com
Calcite | Level 5

Since you are running the code anyways after your code change, something like this should work:

* first a little example data set with two variables;

data stuff;

infile cards;

input;

var1 = scan(_INFILE_,1);

var2 = scan(_INFILE_,2);

cards;

a b

a b

a c

a d

And the subsequent step with the checksum code;

data stuff(drop=check all);
  format check $hex32.;
  retain check;


  set stuff end=eof;


*  if _n_ = 3 then var1 = 'x';


  all = cats(var1,var2,check);
  check = md5(all);
  if eof then put check=;
run;

First you create a variable called check with a format of hex 32.  Then you retain it so it's value is available with each iteration of the data step.  Then set your data set with the end= option so we know when we've reached the end.  The commented out if statement represents your current data step code (and uncommenting it should change the checksum).  At the bottom is the actual checksum code.  First concatenate all the variables you are interested in checking into a variable called all.  Notice we are including the check variable.  Then use the md5() function to hash it.  Finally if it's the end of the data set then write the checksum to the log.  Run that once and note the checksum.  Then uncomment the if statement and you will see the checksum change.  The checksum is for all of the data set since each observation's checksum is built partly from the previous checksum.

Hope that helps! -stephen

View solution in original post

15 REPLIES 15
thomash123
Calcite | Level 5

Update: I found this, which is quite helpful:

http://www2.sas.com/proceedings/sugi27/p088-27.pdf

Nonetheless, is there a PROC CHECKSUM or something like it, that would do the trick?

art297
Opal | Level 21

If you are only interested in whether the files are the same, your fastest check might be to pipe your operating systems file compare routine.  In windows that could be the fc command for which you can include the /LB1 option which (I think) will only allow one consecutive missmatch.  Thus, if you don't receive the "no differences encountered" message, the files contain at least one difference.

Tom
Super User Tom
Super User

The problem with operating system file compare on SAS datasets is that datasets include metadata such as when it was created, operating system etc that is not really important in testing if the information contained in the two files are the same.

Tom
Super User Tom
Super User

What is changing in code? Do you expect that the new code will produce the exact same results?

Do you have access to the old data?  then use PROC COMPARE.

Checksum concept is useful to testing when you do not have access to the original data.  Such as when receiving a file or communication from someone.

sasCoders_com
Calcite | Level 5

Since you are running the code anyways after your code change, something like this should work:

* first a little example data set with two variables;

data stuff;

infile cards;

input;

var1 = scan(_INFILE_,1);

var2 = scan(_INFILE_,2);

cards;

a b

a b

a c

a d

And the subsequent step with the checksum code;

data stuff(drop=check all);
  format check $hex32.;
  retain check;


  set stuff end=eof;


*  if _n_ = 3 then var1 = 'x';


  all = cats(var1,var2,check);
  check = md5(all);
  if eof then put check=;
run;

First you create a variable called check with a format of hex 32.  Then you retain it so it's value is available with each iteration of the data step.  Then set your data set with the end= option so we know when we've reached the end.  The commented out if statement represents your current data step code (and uncommenting it should change the checksum).  At the bottom is the actual checksum code.  First concatenate all the variables you are interested in checking into a variable called all.  Notice we are including the check variable.  Then use the md5() function to hash it.  Finally if it's the end of the data set then write the checksum to the log.  Run that once and note the checksum.  Then uncomment the if statement and you will see the checksum change.  The checksum is for all of the data set since each observation's checksum is built partly from the previous checksum.

Hope that helps! -stephen

FriedEgg
SAS Employee

Tom speaks the truth on this one:

Tom wrote:

The problem with operating system file compare on SAS datasets is that datasets include metadata such as when it was created, operating system etc that is not really important in testing if the information contained in the two files are the same.

and I agree with his recommendation to use proc compare.  I like sascoders concept, I would not recommend using this approach for a table with many different data types though for a simple table it is interesting:

data way1;

informat v1-v2 $1.;

input v1-v2;

hash=put(md5(cats(of v1 v2)),$hex32.);

cards;

a b

c d

e f

g h

;

run;

data way2;

input v1 $ v2 $;

hash=put(md5(cats(v1,v2)),$hex32.);

cards;

a b

c d

e f

g h

;

run;

data _null_;

if 0 then set way1;

declare hash w1(dataset:'way1');

  w1.definekey('hash');

  w1.definedone();

declare hiter w1iter('w1');

declare hash w2(dataset:'way2');

  w2.definekey('hash');

  w2.definedone();

do while(w1iter.next()=0);

  if w2.find() ne 0 then

   do;

    putlog 'ERROR: These datasets are different';

          bad+1;

   end;

end;

if bad=0 then put 'NOTE: These datasets are the same';

stop;

run;

NOTE: These datasets are the same

sasCoders_com
Calcite | Level 5

I'm not sure why you would say it is not good for a table with many different data types? (Only two data types in SAS).  And why the subsequent step creating hashes to compare?  There is no need.  In my code, the hash (hash in the sense that it is a unique fingerprint, not a hash like associative array) at the end of the data step is unique to the entire data step, not just per observation.  Again, notice that the result of the hash of the previous observation is fed into the hashing of the current observation in my code.  Resulting in a unique hash fingerprint for the entire data step.  In one pass.  Comparing this final value to the original on subsequent runs of the data step would tell you if _any_ of the observations changed.  I believe this is what the original poster asked for.

thomash123
Calcite | Level 5

Hi all,

I just browsed through my old requests and I totally forgot to thank you for your help! So here it is: Thank you!

I have used this method for quite a while now - works like a charm. Is there a way to cats() all variables of a dataset without listing each column? Gets quite heavy with >200 columns.

Thomas

Linlin
Lapis Lazuli | Level 10

you can put all your variables in a macro variable.

thomash123
Calcite | Level 5

thanks, but how does that work?

Linlin
Lapis Lazuli | Level 10

RED parts using upcase

example:

data have;
input (a b c d)(:$2.);
cards;
aa bb cc dd
ab cd ef gr
;
proc sql noprint;
  select name into :vars separated by ',' from
    dictionary.columns
   where libname='WORK' and memname='HAVE';
quit;
%put &vars;
data want;
length newvar $ 20;
set have;
newvar=cats(&vars);
proc print;run;
                             Obs     newvar     a     b     c     d

                              1     aabbccdd    aa    bb    cc    dd
                              2     abcdefgr    ab    cd    ef    gr

Peter_C
Rhodochrosite | Level 12

to cat all variables

try cats( of _all_ )

Here is a saslog snippet demonstrating

219  data demo ;

220  input abc $ num dum $ datex : date11. ;

221  format datex yymmdd10. ;

222  list;datalines;

RULE:    
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----

223        qwer  3456 dum 12DEc2012

224        mnbvc  9876543 dum 21dec12345

NOTE: The data set WORK.DEMO has 2 observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

225  ;

226  data catstest( compress= yes);

227     set demo ;

228     length all $10000 ;

229     all = cats( of _all_ ) ;

230     put _all_ ;

231  run ;

abc=qwer num=3456 dum=dum datex=2012-12-12 all=qwer3456dum19339 _ERROR_=0 _N_=1

abc=mnbvc num=9876543 dum=dum datex=****-12-21 all=mnbvc9876543dum3793395 _ERROR_=0 _N_=2

NOTE: There were 2 observations read

Linlin
Lapis Lazuli | Level 10

Thank you Peter! - LInlin

marinavdl
Calcite | Level 5

if i run  all = cats( of _all_ ) ; on a subset of a dataset the code works perfectly fine, however, when I try to run it on an entire dataset (20 000+ records) the code fails with the error "An error occured when executing the workspace job program.sas.  Server is disconnected"

 

Why would this happen and how do I get to run this on a large dataset?

 

I want to build MD5 checksums on large datasets without having to specify all the columns/attributes everytime.  I am looking for something generic.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 13333 views
  • 8 likes
  • 9 in conversation