Dear all,
I am trying to find a quick way to check whether a dataset (with all its columns and observations) still looks the same after a change to the code.
Let's say a dataset with 1 million observations and one column that contains just an "a" in each observation.
Now I change something in my code (which in my case is some PRXPARSE action) and run the job again... I would expect the result to be the same.
How do I check that efficiently? How do I, for example, build a hash over everything a dataset contains... or a checksum?
Thanks a lot,
Thomas
Accepted Solutions
Since you are running the code anyway after your code change, something like this should work:
* first a little example data set with two variables;
data stuff;
  infile cards;
  input;
  var1 = scan(_INFILE_,1);
  var2 = scan(_INFILE_,2);
cards;
a b
a b
a c
a d
;
run;
And the subsequent step with the checksum code:
data stuff(drop=check all);
  format check $hex32.;
  retain check;
  set stuff end=eof;
  * if _n_ = 3 then var1 = 'x';
  all = cats(var1,var2,check);
  check = md5(all);
  if eof then put check=;
run;
First you create a variable called check with a format of $hex32. Then you retain it so its value is available in each iteration of the DATA step, and you set your data set with the end= option so we know when we've reached the end. The commented-out IF statement represents your current DATA step code (and uncommenting it should change the checksum).

At the bottom is the actual checksum code. First, concatenate all the variables you are interested in checking into a variable called all; notice that we include the check variable. Then use the md5() function to hash it. Finally, if it's the end of the data set, write the checksum to the log.

Run that once and note the checksum. Then uncomment the IF statement and you will see the checksum change. The checksum covers the whole data set, since each observation's checksum is built partly from the previous checksum.
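For readers outside SAS, the same chained-checksum idea can be sketched in Python (illustrative only; `hashlib.md5` stands in for the SAS md5() function, and `dataset_checksum` is a made-up helper name):

```python
import hashlib

def dataset_checksum(rows):
    """Chained MD5 over all rows: each observation's checksum folds in
    the previous one, so the final digest fingerprints the whole data set."""
    check = ""
    for row in rows:
        # concatenate this row's values plus the previous checksum ...
        all_vals = "".join(row) + check
        # ... and hash the result, as check = md5(all) does in the DATA step
        check = hashlib.md5(all_vals.encode()).hexdigest()
    return check

rows = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "d")]
print(dataset_checksum(rows))
```

Changing any single value in any row (the role the commented-out IF statement plays above) produces a completely different final checksum.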
Hope that helps! -stephen
Update: I found this, which is quite helpful:
http://www2.sas.com/proceedings/sugi27/p088-27.pdf
Nonetheless, is there a PROC CHECKSUM or something like it that would do the trick?
If you are only interested in whether the files are the same, your fastest check might be to pipe to your operating system's file-compare routine. On Windows that could be the fc command, for which you can include the /LB1 option, which (I think) will allow only one consecutive mismatch. Thus, if you don't receive the "no differences encountered" message, the files contain at least one difference.
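A minimal cross-language sketch of the same byte-for-byte check (Python's stdlib `filecmp` here, purely as an illustration of what a file-compare tool like fc answers; note it only says "are the files identical on disk?"):

```python
import filecmp
import tempfile

def files_identical(path_a, path_b):
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting size/mtime metadata from os.stat()
    return filecmp.cmp(path_a, path_b, shallow=False)

# tiny demonstration with two throwaway files holding the same bytes
with tempfile.NamedTemporaryFile(delete=False) as a, \
     tempfile.NamedTemporaryFile(delete=False) as b:
    a.write(b"a b\n")
    b.write(b"a b\n")
print(files_identical(a.name, b.name))  # identical bytes -> True
```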
The problem with an operating-system file compare on SAS datasets is that datasets include metadata, such as creation date and operating system, that is not really important when testing whether the information contained in the two files is the same.
What is changing in the code? Do you expect the new code to produce exactly the same results?
Do you have access to the old data? Then use PROC COMPARE.
The checksum concept is useful for testing when you do not have access to the original data, such as when receiving a file or communication from someone.
Tom speaks the truth on this one:
Tom wrote:
The problem with an operating-system file compare on SAS datasets is that datasets include metadata, such as creation date and operating system, that is not really important when testing whether the information contained in the two files is the same.
and I agree with his recommendation to use PROC COMPARE. I like sascoder's concept; I would not recommend using this approach for a table with many different data types, though for a simple table it is interesting:
data way1;
  informat v1-v2 $1.;
  input v1-v2;
  hash=put(md5(cats(of v1 v2)),$hex32.);
cards;
a b
c d
e f
g h
;
run;

data way2;
  input v1 $ v2 $;
  hash=put(md5(cats(v1,v2)),$hex32.);
cards;
a b
c d
e f
g h
;
run;

data _null_;
  if 0 then set way1;
  declare hash w1(dataset:'way1');
  w1.definekey('hash');
  w1.definedone();
  declare hiter w1iter('w1');

  declare hash w2(dataset:'way2');
  w2.definekey('hash');
  w2.definedone();

  do while(w1iter.next()=0);
    if w2.find() ne 0 then do;
      putlog 'ERROR: These datasets are different';
      bad+1;
    end;
  end;
  if bad=0 then put 'NOTE: These datasets are the same';
  stop;
run;
NOTE: These datasets are the same
I'm not sure why you would say it is not good for a table with many different data types (there are only two data types in SAS). And why the subsequent step creating hashes to compare? There is no need. In my code, the hash (hash in the sense of a unique fingerprint, not a hash like an associative array) at the end of the DATA step is unique to the entire data set, not just per observation. Again, notice that the hash of the previous observation is fed into the hashing of the current observation, resulting in a unique fingerprint for the entire data set, in one pass. Comparing this final value to the original on subsequent runs of the DATA step would tell you if _any_ of the observations changed. I believe this is what the original poster asked for.
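The distinction can be made concrete with a short Python sketch (illustrative; the function names are made up): a set of per-observation hashes, as in the hash-object comparison above, cannot see reordered rows, while the chained checksum can.

```python
import hashlib

def row_hashes(rows):
    # one independent fingerprint per observation (row order is lost)
    return {hashlib.md5("".join(r).encode()).hexdigest() for r in rows}

def chained_checksum(rows):
    # the previous checksum is folded into each observation's hash
    check = ""
    for r in rows:
        check = hashlib.md5(("".join(r) + check).encode()).hexdigest()
    return check

orig    = [("a", "b"), ("c", "d")]
swapped = [("c", "d"), ("a", "b")]
print(row_hashes(orig) == row_hashes(swapped))              # True: reorder invisible
print(chained_checksum(orig) == chained_checksum(swapped))  # False: reorder detected
```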
Hi all,
I just browsed through my old requests and totally forgot to thank you for your help! So here it is: thank you!
I have used this method for quite a while now; it works like a charm. Is there a way to cats() all variables of a dataset without listing each column? It gets quite heavy with >200 columns.
Thomas
You can put all your variables in a macro variable.
Thanks, but how does that work?
Note that the libname and memname values must be given in upper case ('WORK', 'HAVE'), because dictionary.columns stores them that way.
Example:
data have;
  input (a b c d)(:$2.);
cards;
aa bb cc dd
ab cd ef gr
;

proc sql noprint;
  select name into :vars separated by ','
  from dictionary.columns
  where libname='WORK' and memname='HAVE';
quit;

%put &vars;

data want;
  length newvar $ 20;
  set have;
  newvar=cats(&vars);
run;

proc print;
run;
Obs newvar a b c d
1 aabbccdd aa bb cc dd
2 abcdefgr ab cd ef gr
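The same trick, sketched in Python for comparison (illustrative only): pull the column names from the data's own metadata, here `csv.DictReader.fieldnames` playing the role of dictionary.columns, instead of hard-coding them.

```python
import csv
import io

# stand-in for the HAVE data set above
data = io.StringIO("a,b,c,d\naa,bb,cc,dd\nab,cd,ef,gr\n")
reader = csv.DictReader(data)

newvars = []
for row in reader:
    # reader.fieldnames lists every column, so nothing is named by hand,
    # just as the &vars macro variable is built from dictionary.columns
    newvars.append("".join(row[name] for name in reader.fieldnames))

print(newvars)  # ['aabbccdd', 'abcdefgr']
```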
To cats() all variables,
try cats( of _all_ )
Here is a SAS log snippet demonstrating:
219 data demo ;
220 input abc $ num dum $ datex : date11. ;
221 format datex yymmdd10. ;
222 list;datalines;
RULE:
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----
223 qwer 3456 dum 12DEc2012
224 mnbvc 9876543 dum 21dec12345
NOTE: The data set WORK.DEMO has 2 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
225 ;
226 data catstest( compress= yes);
227 set demo ;
228 length all $10000 ;
229 all = cats( of _all_ ) ;
230 put _all_ ;
231 run ;
abc=qwer num=3456 dum=dum datex=2012-12-12 all=qwer3456dum19339 _ERROR_=0 _N_=1
abc=mnbvc num=9876543 dum=dum datex=****-12-21 all=mnbvc9876543dum3793395 _ERROR_=0 _N_=2
NOTE: There were 2 observations read
Thank you Peter! - LInlin
If I run all = cats( of _all_ ); on a subset of a dataset, the code works perfectly fine. However, when I try to run it on an entire dataset (20,000+ records), the code fails with the error "An error occurred when executing the workspace job program.sas. Server is disconnected".
Why would this happen, and how do I get this to run on a large dataset?
I want to build MD5 checksums on large datasets without having to specify all the columns/attributes every time. I am looking for something generic.
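One generic, memory-safe pattern (a Python sketch; whether the server disconnect is really a resource problem is only a guess) is to stream the file and update a single MD5 incrementally, so no row or column list is ever held in full:

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Incremental MD5 over a file of any size: only one chunk
    (1 MB here) is in memory at a time, so record count doesn't matter."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

In SAS terms, the retained check variable in the accepted answer already works this way, one observation at a time, so memory use there stays constant regardless of row count.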