Dear all,
I am trying to find a quick way to check whether a dataset (with all its columns and observations) still looks the same after a change to the code.
Let's say a dataset with 1 million observations and one column that contains just an "a" in each observation.
Now I change something in my code (which in my case is some PRXPARSE action) and run the job again... I would expect the result to be the same.
How do I check that efficiently? How do I, for example, build a hash over everything a dataset contains... or a checksum?
Thanks a lot,
Thomas
Accepted Solutions
Since you are running the code anyway after your code change, something like this should work:
* first a little example data set with two variables;
data stuff;
  infile cards;
  input;
  var1 = scan(_INFILE_,1);
  var2 = scan(_INFILE_,2);
cards;
a b
a b
a c
a d
;
run;
And the subsequent step with the checksum code:
data stuff(drop=check all);
  format check $hex32.;
  retain check;
  set stuff end=eof;
  * if _n_ = 3 then var1 = 'x';
  all = cats(var1,var2,check);
  check = md5(all);
  if eof then put check=;
run;
First you create a variable called check with a format of $hex32. Then you retain it so its value is available in each iteration of the DATA step, and you set your data set with the end= option so we know when we've reached the end. The commented-out IF statement represents your current DATA step code (and uncommenting it should change the checksum).

At the bottom is the actual checksum code. First, concatenate all the variables you are interested in checking into a variable called all; notice that we include the check variable. Then use the md5() function to hash it. Finally, if it's the end of the data set, write the checksum to the log.

Run that once and note the checksum. Then uncomment the IF statement and you will see the checksum change. The checksum covers the whole data set, since each observation's checksum is built partly from the previous checksum.
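For readers outside SAS, the same chained-checksum idea can be sketched in Python (illustrative only; `hashlib.md5` stands in for the SAS md5() function, and `dataset_checksum` is a made-up helper name):

```python
import hashlib

def dataset_checksum(rows):
    """Chained MD5 over all rows: each observation's checksum folds in
    the previous one, so the final digest fingerprints the whole data set."""
    check = ""
    for row in rows:
        # concatenate this row's values plus the previous checksum ...
        all_vals = "".join(row) + check
        # ... and hash the result, as check = md5(all) does in the DATA step
        check = hashlib.md5(all_vals.encode()).hexdigest()
    return check

rows = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "d")]
print(dataset_checksum(rows))
```

Changing any single value in any row (the role the commented-out IF statement plays above) produces a completely different final checksum.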
Hope that helps! -stephen
Update: I found this, which is quite helpful:
http://www2.sas.com/proceedings/sugi27/p088-27.pdf
Nonetheless, is there a PROC CHECKSUM or something like it that would do the trick?
If you are only interested in whether the files are the same, your fastest check might be to pipe to your operating system's file-compare routine. On Windows that could be the fc command, for which you can include the /LB1 option, which (I think) will allow only one consecutive mismatch. Thus, if you don't receive the "no differences encountered" message, the files contain at least one difference.
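A minimal cross-language sketch of the same byte-for-byte check (Python's stdlib `filecmp` here, purely as an illustration of what a file-compare tool like fc answers; note it only says "are the files identical on disk?"):

```python
import filecmp
import tempfile

def files_identical(path_a, path_b):
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting size/mtime metadata from os.stat()
    return filecmp.cmp(path_a, path_b, shallow=False)

# tiny demonstration with two throwaway files holding the same bytes
with tempfile.NamedTemporaryFile(delete=False) as a, \
     tempfile.NamedTemporaryFile(delete=False) as b:
    a.write(b"a b\n")
    b.write(b"a b\n")
print(files_identical(a.name, b.name))  # identical bytes -> True
```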
The problem with an operating-system file compare on SAS datasets is that datasets include metadata, such as creation date and operating system, that is not really important when testing whether the information contained in the two files is the same.
What is changing in the code? Do you expect the new code to produce exactly the same results?
Do you have access to the old data? Then use PROC COMPARE.
The checksum concept is useful for testing when you do not have access to the original data, such as when receiving a file or communication from someone.
Tom speaks the truth on this one:
Tom wrote:
The problem with an operating-system file compare on SAS datasets is that datasets include metadata, such as creation date and operating system, that is not really important when testing whether the information contained in the two files is the same.
and I agree with his recommendation to use PROC COMPARE. I like sascoder's concept; I would not recommend using this approach for a table with many different data types, though for a simple table it is interesting:
data way1;
  informat v1-v2 $1.;
  input v1-v2;
  hash=put(md5(cats(of v1 v2)),$hex32.);
cards;
a b
c d
e f
g h
;
run;

data way2;
  input v1 $ v2 $;
  hash=put(md5(cats(v1,v2)),$hex32.);
cards;
a b
c d
e f
g h
;
run;

data _null_;
  if 0 then set way1;
  declare hash w1(dataset:'way1');
  w1.definekey('hash');
  w1.definedone();
  declare hiter w1iter('w1');

  declare hash w2(dataset:'way2');
  w2.definekey('hash');
  w2.definedone();

  do while(w1iter.next()=0);
    if w2.find() ne 0 then do;
      putlog 'ERROR: These datasets are different';
      bad+1;
    end;
  end;
  if bad=0 then put 'NOTE: These datasets are the same';
  stop;
run;
NOTE: These datasets are the same
I'm not sure why you would say it is not good for a table with many different data types (there are only two data types in SAS). And why the subsequent step creating hashes to compare? There is no need. In my code, the hash (hash in the sense of a unique fingerprint, not a hash like an associative array) at the end of the DATA step is unique to the entire data set, not just per observation. Again, notice that the hash of the previous observation is fed into the hashing of the current observation, resulting in a unique fingerprint for the entire data set, in one pass. Comparing this final value to the original on subsequent runs of the DATA step would tell you if _any_ of the observations changed. I believe this is what the original poster asked for.
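The distinction can be made concrete with a short Python sketch (illustrative; the function names are made up): a set of per-observation hashes, as in the hash-object comparison above, cannot see reordered rows, while the chained checksum can.

```python
import hashlib

def row_hashes(rows):
    # one independent fingerprint per observation (row order is lost)
    return {hashlib.md5("".join(r).encode()).hexdigest() for r in rows}

def chained_checksum(rows):
    # the previous checksum is folded into each observation's hash
    check = ""
    for r in rows:
        check = hashlib.md5(("".join(r) + check).encode()).hexdigest()
    return check

orig    = [("a", "b"), ("c", "d")]
swapped = [("c", "d"), ("a", "b")]
print(row_hashes(orig) == row_hashes(swapped))              # True: reorder invisible
print(chained_checksum(orig) == chained_checksum(swapped))  # False: reorder detected
```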
Hi all,
I just browsed through my old requests and totally forgot to thank you for your help! So here it is: thank you!
I have used this method for quite a while now; it works like a charm. Is there a way to cats() all variables of a dataset without listing each column? It gets quite heavy with >200 columns.
Thomas
You can put all your variables in a macro variable.
Thanks, but how does that work?
Note that the libname and memname values must be given in upper case ('WORK', 'HAVE'), because dictionary.columns stores them that way.
Example:
data have;
  input (a b c d)(:$2.);
cards;
aa bb cc dd
ab cd ef gr
;

proc sql noprint;
  select name into :vars separated by ','
  from dictionary.columns
  where libname='WORK' and memname='HAVE';
quit;

%put &vars;

data want;
  length newvar $ 20;
  set have;
  newvar=cats(&vars);
run;

proc print;
run;
Obs newvar a b c d
1 aabbccdd aa bb cc dd
2 abcdefgr ab cd ef gr
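The same trick, sketched in Python for comparison (illustrative only): pull the column names from the data's own metadata, here `csv.DictReader.fieldnames` playing the role of dictionary.columns, instead of hard-coding them.

```python
import csv
import io

# stand-in for the HAVE data set above
data = io.StringIO("a,b,c,d\naa,bb,cc,dd\nab,cd,ef,gr\n")
reader = csv.DictReader(data)

newvars = []
for row in reader:
    # reader.fieldnames lists every column, so nothing is named by hand,
    # just as the &vars macro variable is built from dictionary.columns
    newvars.append("".join(row[name] for name in reader.fieldnames))

print(newvars)  # ['aabbccdd', 'abcdefgr']
```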
To cats() all variables,
try cats( of _all_ )
Here is a SAS log snippet demonstrating:
219 data demo ;
220 input abc $ num dum $ datex : date11. ;
221 format datex yymmdd10. ;
222 list;datalines;
RULE:
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----
223 qwer 3456 dum 12DEc2012
224 mnbvc 9876543 dum 21dec12345
NOTE: The data set WORK.DEMO has 2 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
225 ;
226 data catstest( compress= yes);
227 set demo ;
228 length all $10000 ;
229 all = cats( of _all_ ) ;
230 put _all_ ;
231 run ;
abc=qwer num=3456 dum=dum datex=2012-12-12 all=qwer3456dum19339 _ERROR_=0 _N_=1
abc=mnbvc num=9876543 dum=dum datex=****-12-21 all=mnbvc9876543dum3793395 _ERROR_=0 _N_=2
NOTE: There were 2 observations read
Thank you Peter! - LInlin
If I run all = cats( of _all_ ); on a subset of a dataset, the code works perfectly fine. However, when I try to run it on an entire dataset (20,000+ records), the code fails with the error "An error occurred when executing the workspace job program.sas. Server is disconnected".
Why would this happen, and how do I get this to run on a large dataset?
I want to build MD5 checksums on large datasets without having to specify all the columns/attributes every time. I am looking for something generic.
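One generic, memory-safe pattern (a Python sketch; whether the server disconnect is really a resource problem is only a guess) is to stream the file and update a single MD5 incrementally, so no row or column list is ever held in full:

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Incremental MD5 over a file of any size: only one chunk
    (1 MB here) is in memory at a time, so record count doesn't matter."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

In SAS terms, the retained check variable in the accepted answer already works this way, one observation at a time, so memory use there stays constant regardless of row count.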