An Idea Exchange for SAS software and services

by Super User
on ‎10-31-2016 09:13 AM

Wouldn't this be an OS-level thing rather than a SAS thing?  Most of us have to store datasets, and there are various methods out there (database, dated folders, using a tool, etc.).  Generally I would be more concerned about the contents of the file than about how it's stored, so a simple loop over the datasets doing a proc compare would do (hard to say exactly, as I don't know how the data is stored). So long as that flags no problems, the file signature isn't really relevant.  I would imagine your difference in file signature comes from one of the table's properties being modified on copy: maybe a change of modification date, a change to a bit, or a change in coding.


Whilst a signature would be nice, wouldn't it create a cyclic problem, i.e. you're putting data into the file which contains the signature of the file, which you can't get until that data is embedded?  Hence why I would expect something like this to be done at the OS file system level rather than the SAS level.

by Frequent Contributor
on ‎10-31-2016 10:55 AM

I had exactly the same impulse as you: for me, initially, signatures like MD5 hashes were filesystem tools. That's why, foolishly, I calculated thousands of MD5 hashes on SAS dataset files in order to find duplicates between the tables. When I compared the MD5 results to the SAS ones (proc compare) on a sample of tables, however, this proved I was by and large mistaken: identical SAS datasets are stored as different files (*.sas7bdat), and I have no SAS hashing function, no per-dataset SAS signature, and no external OS-level tool to help me compare the tables in order to find copies ... hence my - rather unrealistic - proposal.
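The mismatch described above can be reproduced with a toy example. This is only a sketch under stated assumptions: `toy_file` is a made-up container with a creation date in its header, standing in for whatever metadata differs between two copies of a table; it is not the real *.sas7bdat layout. It shows why a hash of the file bytes and a hash of the logical rows can disagree.

```python
import hashlib

def file_md5(raw: bytes) -> str:
    """Hash the bytes exactly as stored on disk (what an OS-level tool sees)."""
    return hashlib.md5(raw).hexdigest()

def content_md5(rows) -> str:
    """Hash only the logical rows, in a fixed text form (hypothetical convention)."""
    h = hashlib.md5()
    for row in rows:
        # repr() keeps 13 and "13" distinct; \x1f separates fields, \x1e ends a row
        h.update("\x1f".join(repr(v) for v in row).encode("utf-8"))
        h.update(b"\x1e")
    return h.hexdigest()

def toy_file(rows, created: str) -> bytes:
    """Made-up container: a header carrying a creation date, then the rows."""
    header = f"created={created}\n".encode("utf-8")
    body = "\n".join(",".join(str(v) for v in row) for row in rows).encode("utf-8")
    return header + body

rows = [("Alice", 13, 56.5), ("Bob", 14, 62.0)]
copy_a = toy_file(rows, "2016-10-30")
copy_b = toy_file(rows, "2016-10-31")  # same table, copied a day later

same_file_hash = file_md5(copy_a) == file_md5(copy_b)
same_content_hash = content_md5(rows) == content_md5(list(rows))
print(same_file_hash, same_content_hash)
```

The file-level hashes differ because of the header alone, while the row-level hashes agree, which is the behaviour a "table signature" would need.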


I don't fully understand the "cyclic problem" you mention: storing a signature key in the very item it signs looks OK if:


1. you can ensure that any modification of the item (its content) automatically triggers the corresponding update of the signature key


2. the scope of the signature excludes the key itself, to prevent any self-inconsistency


Obviously, this key cannot be part of the content of the SAS table; it's only some kind of embedded metadata - optional (like the Sort flag) and stored with the other attributes in the descriptor portion.
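Condition 2 can be sketched in a few lines. Everything here is hypothetical: the dict layout, the attribute name `signature` and the choice of SHA-256 are illustration only, not a SAS facility. The signature is stored among the attributes but left out of the bytes being hashed, so there is no cycle. Condition 1 (automatic re-signing on every change) is not handled; in this sketch an edit simply makes the stored key stale.

```python
import hashlib
import json

SIG_ATTR = "signature"  # hypothetical attribute name

def compute_signature(dataset: dict) -> str:
    """Hash the rows and all attributes except the signature itself."""
    attrs = {k: v for k, v in dataset["attributes"].items() if k != SIG_ATTR}
    payload = json.dumps({"attributes": attrs, "rows": dataset["rows"]},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def sign(dataset: dict) -> None:
    """Store the signature next to the other attributes."""
    dataset["attributes"][SIG_ATTR] = compute_signature(dataset)

def is_valid(dataset: dict) -> bool:
    """True when the stored key still matches the current content."""
    return dataset["attributes"].get(SIG_ATTR) == compute_signature(dataset)

table = {
    "attributes": {"label": "class roster", "sorted_by": "name"},
    "rows": [["Alice", 13], ["Bob", 14]],
}

sign(table)
valid_after_signing = is_valid(table)

table["rows"][0][1] = 15          # content changes, key is not refreshed
valid_after_edit = is_valid(table)
print(valid_after_signing, valid_after_edit)
```

Because the key is excluded from its own scope, signing twice in a row yields the same value, and a stale key is detectable after any change to rows or attributes.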

by Community Manager
on ‎10-31-2016 11:07 AM

I like your idea of a "data set signature" -- the question for debate would be which attributes of the data set conspire to create a unique data set.  You could implement something right now, based on your own criteria, using SAS extended attributes (SAS 9.4 and later).

by Frequent Contributor
on ‎10-31-2016 11:52 AM

Yes, Chris, that's good advice. I was considering this new facility for my purpose and I was even playing with the idea of coding the tools I need from scratch, following this nice paper by Rick Langston:


But I am a mere mortal and that's out of my reach, unfortunately, so, out of laziness, I asked whether someone else maybe could.


The idea of a Digital Signature is complex to implement: over which 'space' (scope) do we compute the hash key? It's definitely not obvious, and the data portion of the dataset might not be the best candidate because it relies on page boundaries, which can vary from one copy to the next.


One alternative, perhaps more feasible albeit less obvious, would be to base the computation upon the underlying SAS code itself (theoretically, it's related to the "Kolmogorov complexity"), which can be standardized using this nice macro:


The Table Digital Signature could be defined as the hash key based on the strings of symbols strictly between the datalines4 statement and the 4 semicolons ending the data portion. One small problem thus becomes: how to regenerate this high-level SAS code and feed it into a hashing algorithm without taking too much time? ...
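Once the SAS code has been regenerated as text, the hashing step itself is small. This is a minimal sketch, assuming the code is already available as a string and each data line occupies one text line. The function name `datalines_signature`, the use of SHA-256 and the stripping of trailing blanks are choices made for this example, not part of the proposal.

```python
import hashlib

def datalines_signature(sas_code: str) -> str:
    """Hash only the lines strictly between 'datalines4;' and ';;;;'."""
    lines = sas_code.splitlines()
    # Locate the statement line and the four-semicolon terminator
    start = next(i for i, line in enumerate(lines)
                 if line.strip().lower() == "datalines4;")
    end = next(i for i in range(start + 1, len(lines))
               if lines[i].strip() == ";;;;")
    h = hashlib.sha256()
    for line in lines[start + 1:end]:
        # Assumption: trailing blanks are not significant
        h.update(line.rstrip().encode("utf-8") + b"\n")
    return h.hexdigest()

program_a = """data class;
  input name $ age;
datalines4;
Alice 13
Bob 14
;;;;
run;
"""

# Same data lines, different data set name and layout of the surrounding code
program_b = """data work.copy_of_class;
input name $ age;
DATALINES4;
Alice 13
Bob 14
;;;;
run;
"""

# One value differs
program_c = program_a.replace("Bob 14", "Bob 15")

sig_a = datalines_signature(program_a)
sig_b = datalines_signature(program_b)
sig_c = datalines_signature(program_c)
print(sig_a == sig_b, sig_a == sig_c)
```

Note that this scope covers only the values. Two tables with the same data lines but different variable names or types would get the same key unless the input statement is hashed as well.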


Another way to go, more arbitrary but perhaps more fruitful, might be to rely upon the compression algorithm: if the algorithm is very stable (meaning the compression ratio doesn't vary with the OS or even between runs), then the ratio itself, properly 'salted', might represent a unique index, assuming that two different strings of data having strictly the same compression ratio is a rather rare event ...
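The stability assumption is easy to probe on sample data. The sketch below uses Python's zlib as a stand-in compressor (an assumption; it is not the compression SAS applies to data sets) and leaves out the 'salting' step. An equal ratio is only a necessary condition for equal content, so it could at most act as a cheap pre-filter before a full proc compare.

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Compressed size divided by original size, using zlib as a stand-in."""
    if not data:
        raise ValueError("cannot compute a ratio for empty data")
    return len(zlib.compress(data, level)) / len(data)

rows_a = b"Alice,13\nBob,14\n" * 200
rows_b = b"Alice,13\nBob,15\n" * 200   # one digit differs in every second row

ratio_a = compression_ratio(rows_a)
ratio_a_again = compression_ratio(rows_a)   # same input, second run
ratio_b = compression_ratio(rows_b)

# Equal ratios would only make the two tables candidates for a full comparison
candidates_match = ratio_a == ratio_b
print(ratio_a, ratio_b, candidates_match)
```

Running this over a sample of real tables would show how often different data share the same ratio, which is the event the idea assumes to be rare.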
