
Hi Everyone,

 

I am currently developing some basic routines to find multiple copies (some would say duplicates) between SAS datasets, not within them.

Here is one possible blueprint for solving this issue.

 

Problem situation: cloning is not universal

 

We have tens of thousands of SAS tables (V9), many of which are candidates for one master plus n copies (usually 1 master + 1 copy, sometimes 2 or more copies). As far as I know, the tables were copied using Proc Copy and, unfortunately, not with any system tool (cp, copy), which would have saved us, the admins, precious time. Alas, Proc Copy, even with the clone option, does not produce a bit-perfect copy. Try it for yourself: a SAS table copied with Proc Copy is not like a file cloned at filesystem level:

 

=> the MD5 hash keys of the two *.sas7bdat files are different even when Proc Compare reports exact identity between the 2 tables.

 

A SAS dataset is cloned with reference to its content 'SAS-wise' (descriptor, i.e. attribute values, plus data), whereas a file is cloned with respect to its sequential bits at block level, 'storage-wise'. I suspect the page boundaries defined at SAS level may not be exactly equal between the two copies, but that does not matter to the SAS engine (V9).

Even when the two data portions are exactly equal, if some table attributes differ in any way (e.g. a variable name, label, format or informat), then obviously we should declare the two copies unequal.
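For illustration, a minimal check along these lines (the library references lib1/lib2 and the member name mytable are placeholders) reproduces the discrepancy: an OS-level md5sum of the two *.sas7bdat files returns two different digests, while the SAS-level comparison below reports an exact match.

proc compare base=lib1.mytable compare=lib2.mytable listall;
run;

/* &SYSINFO = 0 means PROC COMPARE found the two tables exactly equal   */
/* (values and compared attributes), even though the two physical files */
/* hash differently at OS level.                                        */
%put NOTE: PROC COMPARE return code (SYSINFO) = &sysinfo.;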

 

New tool: an optional Table Digital Signature

 

Considering the lack of tools at my disposal, I suggest providing a new feature for SAS datasets (V9, CAS):

 

Let's call it, for instance, a Table Digital Signature (TDS). A question of definition immediately arises:

 

- is this new key meant to identify unique tables comprising Descriptor + Observations?

- or Observations only?

 

The latter looks more useful to me in practice.

 

How - and When - to compute the TDS

 

The TDS would be based on some hashing algorithm mapping a sequence of n bits { bit, bit, ..., bit } to one hash key, such as MD5 or SHA-1, or on some probabilistic structure such as a Bloom filter, so that the extra computation time scales appropriately ('linearly'): a TDS whose cost grows like O(n^2), or at any rate clearly faster than linearly, would be utterly unusable in practice on large datasets.

 

Usually, the SAS engine reads a SAS dataset sequentially from top to bottom (the OBS= data set option being one exception), and random or direct access (a SET statement with POINT=) is the exception rather than the rule: why not use this complete pass through the data to compute the TDS with an efficient single-pass (multi-threaded) algorithm?
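As a rough sketch of such a single pass at DATA step level (work.mytable is a placeholder, and this is only an illustration, not the proposed engine-level feature), every observation can be folded into a running digest while the table is being read anyway:

/* Step 1: a view that serialises each observation and hashes it per row.     */
/* _row_txt and _row_md5 are still blank when CATX is evaluated, so they do   */
/* not contaminate the serialisation. Note that CATX skips missing values, so */
/* a real implementation would need a stricter serialisation scheme.          */
data _rowsig / view=_rowsig;
   set work.mytable;
   length _row_txt $32767 _row_md5 $32;
   _row_txt = catx('|', of _all_);
   _row_md5 = put(md5(_row_txt), $hex32.);
   keep _row_md5;
run;

/* Step 2: chain the per-row hashes into one running, order-sensitive digest. */
data _null_;
   set _rowsig end=last;
   length _tds $32;
   retain _tds;
   _tds = put(md5(cats(_tds, _row_md5)), $hex32.);
   if last then put "Table Digital Signature (observations only): " _tds;
run;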

 

Another idea for allocating the TDS computation time adequately might be to fold it into the compress/uncompress time (if the data set is compressed, of course): in a perfect world, this extra stage would cost only a marginal fraction of the elapsed time.

 

Major difficulty: how to reconcile the TDS with partial modifications?

 

Much like the cost of maintaining an index, any partial (in-place) modification of the dataset:

 

- Proc Datasets + Modify

- Data + If + Output

- Proc Append

etc.

 

would require a recalculation of the TDS, thereby adding some non-marginal extra time to the modification even when the modified part is small compared to the overall table size. Is there some associative hashing function already available to take care of this?

By 'associative' here (the exact technical term may differ), I mean some strong property like:

 

Hash(Big Part of table A) + Hash(Small Part of table A) = Hash(full table A)  

 

See https://www.quora.com/Are-there-any-strong-associative-hash-mixing-functions
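For the record, a cheap (and deliberately collision-prone) way to obtain this behaviour is to make the per-row hashes combine by modular addition, which is associative and commutative, so appending observations only requires hashing the new rows and adding their contribution. A minimal sketch, assuming a table with two placeholder variables id and amount:

%macro additive_tds(ds=);
   /* Hash each observation, keep the first 12 hex digits as an integer,    */
   /* and accumulate them modulo 2**48 (still exactly representable in a    */
   /* SAS numeric). The partial signature is written to the log.            */
   data _null_;
      set &ds end=last;
      length _row_md5 $32;
      retain _sum 0;
      _row_md5 = put(md5(catx('|', id, amount)), $hex32.); /* placeholder variables */
      _sum = mod(_sum + input(substr(_row_md5, 1, 12), hex12.), 2**48);
      if last then put "Additive TDS of &ds: " _sum hex12.;
   run;
%mend additive_tds;

%additive_tds(ds=work.master);      /* signature of the existing table      */
%additive_tds(ds=work.new_rows);    /* signature of the rows to be appended */
/* After a PROC APPEND, the new signature is simply                          */
/* mod(TDS(master) + TDS(new_rows), 2**48), with no full re-read required.   */

MD5 itself is not associative, so a collision-resistant version of this would need a purpose-built construction (e.g. a set or homomorphic hash), but the sketch shows the property the question is after.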

 

How to store and retrieve the TDS value

 

Like the Sort flag, the TDS could be stored ('embedded') in the table's descriptor portion and displayed by Proc Contents and the dictionary views (Vtables) when available.

 

How to enable/disable the optional feature

 

Of course, there should be some system option to enable/disable the TDS calculation at session, library or table level.

 

Afterthought: clone signatures aren't perfect

 

A minor trouble with this kind of feature would be the legal status of the TDS calculation: if the probability of false positive (or negative) errors is in theory greater than zero (e.g. a non-zero probability of hash collisions), then the TDS could only be used as some kind of partial (non-conclusive) evidence. Taking this into account, I am still quite confident it would definitely improve our SAS usage no matter what! 🙂

 

If you like the idea, thanks for your support.

4 Comments
RW9
Diamond | Level 26

Wouldn't this be an OS-level thing rather than a SAS thing?  Just thinking that most of us have to store datasets, and there are various methods out there (database, dated folders, using a tool etc.).  Generally I would be more concerned about the contents of the file rather than how it's stored, so a simple loop over the datasets, doing a proc compare - hard to say exactly as I don't know how the data is stored - so long as that flags no problems then it's not really relevant.  I would imagine your difference in file signature comes from one of the table's properties being modified on copy out, maybe a change of modification date, a change at bit level, or a change in encoding.

 

Whilst a signature would be nice, wouldn't it create a cyclic problem, i.e. you're putting data into the file which contains the signature of the file, which you can't get until that is embedded?  Hence why I would assume something like this would be done at the OS file system level rather than the SAS level.

ronan
Lapis Lazuli | Level 10

I had exactly the same impulse as you: for me, signatures like MD5 hashes were initially filesystem tools. That's why, foolishly, I computed thousands of MD5 hashes on SAS datasets in order to find duplicates between the tables. When I compared the MD5 results with the SAS ones (proc compare) on a sample of tables, however, it turned out I was by and large mistaken: the same SAS datasets are stored as different files (*.sas7bdat), and I have no SAS hashing function, no SAS per-table signature, nor any external tool at OS level to help me compare the tables in order to find copies... hence my - rather unrealistic - proposal.

 

I don't fully understand the "cyclic problem" you mention: storing a signature key inside the very item it is signing looks fine if

 

1. you can ensure that any modification of the item (its content) automatically synchronises with the corresponding change of signature key

 

2. the scope of the signature excludes the key itself, to prevent any self-inconsistency.

 

Obviously, this key cannot be part of the content of the SAS table; it is only some kind of embedded metadata - optional (like the Sort flag) and stored with the other attributes in the descriptor portion.

ChrisHemedinger
Community Manager

I like your idea of a "data set signature" -- the question for debate would be which attributes of the data set conspire to create a unique data set.  You could implement something right now, based on your own criteria, using SAS extended attributes (SAS 9.4 and later).
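For example, assuming you have already computed a signature value into a macro variable &TDS using whatever criteria you settle on, a rough sketch could look like this (the attribute name TDS_MD5 and the table work.mytable are just placeholders):

/* Attach the home-made signature to the data set as an extended attribute */
proc datasets lib=work nolist;
   modify mytable;
   xattr add ds TDS_MD5="&tds";
quit;

/* PROC CONTENTS then reports the data set's extended attributes */
proc contents data=work.mytable;
run;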

ronan
Lapis Lazuli | Level 10

Yes, Chris, that's good advice. I was considering this new facility for my purpose, and I was even toying with the idea of coding the tools I need from scratch, following this nice paper by Rick Langston:

 

https://support.sas.com/resources/papers/proceedings16/SAS2800-2016.pdf

 

But I am a mere mortal and, unfortunately, that's out of my reach, so, out of laziness, I asked whether someone else maybe could.

 

The idea of a Digital Signature is complex to implement: over which 'space' (scope) do we compute the hash key? It's definitely not obvious, and the data portion of the dataset might not be the best candidate because it relies on page boundaries, which can vary from one copy to the next.

 

One alternative, perhaps more feasible albeit less obvious, would be to base the computation on the underlying SAS code itself (theoretically, this is related to "Kolmogorov complexity" - see https://mitpress.mit.edu/sites/default/files/titles/content/9780262072625_sch_0001.pdf), which can be standardized using this nice macro:

 

http://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/

 

The Table Digital Signature could then be defined as the hash key computed over the strings of symbols strictly between the datalines4 statement and the four semicolons ending the data portion. One small problem thus becomes: how to regenerate this high-level SAS code and feed it into a hashing algorithm without taking too much time? ...
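As a sketch of that last step (the file path below is a placeholder for the program regenerated by the macro above, and the 32767-byte record length is an assumption), a DATA step can read the generated program back and chain-hash only the lines strictly between the DATALINES4 statement and the closing ';;;;':

data _null_;
   infile "/tmp/mytable_datastep.sas" lrecl=32767 truncover end=eof;
   input line $char32767.;
   length _line_md5 $32 _tds $32;
   retain _tds _indata 0;
   /* leave the data portion when the terminating ';;;;' is reached */
   if _indata and strip(line) = ';;;;' then _indata = 0;
   /* hash the data lines only, chaining each one into the running digest */
   if _indata then do;
      _line_md5 = put(md5(strip(line)), $hex32.);
      _tds = put(md5(cats(_tds, _line_md5)), $hex32.);
   end;
   /* enter the data portion right after the DATALINES4 statement */
   if upcase(strip(line)) = 'DATALINES4;' then _indata = 1;
   if eof then put "TDS over the regenerated data lines: " _tds;
run;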

 

Another way to go, more arbitrary but perhaps more fruitful, might be to rely on the compression algorithm: if the algorithm is very stable (meaning the compression ratio does not vary with the OS, or even across runs), then the ratio itself, properly 'salted', might serve as a unique index, assuming that two different strings of data having strictly the same compression ratio is a rather rare event...