Hi Everyone,
I am currently developing some basic routines to find multiple copies (some would say duplicates) across SAS datasets, not duplicate observations within one.
Here is one possible blueprint for solving this issue.
Problem situation: cloning is not universal
We have tens of thousands of SAS tables (V9), many of them candidates for 1 master + n copies (usually 1 master + 1 copy, sometimes 2 or more copies). As far as I know, the tables were copied using Proc Copy and, unfortunately, not with any system tool (cp, copy), which would have saved us admins precious time. Alas, Proc Copy, even with the Clone option, does not produce a bit-perfect copy. Try it for yourself: a SAS table copied with Proc Copy is not like a file cloned at the filesystem level:
=> the MD5 hashes of the two *.sas7bdat files differ even though Proc Compare reports exact identity between the two tables.
A SAS dataset is cloned with reference to its content 'SAS-wise' (descriptor attribute values + data), whereas a file is cloned bit by bit at the block level 'storage-wise'. I suspect the page boundaries defined at the SAS level may not be exactly equal between the two copies, but that does not matter to the SAS engine (V9). A small demonstration follows.
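Here is a minimal sketch to reproduce the observation; the library paths and the table name MYTABLE are placeholders, and HASHING_FILE assumes SAS 9.4M6 or later:

/* clone a table with Proc Copy, then compare at both levels */
libname src '/data/src';   /* placeholder paths */
libname tgt '/data/tgt';

proc copy in=src out=tgt clone;
   select mytable;
run;

/* SAS-level comparison: reports an exact match */
proc compare base=src.mytable compare=tgt.mytable;
run;

/* file-level comparison: the two digests differ */
data _null_;
   length digest $32;
   /* third argument 0 => second argument is a physical path */
   digest = hashing_file('md5', '/data/src/mytable.sas7bdat', 0);
   put 'source file: ' digest;
   digest = hashing_file('md5', '/data/tgt/mytable.sas7bdat', 0);
   put 'target file: ' digest;
run;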
Even when the two data portions are exactly equal, if some table attributes differ in any way (e.g. a variable name, label, format or informat), then we should obviously declare the two copies unequal.
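Attribute differences are easy to surface from the metadata alone; a quick sketch through DICTIONARY.COLUMNS (SRC, TGT and MYTABLE are again placeholders):

proc sql;
   select name, type, length, label, format, informat
      from dictionary.columns
      where libname='SRC' and memname='MYTABLE'
   except
   select name, type, length, label, format, informat
      from dictionary.columns
      where libname='TGT' and memname='MYTABLE';
quit;

Any row returned is an attribute mismatch; run the EXCEPT in both directions to be thorough. Proc Compare also flags such differences in its 'Variables with Different Attributes' section.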
New tool: an optional Table Digital Signature
Considering the lack of tools at my disposal, I suggest providing a new feature for SAS datasets (V9, CAS):
let's call it, for instance, a Table Digital Signature (TDS). A question of definition immediately arises:
- should this new key identify a unique table as Descriptor + Observations?
- or as Observations only?
The latter looks more useful to me in practice.
How - and When - to compute the TDS
The TDS would be based on some hashing algorithm: a sequence of n bits { bit, bit, ..., bit } => one hash key such as MD5 or SHA-1, or a probabilistic structure such as a Bloom filter, chosen so that the extra computation time scales appropriately ('linearly'). A TDS computed in O(n^2) time, or anything beyond a single linear pass, would be utterly unusable in practice on large datasets.
Usually, the SAS engine reads a dataset sequentially from top to bottom (the OBS= data set option being one exception), and random or direct access (the SET statement with POINT=) is the exception rather than the rule: why not use this complete pass through the data to compute the TDS with an efficient single-pass (multi-threaded) algorithm?
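As a proof of concept of such a single pass, here is a minimal, single-threaded sketch that chains MD5 over the observations of SASHELP.CLASS; a real implementation would hash a canonical binary form of each data page rather than a text serialization of each row:

data _null_;
   set sashelp.class end=eof;
   length row $200 sig $16;
   retain sig;
   /* serialize the observation; a canonical binary form would be safer */
   row = catx('|', name, sex, age, height, weight);
   /* hash chain: fold each observation into the running signature */
   sig = md5(cats(put(sig, $hex32.), row));
   if eof then put 'TDS (observations only): ' sig $hex32.;
run;

Note that the chaining makes the signature order-dependent, which seems right since observation order is significant in a SAS dataset.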
Another idea for absorbing the TDS computation time would be to fold it into the Compress/Uncompress processing (when the dataset is compressed, of course): in a perfect world, this extra stage would cost only a marginal fraction of elapsed time.
Major difficulty: how to reconcile the TDS with partial modifications?
Like the cost of maintaining an index, any partial (in-place) modification of the dataset:
- Proc Datasets + Modify
- Data + If + Output
- Proc Append
etc.
would require a recalculation of the TDS, hence adding some non-marginal extra time to the modification even when the modified part is small compared to the overall table size: is there some associative hashing function already available to take care of this?
By associative here (the exact technical term might be different; 'incremental hashing' seems close), I mean some strong property like:
Hash(Big Part of table A) + Hash(Small Part of table A) = Hash(full table A)
See https://www.quora.com/Are-there-any-strong-associative-hash-mixing-functions
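I have not found a built-in associative hash in SAS, but an additive ('multiset') accumulator approximates the property: hash each observation independently, then sum the per-row hashes modulo a fixed width. An append then only costs hashing the new rows, and a deleted row can even be subtracted back out. A minimal sketch, where the 48-bit lane keeps the arithmetic exact in SAS doubles:

data _null_;
   set sashelp.class end=eof;
   retain sig 0;
   length row $200;
   row = catx('|', name, sex, age, height, weight);
   /* one 48-bit slice of each observation's MD5, read as a number */
   h = input(substr(put(md5(row), $hex32.), 1, 12), hex12.);
   /* additive accumulator: order-independent, incrementally updatable */
   sig = mod(sig + h, 2**48);
   if eof then put 'additive TDS: ' sig hex12.;
run;

Of course, 48 bits is weak against collisions; a serious version would accumulate several independent lanes or use a much larger modulus.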
How to store and retrieve the TDS value
Like the Sort flag (SORTEDBY), the TDS could be stored ('embedded') in the descriptor portion of the table and displayed by Proc Contents and in the dictionary views (SASHELP.VTABLE) where available.
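While waiting for a native feature, SAS 9.4 extended attributes already give a user-level place to park a TDS inside the dataset itself; the attribute name TDS and its value here are my own convention:

/* attach a signature computed elsewhere to the table */
proc datasets lib=work nolist;
   modify mytable;
   xattr add ds tds='3A7F0C2D9B14';
quit;

/* Proc Contents then reports the extended attribute with the descriptor */
proc contents data=work.mytable;
run;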
How to enable/disable the optional feature
Of course, there should be some system option to enable/disable the TDS calculation at the session, library or table level.
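Purely hypothetical syntax, just to fix ideas (the TABLEDIGSIG option does not exist today):

/* hypothetical option name, for illustration only */
options tabledigsig=yes;                      /* session level */
libname mylib '/data/prod' tabledigsig=yes;   /* library level */
data mylib.mytable (tabledigsig=yes);         /* table level   */
   set work.staging;
run;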
Afterthought: clone signatures aren't perfect
A minor concern with this kind of feature would be the legal status of the TDS: since the probability of false positives (or negatives) is in theory greater than zero (hash collision probability > 0), a TDS could only serve as partial, non-conclusive evidence. Taking this into account, I am still quite confident it would definitely improve our SAS usage no matter what! 🙂
If you like the idea, thanks for your support!