Hi again Chris,
Thanks for the clarifications!
Generally, you are right, as I can see now - proc compare extremely efficient if use it in the right way - with ID columns and sorted data sets etc.
But exactly in my case - I know for sure that counts of tables match and they are sorted in diff. way and no defined Key etc.,
Cases I would like to "catch" are like encoding issues during the migration, dates/decimals issue etc.
What I already figured-out:
For small and AVG tables - Except/Minus way works ok, it doesn't require defining PK for the table and so on. Interesting point - according to my tests except/minus works almost the same time as proc compare with ID - NoSort option.
For huge tables(from 5GB till 100GB in my case) - neither proc compare with ID nor Except doesn't perform.
Of course it depends on SAS configuration(CPU etc.), but in an actual situation using actual resources - the best way I see is split huge tables into smaller parts and compare part by part.
A lot of tables doesn't have PK as I wrote before, but all have at least 1 index, so my approximate plan is:
Calculated on the fly for how many parts the table should be split so each part has around 1GB.
Split table using most "distributed" column from the available index, usually it's some Surrogate Key, BIGInt.
Compare each of that cut sub-tables in the loop(starting from the newest data), one by one.
If there will be some difference - big probability it will be found on the first part, stop comparing after first founded diff.
Logging on fly - so ech time it's visible that f.e. 40% of table X is processed already, no differences etc., so some trace file will be loaded before and after each step
Point two is the most interesting from tech side. Looks like if the table is indexed and even if it's huge - the SAS code like
select distinct colA from tab1
Executes very quick if there is an index on colA, so having such "keys table" and having table size - very easy split table on N approximately same size parts, so we'll know that part 1 subtable starts from colA value 10 and ends with colA value 12345 , part 2 starts with cola value 12346 and end with value ...
Generally, it seems like it was much easier move data from SAS to other DB then compare if data matched after the move:)
... View more