Hi All,
I haven't worked with SAS for a long time, so I'd like to ask for advice on a tricky case.
Let's assume table A was migrated 1:1 from a SAS data set to some external DB.
After the migration we need to check whether both tables are still the same; for now a simple "Yes" or "No" answer is enough, no detailed report needed.
Obviously, for such a case there is no point in PROC COMPARE or the like; simple code like this is enough:
proc sql;
  select count(*)
  from (
    select * from saslib.tableX
    except
    select * from DBLib.tableX
  ) as a;
quit;
Plus the reverse check, DBLib.tableX minus saslib.tableX.
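Both directions can be checked in one PROC SQL step; a minimal sketch using the table names from above:

```sas
proc sql;
  /* rows present in the SAS table but missing from the DB table */
  select count(*) as sas_minus_db
  from (select * from saslib.tableX
        except
        select * from DBLib.tableX);

  /* and the reverse direction */
  select count(*) as db_minus_sas
  from (select * from DBLib.tableX
        except
        select * from saslib.tableX);
quit;
```

Both counts being 0 means "Yes".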
This approach works fine for the majority of the tables. A few tables had encoding issues, which was expected, but the other and main issue is decimals like 0.000000123.
Obviously each DB handles decimals in its own way, but what I would like to do (if possible) is set some global option at the beginning of the SAS session that says:
Ignore very small differences in decimals: if we have 0.0019 and 0.002, treat them as the same value, so the EXCEPT (MINUS) of such values returns 0 rows.
Of course, we could round the needed columns, etc., but this comparison has to be done for hundreds of tables (each with its own set of columns), some of them huge, so I would like to avoid any data manipulation before the EXCEPT (MINUS) of the two tables.
Computing a hash over all columns also doesn't look like an option here: the problematic decimals would still make the hash values different.
Long story short: is there some elegant way to force SAS to treat 0.0019 and 0.002 as the same value without any additional coding?
I will be very thankful for any advice :)!
You may want to investigate the COMPARE procedure. There are a couple of options available related to small differences. One you can set is CRITERION=. The default criterion for numeric comparisons in the procedure is 0.00001, but you can make it smaller or larger, though it affects all numeric comparisons, not variable by variable.
If you use PROC COMPARE, make sure the data is sorted in the same order.
The output can consist of data sets with a variety of different structures, so read the documentation. I would guess you are not interested in the normal printed output, only in a data set of differences, which can be accomplished with the NOPRINT, OUT=<somedatasetname>, and OUTNOEQUAL options. With the last option the output data set is empty if everything matches.
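A minimal sketch of the options described above (library and data set names taken from the thread; the 0.001 criterion is just an example to adjust):

```sas
/* NOPRINT suppresses the report; OUT= plus OUTNOEQUAL writes only the */
/* unequal observations, so an empty output data set means the tables  */
/* match within the criterion.                                         */
proc compare base=saslib.tableX compare=DBLib.tableX
             method=absolute criterion=0.001
             noprint out=work.noeq outnoequal;
run;

proc sql;
  select count(*) as diff_rows from work.noeq;
quit;
```

Note this assumes both tables are sorted the same way (or that an ID statement identifies the matching rows).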
Hi ballardw,
Thanks for the quick reply!
Will take a closer look at PROC COMPARE.
Some of the tables that need to be compared are really huge: hundreds of millions of rows, tens of gigabytes, etc., so I'm not sure PROC COMPARE will handle those monsters at all, but for small and average tables it can be an option :)
I also remember that PROC COMPARE does a lot of additional work that is not needed here, which is why I initially decided against it, but as you mentioned, some of its features can be turned off, so maybe that will speed up the comparison.
Thanks one more time!
I've never had any problems using PROC COMPARE on large tables. It will slow down the more differences it finds but in your case there shouldn't be many if you configure it to ignore tiny differences.
Hi SASKiwi,
It seems the data sets that need to be compared are not sorted the same way, and we don't even have an ID for the majority of the tables; for some of them a PK can be extracted from the descriptive portion, but not for all.
So it would have to be PROC COMPARE with the NOTSORTED option and without an ID statement, and I'm not even sure a comparison is possible that way.
Use the options METHOD=ABSOLUTE CRITERION=0.001, for example.
Thanks!
Will use the options
Thanks everyone for the advice!
I tried to use PROC COMPARE like this:
proc compare base=saslib.tableX compare=DBLib.tableX
             noprint method=absolute criterion=0.001
             outnoequal out=noeqCols;
  id _ALL_ notsorted;
run;
But it seems there is no way to use _ALL_ in PROC COMPARE.
The problem is that for half of the tables there is no PK in the descriptive portion, so the ID statement can't be used. Additionally, it seems the tables are sorted differently, and sorting both tables before comparing would be too heavy, I guess; some of the tables are tens of gigabytes.
For such huge tables I planned to test only the first/last million rows, but the decimal issue remains the same.
Thanks!
> But seems like there is no way to use _ALL_ for the compare proc.
The ID statement defines the key. The keys are not compared, they are matched.
ID _ALL_ therefore makes no sense for PROC COMPARE.
> The problem is that for half of the tables there is no PK
Well you need to solve that. You should only compare observations that have a reason to match, shouldn't you?
Hi again Chris,
Thanks for the clarifications!
Generally you are right; as I can see now, PROC COMPARE is extremely efficient if used the right way, with ID columns and sorted data sets, etc.
But in my case I know for sure that the row counts of the tables match, while they are sorted differently and there is no defined key.
The cases I would like to catch are things like encoding issues during the migration, date/decimal issues, etc.
What I have already figured out:
Of course it depends on the SAS configuration (CPU, etc.), but in the actual situation with the actual resources, the best way I see is to split the huge tables into smaller parts and compare part by part.
A lot of the tables don't have a PK, as I wrote before, but all have at least one index, so my approximate plan is:
Point two is the most interesting from the tech side. It looks like, even if the table is huge, SAS code like

select distinct colA from tab1

executes very quickly if there is an index on colA. So, having such a "keys table" and the table size, it is very easy to split the table into N parts of approximately the same size: we know that part 1 starts at colA value 10 and ends at colA value 12345, part 2 starts at colA value 12346 and ends at value ...
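The splitting idea above can be sketched roughly like this, assuming a numeric indexed column keyCol and a hypothetical chunk size of one million distinct keys (both assumptions to adjust):

```sas
/* Build the "keys table" - fast when keyCol is indexed. */
proc sql;
  create table work.keys as
  select distinct keyCol
  from saslib.tableX
  order by keyCol;
quit;

/* Assign each key to a part. */
data work.parts;
  set work.keys;
  part = ceil(_n_ / 1e6);   /* ~1e6 distinct keys per part */
run;

/* Keep the boundary values of each part. */
proc sql;
  create table work.ranges as
  select part, min(keyCol) as lo, max(keyCol) as hi
  from work.parts
  group by part;
quit;
```

Each (lo, hi) range can then drive a WHERE keyCol BETWEEN lo AND hi comparison of the corresponding subtables.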
Generally, it seems it was much easier to move the data from SAS to the other DB than to check whether the data matches after the move :)
Hi again,
I guess I'll go the following way: if the simple MINUS/EXCEPT doesn't match, I'll do an EXCEPT over all character columns plus the numeric columns without a decimal part; if that part of the tables matches, it means the issue is in the decimal columns, which is at least something :)
The set of columns can be extracted using PROC CONTENTS / macro variable manipulations, etc. :)
An alternative way: instead of SELECT *, do SELECT &column_list and form &column_list based on data type; if decimal, round it, if not, leave it as is, but of course that will impact performance.
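A sketch of building such a &column_list from DICTIONARY.COLUMNS. Note that the metadata doesn't say which numeric columns actually carry decimals, so this rounds every numeric column, and the 0.001 rounding unit is just an example:

```sas
proc sql noprint;
  select case when type = 'num'
              then cats('round(', name, ', 0.001) as ', name)
              else name
         end
  into :column_list separated by ', '
  from dictionary.columns
  where libname = 'SASLIB' and memname = 'TABLEX';
quit;

/* &column_list can now replace the * in the EXCEPT queries. */
```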
Another solution: Compare just one compound field. Something like:
data HAVE1 HAVE2;
  array A[50] $8;
  do I = 1 to 1e4;
    A50 = put(I, 8.);
    output HAVE1;
    if I = 1e3 then A50 = put(I+1, 8.);
    output HAVE2;
  end;
  drop I;
run;

data COMP1/view=COMP1;
  length COMP $16;
  set HAVE1;
  COMP = md5(catx('|', of _ALL_));
run;

data COMP2/view=COMP2;
  length COMP $16;
  set HAVE2;
  COMP = md5(catx('|', of _ALL_));
run;

proc sql;
  select COMP from COMP1
  except
  select COMP from COMP2;
quit;
If you find mismatches, you can re-run the views and filter using the offending values.
You may have to tune how variable COMP is created to ensure that the wanted level of numeric precision is kept.
Also, ensure the variables are concatenated in the same order for both tables.
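One way to keep the precision under control is to round the numeric variables inside the views before hashing; a sketch assuming hypothetical numeric columns num1-num3 and a 0.001 rounding unit (both assumptions to adjust):

```sas
data COMP1/view=COMP1;
  length COMP $16;
  set HAVE1;
  array nums {*} num1-num3;            /* assumed numeric columns */
  do _i = 1 to dim(nums);
    nums[_i] = round(nums[_i], 0.001); /* align precision before hashing */
  end;
  COMP = md5(catx('|', of _ALL_));
  drop _i;
run;
```

The same rounding must of course be applied in the COMP2 view so the hashes stay comparable.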
Hi again Chris,
That's a really good idea. Maybe for a few of the biggest tables it will be more efficient to split by columns rather than rows and compare column by column (or subset of columns by subset of columns using the MD5 hash, as you proposed).
Thanks so much for the advice!