Yura2301
Quartz | Level 8

Hi All,

I haven't worked with SAS for a long time, so I would like to ask for advice regarding a tricky case.

So let's assume table A was migrated from a SAS data set to some external DB, a 1-to-1 migration.

After the migration, we need to check whether both tables are still the same; in the end it should be a simple "Yes" or "No" answer, no detailed report needed for now.

Obviously, for such a case there is no need for proc compare or the like; simple code like this is enough:

select count(*)
from (
   select * from saslib.tableX
      except
   select * from DBLib.tableX
) as a;

Plus the reverse check: DBLib.tableX minus saslib.tableX.

 

This approach works fine for the majority of the tables. A few tables had encoding issues, which was expected, but the other and main issue is decimals like "0.000000123".

So obviously each DB handles decimals in its own way, but what I would like to do (if possible) is set some global option at the beginning of the SAS session that says something like:

Ignore very small decimal differences: if we have 0.0019 and 0.002, treat them as the same value, so an except (minus) over such values will return 0 rows.

Of course, we can round the needed columns, etc., but this comparison needs to be done for hundreds of tables (each with its own set of columns, etc.), and some of them are huge, so I would like to avoid any data manipulation before the except (minus) of those two tables.

Making a HASH based on all columns also doesn't look like an option for this case: those problematic decimals will make the hash values different anyway.

 

So, long story short, I would like to ask if there is some elegant way to force SAS to treat 0.0019 and 0.002 as the same value without any additional coding.

 

I will be very thankful for any advice :)!

13 REPLIES
ballardw
Super User

You may want to investigate the COMPARE procedure. There are a couple of options available related to small differences. One you can set is CRITERION. The default criterion for numeric comparisons in the procedure is 0.00001, but you can make it smaller or larger, though it affects all the numeric comparisons and not variable by variable.

 

If you use Proc Compare you want to make sure the data is sorted in the same order.

The output can consist of data sets with a variety of different constructs, so read the documentation. I would guess that you are likely not interested in the normal printed output but only in a data set of differences, which can be accomplished with the NOPRINT, OUT=<somedatasetname>, and OUTNOEQUAL options. With the last option, the output data set will be empty if everything matches.
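A minimal sketch of what that could look like (the table names are placeholders from this thread, and CRITERION=0.001 is just an illustrative threshold):

proc compare base=saslib.tableX compare=DBLib.tableX
             noprint
             method=absolute criterion=0.001 /* differences below 0.001 count as equal */
             outnoequal out=work.diffs;      /* OUT= keeps only the non-matching rows  */
run;

If WORK.DIFFS ends up with 0 observations, the tables matched within the criterion; the automatic macro variable SYSINFO can also be checked right after the step (0 means no differences were found).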

Yura2301
Quartz | Level 8

Hi ballardw,
Thanks for the quick reply!
I will take a closer look at proc compare.
Some of the tables that need to be compared are really huge (hundreds of millions of rows, tens of gigabytes, etc.), so I'm not sure proc compare will handle those monsters at all, but for smaller and average tables it can be an option :)
I also remember that proc compare does a lot of additional work that is not needed for this case, which is why I initially decided not to use it, but as you mentioned, some of the features can be "turned off", so maybe that will speed up the comparison.

Thanks one more time!

SASKiwi
PROC Star

I've never had any problems using PROC COMPARE on large tables. It will slow down the more differences it finds, but in your case there shouldn't be many if you configure it to ignore tiny differences.

Yura2301
Quartz | Level 8

Hi SASKiwi,

It seems like the data sets that need to be compared are not sorted the same way, and we don't even have an ID column for the majority of the tables; for some of the tables a PK can be extracted from the descriptive portion, but not for all.

So it would be proc compare with the NOTSORTED option and without an ID statement, and I'm not even sure it's possible to compare that way.

ChrisNZ
Tourmaline | Level 20

Use the options method=absolute criterion=0.001, for example.

Yura2301
Quartz | Level 8

Thanks!

Will use the options

Yura2301
Quartz | Level 8

Thanks everyone for the advice!

Tried to use proc compare like:
proc compare base=saslib.tableX compare=DBLib.tableX
             noprint method=absolute criterion=0.001
             outnoequal out=noeqCols;
  ID _ALL_ NOTSORTED;
run;

But it seems like there is no way to use _ALL_ with proc compare.

 

The problem is that for half of the tables there is no PK in the descriptive portion, so it's not possible to use the ID statement; additionally, it seems the tables are sorted in different ways, and sorting both tables before comparing would be too heavy, I guess, since some of the tables are tens of gigabytes.

For such huge tables I planned to test only the first/last million rows, but the issues with decimals remain the same.
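A minimal sketch of that partial check with the OBS= data set option (assuming the DB engine honors OBS=; note that without a defined sort order, the "first million rows" of a DB table is not guaranteed to be a stable set):

proc sql;
  select count(*) from (
    select * from saslib.tableX(obs=1000000)
    except
    select * from DBLib.tableX(obs=1000000)
  ) as a;
quit;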

 

Thanks!

ChrisNZ
Tourmaline | Level 20

> But it seems like there is no way to use _ALL_ with proc compare.

The id statement defines the key. The keys are not compared; they are matched.

id _ALL_ therefore makes no sense for proc compare.

 

> The problem is that for half of the tables there is no PK

Well you need to solve that. You should only compare observations that have a reason to match, shouldn't you?

Yura2301
Quartz | Level 8

Hi again Chris,

Thanks for the clarifications!

Generally, you are right; as I can see now, proc compare is extremely efficient if used the right way, with ID columns and sorted data sets, etc.

But in my exact case, I know for sure that the row counts of the tables match, while they are sorted in different ways and have no defined key, etc.

The cases I would like to "catch" are things like encoding issues during the migration, date/decimal issues, etc.

What I have already figured out:

  • For small and average tables, the Except/Minus approach works OK; it doesn't require defining a PK for the table, and so on. An interesting point: according to my tests, except/minus runs in almost the same time as proc compare with ID and NOTSORTED.
  • For huge tables (from 5 GB to 100 GB in my case), neither proc compare with ID nor except performs well.

Of course it depends on the SAS configuration (CPU, etc.), but in the actual situation, with the actual resources, the best way I see is to split the huge tables into smaller parts and compare part by part.

A lot of the tables don't have a PK, as I wrote before, but all have at least one index, so my approximate plan is:

  1. Calculate on the fly how many parts the table should be split into, so that each part is around 1 GB.
  2. Split the table using the most "distributed" column from the available index; usually it's some surrogate key, a BIGINT.
  3. Compare each of those sub-tables in a loop (starting from the newest data), one by one.
  4. If there is some difference, there is a big probability it will be found in the first part, so stop comparing after the first difference found.
  5. Log on the fly, so that it's always visible that, e.g., 40% of table X has been processed with no differences so far; some trace file will be loaded before and after each step.

Point two is the most interesting from the technical side. It looks like, even if the table is huge, if it is indexed then SAS code like

select distinct colA from tab1

executes very quickly when there is an index on colA. So, having such a "keys table" and the table size, it's very easy to split the table into N parts of approximately equal size: we know that the part 1 sub-table starts at colA value 10 and ends at colA value 12345, part 2 starts at colA value 12346 and ends at value ...
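A minimal sketch of that plan, assuming the split column is a numeric, roughly uniformly distributed surrogate key (the macro and all names in it are hypothetical):

%macro compare_in_parts(base=, comp=, key=, parts=10);
  %local i lo hi step from to diff;

  /* find the key range once; assumes &key is numeric and roughly uniform */
  proc sql noprint;
    select min(&key), max(&key) into :lo trimmed, :hi trimmed
    from &base;
  quit;
  %let step = %sysevalf((&hi - &lo + 1) / &parts);

  %do i = 1 %to &parts;
    %let from = %sysevalf(&lo + (&i - 1) * &step);
    %let to   = %sysevalf(&lo + &i * &step);

    /* one-direction except, restricted to this key range */
    proc sql noprint;
      select count(*) into :diff trimmed from (
        select * from &base where &key >= &from and &key < &to
        except
        select * from &comp where &key >= &from and &key < &to
      ) as d;
    quit;

    %put NOTE: part &i of &parts checked, &diff unmatched row(s).;
    %if &diff > 0 %then %do;
      %put WARNING: difference found in part &i, stopping.;
      %return;
    %end;
  %end;
%mend compare_in_parts;

/* hypothetical usage */
%compare_in_parts(base=saslib.tableX, comp=DBLib.tableX, key=SK, parts=50);

The reverse direction (DB minus SAS) would still need a second pass, the same as with the full-table except.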

 

Generally, it seems it was much easier to move the data from SAS to the other DB than to verify the data matches after the move :)

 

 

Yura2301
Quartz | Level 8

Hi again,

I guess I'll go the following way: if the simple minus/except doesn't match, I will do the except over all character columns and all numeric columns without a decimal part; if that part of the tables matches, it means the issue is in the decimal columns, which is at least something :)

The set of columns can be extracted using proc contents / macro variable manipulations, etc. :)

An alternative way: instead of select *, do select &column_LIST, forming &column_LIST based on data type (if decimal, round it; if not, leave it as is), but of course that will impact performance.
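A minimal sketch of building such a list from DICTIONARY.COLUMNS (the library/table names and the 0.001 rounding unit are placeholder assumptions; all numerics are rounded here, which is harmless for integers):

proc sql noprint;
  select case when type='num'
              then cats('round(', name, ', 0.001) as ', name)
              else strip(name)
         end
    into :column_LIST separated by ', '
  from dictionary.columns
  where libname='SASLIB' and memname='TABLEX';
quit;

proc sql;
  select count(*) from (
    select &column_LIST from saslib.tableX
    except
    select &column_LIST from DBLib.tableX
  ) as a;
quit;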

ChrisNZ
Tourmaline | Level 20

Another solution: Compare just one compound field. Something like:

/* build two test tables with 10,000 rows and 50 columns;
   HAVE2 differs from HAVE1 in a single row */
data HAVE1 HAVE2;
  array A[50] $8;
  do I=1 to 1e4;
    A50=put(I,8.);
    output HAVE1;
    if I=1e3 then A50=put(I+1,8.);  /* plant one difference */
    output HAVE2;
  end;
  drop I;
run;

/* views that reduce each row to one 16-byte MD5 hash of all columns */
data COMP1/view=COMP1;
  length COMP $16;
  set HAVE1;
  COMP=md5(catx('|',of _ALL_));  /* COMP is still blank here, so CATX skips it */
run;

data COMP2/view=COMP2;
  length COMP $16;
  set HAVE2;
  COMP=md5(catx('|',of _ALL_));
run;

/* row hashes present in HAVE1 but not in HAVE2 */
proc sql;
  select COMP from COMP1
  except
  select COMP from COMP2;
quit;

If you find mismatches, you can re-run the views and filter using the offending values.

You may have to tune how variable COMP is created to ensure that the wanted level of numeric precision is kept.

Also, ensure the variables are concatenated in the same order for both tables.
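If the tiny decimals are the only concern, one variation on the views (a sketch: the 0.001 rounding unit is an assumption, and it presumes the table has at least one numeric variable) is to round every numeric variable before hashing:

data COMP1/view=COMP1;
  length COMP $16;
  set HAVE1;
  array _n[*] _numeric_;          /* all numeric variables from HAVE1 */
  do _i=1 to dim(_n);
    _n[_i]=round(_n[_i],0.001);   /* absorb sub-0.001 differences */
  end;
  call missing(_i);               /* keep the loop index out of the hash */
  COMP=md5(catx('|',of _ALL_));
run;

The same change would go into COMP2, so both sides are rounded identically.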

Yura2301
Quartz | Level 8

Hi again Chris,

That's a really good idea. Maybe for a few of the biggest tables it will be more efficient to split by columns rather than by rows, and compare column by column (or subset of columns by subset of columns, using the md5 hash as you proposed).

 

 

Thanks so much for the advice!

ChrisNZ
Tourmaline | Level 20
You can compare all columns and all rows in one go if you hash all the fields as shown. And you can then easily sort the tables, if needed, on that one field.
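For the sorting route, a minimal sketch (assuming the COMP1/COMP2 views from above and no duplicate rows within a table):

proc sort data=COMP1 out=S1(keep=COMP); by COMP; run;
proc sort data=COMP2 out=S2(keep=COMP); by COMP; run;

data ONLY1 ONLY2;
  merge S1(in=a) S2(in=b);
  by COMP;
  if a and not b then output ONLY1;  /* hashes only in the first table  */
  if b and not a then output ONLY2;  /* hashes only in the second table */
run;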
