Re: Compare two similar data sets and find the difference

learn_SAS_23 · Posted 09-07-2018 04:55 AM

Hi team,

i am looking for any automated solution to comapare two large data which contain many variables , and print the results in excel by highliting the differences.

Thanks in advance.

RW9 · Posted 09-07-2018 05:01 AM

Proc compare is used for comparing two datasets. This doesn't however give the output you want exactly so you will have to manipulate the output from that into what ever it is you want.

Another way is to do:

proc sql;
  create table out as
  select *
  from   table1
  except 
  select * 
  from   table2
  union all
  select * 
  from   table2
  except 
  select *
  from   table1;
quit;

Or something along those lines (this just shows rows which are different between the two, you would then need to write code to do highlighting and such like.

I would say proc compare output is the simplest to get and probably most detailed, but it isn't the output you want, which is very specific to you and hence probably no specific tool.

A final option, export to CSV and use a text compare tool maybe? Seems a bit of a faff just to get some highlighted output though.

learn_SAS_23 · Posted 09-07-2018 09:24 AM

Hi Team,

we have almost 20 tables which is a SCD TYPE2 with more than 250 columns in each .i need to find the difference of old and new records in each table and high lisght the differences in excel.

To break the requirement

1. i took a back of table 1 before updating the table with new file let say old_table with 2 rows

Old_table :

id name city processed_date

1 Scott sweden 26/07/2018

2 Scott America 04/07/2018

2. After running a job , my table will get updated with new values which is type 2 SCD stores in different rows . So i copy the updated .

New_table :

id name city processed_date

1 Scott Finland 18/09/2018

2 Scott America 04/07/2018

3. Now i rename the columns in old data such as (by using rename macro)

Old_table :

id name_0ld city_old processed_date_old

1 Scott sweden 26/07/2018

4. Now i need to get my output , by merging old and new data set with only mismath rows or updated data with highlighted differences

id name_old name city_old city processed_date_old processed_date

1 Scott Scott Sweden Finland 26/07/2018 18/09/2018

i can achieve this solution by data merge if i have limited columns , but i want to automate this for large data sets which have multiple columns.

Can some one help , with a generic code to compare and highlight the difference records as descibed above.

Thanks in advance

VDD · Posted 09-07-2018 09:28 AM

Have you looked at using Proc Compare?

http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n1f136xdchlfrtn1663tm...

learn_SAS_23 · Posted 09-07-2018 09:32 AM

yes, i tried with proc compare The drawback of the procedure Proc Compare is its lengthy output if mismatches do exist. It is very
time consuming and labor extensive to find the information of interest, or pinpoint the exact changes in the dataset.

Reeza · Posted 09-07-2018 11:00 AM

I don't have a generic solution, but it's not that difficult to make one. If you can do it for one full data set and post that, you can then wrap it in a macro.

The reason I don' thave a generic solution is the requirements change slightly every time.

For example, I do have several generic macros that will report which variables are in which data sets which I more commonly need but for a dim change like this I would recommend a new macro.

Or you can parse PROC COMPARE. It really depends on what you want but there are several different OUT options that let you limit the output as desired.

@learn_SAS_23 wrote:
yes, i tried with proc compare The drawback of the procedure Proc Compare is its lengthy output if mismatches do exist. It is very
time consuming and labor extensive to find the information of interest, or pinpoint the exact changes in the dataset.

learn_SAS_23 · Posted 09-10-2018 02:03 AM

Thanks for giving me such confidence , please see below comments and suggest your opinion

ballardw · Posted 09-07-2018 11:39 AM

@learn_SAS_23 wrote:

Hi Team,

we have almost 20 tables which is a SCD TYPE2 with more than 250 columns in each .i need to find the difference of old and new records in each table and high lisght the differences in excel.

To break the requirement

1. i took a back of table 1 before updating the table with new file let say old_table with 2 rows

Old_table :

id name city processed_date

1 Scott sweden 26/07/2018

2 Scott America 04/07/2018



2. After running a job , my table will get updated with new values which is type 2 SCD stores in different rows . So i copy the updated .

New_table :

id name city processed_date

    1 Scott Finland 18/09/2018

2 Scott America 04/07/2018

3. Now i rename the columns in old data such as (by using rename macro)

Old_table :

id name_0ld city_old processed_date_old

1 Scott sweden 26/07/2018

4. Now i need to get my output , by merging old and new data set with only mismath rows or updated data with highlighted differences

id   name_old name city_old city processed_date_old   processed_date

    1 Scott Scott Sweden   Finland 26/07/2018 18/09/2018

i can achieve this solution by data merge if i have limited columns , but i want to automate this for large data sets which have multiple columns.

Can some one help , with a generic code to compare and highlight the difference records as descibed above.

Thanks in advance

You say you have 20 data sets to compare. What do you expect this to look like after comparing all 20 of them?

It almost sounds like you would have 20 versions of each variable??

learn_SAS_23 · Posted 09-10-2018 01:54 AM

i have 20 different data sets (of old and new data ) with different variables which contain approximetly 200 columns in each table , when i say table i had a old data in table_old_1 , and updated new data in table_new_1 , I need to merge the data and compare each and every column .

like i explained in my example

id name_old name city_old city processed_date_old processed_date
1 Scott Scott Sweden Finland 26/07/2018 18/09/2018

if i can use merge

data Merged;
merge Table_old_1 (in = x) Table_new_1(in = y);
by id;

if x = 1 and y = 1;
run;
proc sql;
create table MISMATCH as
select a.id,a.name_old,a.name_new,
a.city_old , b.city_new,
a.processed_date_old,b.processed_date_new,

from Merged a, Merged b
where a.id=b.id
and ( (a.name_old ne b.name_new) or
(a.city_old ne b.city_new) or
(a.processed_date_old ne b.processed_date_new) )
;quit;

in this code snippet i did for 3 columns, but i have to create a mismatch data set for 200 columns, can this can be done through macro , any suggestions please

Reeza · Posted 09-10-2018 10:48 AM

@learn_SAS_23 wrote:

in this code snippet i did for 3 columns, but i have to create a mismatch data set for 200 columns, can this can be done through macro , any suggestions please

Yes it can, but if you need this, PROC COMPARE really does a good job of doing that automatically. You have to do multiple checks though - one for if a record is present or not in the first place, one if it's different and what happens, if possible, if you have a key variable that is duplicated for some reason, which record would you check.

Otherwise to write a macro I'd start with:

1. Get list of variables from data set, minus KEY joining variables.

2. Make a RENAME statement so that all variables are renamed automatically (add old/new prefix to one of the data sets). You show this in your example but you need to automate it.

3. Merge using the KEY variable.

4. Set up two/four arrays and loop through checking for differences, two if you have all variables the same type, 4 if you have both numeric and characters that you need to compare.

5. Print report of differences

Make this a macro using the Key variables to join on and an output data set name.

learn_SAS_23 · Posted 09-10-2018 08:52 PM

Thanks reeza,

Can you please send a sample code snippet for point number 4, using 4arrays
to verify data, am just learning sas

I made code for other points (1,2,3,5)

Reeza · Posted 09-10-2018 10:50 PM

data want;
set have;

array old_char (*) $ <list of character variables>;
array old_num(*) <list of numeric variables>;
array new_char(*) $ <list of new names>;
array new_num(*)  <list of new names>;

array diff_char(*) $ diffC1-diffCn; *you may want to make new names here;
array diff_num(*) diffN1-diffNn; 

*loop over character;
do i=1 to dim (old_char);
    diff_char(i) = old_char(i) = new_char(i);
end;

*loop over numeric;
do i=1 to dim (old_num);
    diff_num(i) = old_num(i) - new_num(i);
end;

run;

Sure, but please post back your macro so others can use it as well 😉

I ended up being wrong, you need 6 arrays, one for old/new/diff x 2 - one for character and numeric.

There should be an easier way.

Here's a PROC COMPARE example. You could transpose this output data set and get what you want I think and it's easier to wrap that in a macro program IMO.

data class_new;
set sashelp.class;
if age in (12, 13, 14) then height = height*2;
if age = 11 then delete;
run;

proc sort data=class_new;
by name; run;

proc sort data=sashelp.class out=class_old;
by name;run;

proc compare base=class_new compare=class_old outall out=want;
run;

@learn_SAS_23 wrote:
Thanks reeza,

Can you please send a sample code snippet for point number 4, using 4arrays
to verify data, am just learning sas

I made code for other points (1,2,3,5)

comapre two identical data sets with many variables and hightligt the differences in excel.

Re: comapre two identical data sets with many variables and hightligt the differences in excel.

Compare two similar data sets and find the difference