compare

Solved
Frequent Contributor
Posts: 87

compare

how can I use proc compare for datasets that have different number of records?

Example: table A has 5000 records; table B has 5025 records.I want to spit out the 25 records that do not exist in table A and also any other differences.

Thanks

Accepted Solutions
Solution
‎08-15-2013 07:19 PM
Frequent Contributor
Posts: 129

Re: compare

So does your output look something like the following.  I created a 10 record dataset and made a copy of it.  In the copy I deleted the fifth record  and changed the values of the non-id variable on the last five variables.

The COMPARE Procedure
Comparison of WORK.X with WORK.Y
(Method=EXACT)

Data Set Summary

Dataset           Created          Modified  NVar    NObs

WORK.X   15AUG13:17:47:52  15AUG13:17:47:52     2      10
WORK.Y   15AUG13:18:10:07  15AUG13:18:10:07     2       9

Variables Summary

Number of Variables in Common: 2.
Number of ID Variables: 1.

Observation Summary

Observation      Base  Compare  ID

First Obs           1        1  i=1
First Unequal       6        5  i=6
Last  Unequal      10        9  i=10
Last  Obs          10        9  i=10

Number of Observations in Common: 9.
Number of Observations in WORK.X but not in WORK.Y: 1.
Total Number of Observations Read from WORK.X: 10.
Total Number of Observations Read from WORK.Y: 9.

Number of Observations with Some Compared Variables Unequal: 5.
Number of Observations with All Compared Variables Equal: 4.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 0.
Number of Variables Compared with Some Observations Unequal: 1.
Total Number of Values which Compare Unequal: 5.
Maximum Difference: 10.

Comparison of WORK.X with WORK.Y
(Method=EXACT)

All Variables Compared have Unequal Values

Variable  Type  Len  Ndif   MaxDif

j         NUM     8     5   10.000

Value Comparison Results for Variables

_________________________________________________________
||       Base    Compare
i  ||          j          j      Diff.     % Diff
_______  ||  _________  _________  _________  _________
||
6  ||   120.0000   130.0000    10.0000     8.3333
7  ||   140.0000   150.0000    10.0000     7.1429
8  ||   160.0000   170.0000    10.0000     6.2500
9  ||   180.0000   190.0000    10.0000     5.5556
10  ||   200.0000   210.0000    10.0000     5.0000
_________________________________________________________

Take a look at the output dataset options from proc compare and see if there is an output datset which you could post process to make a more compact report.  The raw output from proc compare can be a little overwhelming when there are many data discrepencies.

All Replies
Super User
Posts: 3,927

Re: compare

Proc Compare works in two ways:

1) Compares record for record between two tables (1 with 1, 2 with 2 etc)

2) Compares records using key variables common in both tables that uniquely identify which record in the first table is to be compared with which record in the second table.

So do your tables A and B contain one of more key variables that will uniquely identify a record?

If so use proc compare with an ID statement using these variables.

Frequent Contributor
Posts: 129

Re: compare

I like to use ID variables for this.  If you have a set of columns which uniquely identifies each record in both data sets, then ID statement lets proc compare handle the missing records appropriately.  So for example with two datasets with unique personids on each record and where the second data set contains 25 records not found on the first, this will workL

proc compare data=one

compare=two

;

id personid ;

run .

Note this assumes both data sets are sorted by personid.

Another option is to use a where clause on the one with more records, but you still have to worry about sort order.

proc compare data=one

compare=two (where = ( not new record condition))

;

run .

HTH

Larry

Frequent Contributor
Posts: 87

Re: compare

I tried with the Id statement like the below code. But it does not work.

proc sort data=prep7;by acc_id;run;

proc sort data=prep7o;by acc_id;run;

proc compare base=prep7 compare=prep7o; id acc_id; run;

The problem is, when i sort with acc_id at record 2400 there is a missing record in table A. And sas shows all reords to be different after that record.

Frequent Contributor
Posts: 129

Re: compare

Are you sure your sorts worked?  You can check your log to determine that.  It does not make sense to me that if you sorted the table A by ACC_ID that record number 2400 should have a missing value, unless the 2399 record before it had missing values.  I reach this conclusion because (at least on ASCII machines), missing values are always the smallest possible value, regardless of whether it is a string or numeric variable.    I am real rusty on EBCDIC so if you are on a Z/OS machine I could be wrong.

If the sorts did work and you have 2399 missing values in each file, then the compare is pointing to a difference in your files.

Can you post some of the output from proc compare?

Super Contributor
Posts: 418

Re: compare

IK don't believe he's  saying his sort didn't work, what he's saying is the proc compare is throwing an error for every reord after number 2,400.

That's because BOTH datasets have the first 2,400 records in  common, however his dataset #2 is missing record 2,401 that exists in Dataset number 1.

Therefore, every record after that point is now incorrect because SAS is trying to compare record 2,401 in dataset A to record 2,401 in dataset B, however datatset B's record 2,401 is NOT equal to the same one in dataset A.

Correct me if I'm wrong, but i'm pretty sure this is what you mean because i've had it happen to me before.

However, by using the ID variable on your Key, this should not happen. Proc Compare is supposed to get an output like this...

Number of Observations in Common 9
Number of Observations in WORK.person but not in
WORK.person1
1
Total Number of Observations Read from WORK.person1 9
Total Number of Observations Read from WORK.person 10
Number of Observations with Some Compared Variables
Unequal
0
Number of Observations with All Compared Variables
Equal:
9
Frequent Contributor
Posts: 87

Re: compare

@Anotherdream: you are exactly right in understanding my issue.

But, I did not get the proc compare report like you showed it. Unless i misread it. I'll confirm it again. I do remember getting a list of all the records that are unequal and my output was so huge that it printed max 50 records.

I was only expecting the records that were not existing in my other table. Is there a way I can do this, since i need it as my validation report.

Solution
‎08-15-2013 07:19 PM
Frequent Contributor
Posts: 129

Re: compare

So does your output look something like the following.  I created a 10 record dataset and made a copy of it.  In the copy I deleted the fifth record  and changed the values of the non-id variable on the last five variables.

The COMPARE Procedure
Comparison of WORK.X with WORK.Y
(Method=EXACT)

Data Set Summary

Dataset           Created          Modified  NVar    NObs

WORK.X   15AUG13:17:47:52  15AUG13:17:47:52     2      10
WORK.Y   15AUG13:18:10:07  15AUG13:18:10:07     2       9

Variables Summary

Number of Variables in Common: 2.
Number of ID Variables: 1.

Observation Summary

Observation      Base  Compare  ID

First Obs           1        1  i=1
First Unequal       6        5  i=6
Last  Unequal      10        9  i=10
Last  Obs          10        9  i=10

Number of Observations in Common: 9.
Number of Observations in WORK.X but not in WORK.Y: 1.
Total Number of Observations Read from WORK.X: 10.
Total Number of Observations Read from WORK.Y: 9.

Number of Observations with Some Compared Variables Unequal: 5.
Number of Observations with All Compared Variables Equal: 4.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 0.
Number of Variables Compared with Some Observations Unequal: 1.
Total Number of Values which Compare Unequal: 5.
Maximum Difference: 10.

Comparison of WORK.X with WORK.Y
(Method=EXACT)

All Variables Compared have Unequal Values

Variable  Type  Len  Ndif   MaxDif

j         NUM     8     5   10.000

Value Comparison Results for Variables

_________________________________________________________
||       Base    Compare
i  ||          j          j      Diff.     % Diff
_______  ||  _________  _________  _________  _________
||
6  ||   120.0000   130.0000    10.0000     8.3333
7  ||   140.0000   150.0000    10.0000     7.1429
8  ||   160.0000   170.0000    10.0000     6.2500
9  ||   180.0000   190.0000    10.0000     5.5556
10  ||   200.0000   210.0000    10.0000     5.0000
_________________________________________________________

Take a look at the output dataset options from proc compare and see if there is an output datset which you could post process to make a more compact report.  The raw output from proc compare can be a little overwhelming when there are many data discrepencies.

Frequent Contributor
Posts: 87

Re: compare

Thank you all for your input. It worked now.

My ID variable had duplicate records. Hence I was not getting desired compare output. I added another variable that makes the ID unique, that worked.

🔒 This topic is solved and locked.