- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
how can I use proc compare for datasets that have different number of records?
Example: table A has 5000 records; table B has 5025 records.I want to spit out the 25 records that do not exist in table A and also any other differences.
Thanks
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So does your output look something like the following. I created a 10 record dataset and made a copy of it. In the copy I deleted the fifth record and changed the values of the non-id variable on the last five variables.
The COMPARE Procedure
Comparison of WORK.X with WORK.Y
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.X 15AUG13:17:47:52 15AUG13:17:47:52 2 10
WORK.Y 15AUG13:18:10:07 15AUG13:18:10:07 2 9
Variables Summary
Number of Variables in Common: 2.
Number of ID Variables: 1.
Observation Summary
Observation Base Compare ID
First Obs 1 1 i=1
First Unequal 6 5 i=6
Last Unequal 10 9 i=10
Last Obs 10 9 i=10
Number of Observations in Common: 9.
Number of Observations in WORK.X but not in WORK.Y: 1.
Total Number of Observations Read from WORK.X: 10.
Total Number of Observations Read from WORK.Y: 9.
Number of Observations with Some Compared Variables Unequal: 5.
Number of Observations with All Compared Variables Equal: 4.
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 0.
Number of Variables Compared with Some Observations Unequal: 1.
Total Number of Values which Compare Unequal: 5.
Maximum Difference: 10.
Comparison of WORK.X with WORK.Y
(Method=EXACT)
All Variables Compared have Unequal Values
Variable Type Len Ndif MaxDif
j NUM 8 5 10.000
Value Comparison Results for Variables
_________________________________________________________
|| Base Compare
i || j j Diff. % Diff
_______ || _________ _________ _________ _________
||
6 || 120.0000 130.0000 10.0000 8.3333
7 || 140.0000 150.0000 10.0000 7.1429
8 || 160.0000 170.0000 10.0000 6.2500
9 || 180.0000 190.0000 10.0000 5.5556
10 || 200.0000 210.0000 10.0000 5.0000
_________________________________________________________
Take a look at the output dataset options from proc compare and see if there is an output datset which you could post process to make a more compact report. The raw output from proc compare can be a little overwhelming when there are many data discrepencies.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Proc Compare works in two ways:
1) Compares record for record between two tables (1 with 1, 2 with 2 etc)
2) Compares records using key variables common in both tables that uniquely identify which record in the first table is to be compared with which record in the second table.
So do your tables A and B contain one of more key variables that will uniquely identify a record?
If so use proc compare with an ID statement using these variables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I like to use ID variables for this. If you have a set of columns which uniquely identifies each record in both data sets, then ID statement lets proc compare handle the missing records appropriately. So for example with two datasets with unique personids on each record and where the second data set contains 25 records not found on the first, this will workL
proc compare data=one
compare=two
;
id personid ;
run .
Note this assumes both data sets are sorted by personid.
Another option is to use a where clause on the one with more records, but you still have to worry about sort order.
proc compare data=one
compare=two (where = ( not new record condition))
;
run .
HTH
Larry
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I tried with the Id statement like the below code. But it does not work.
proc sort data=prep7;by acc_id;run;
proc sort data=prep7o;by acc_id;run;
proc compare base=prep7 compare=prep7o; id acc_id; run;
The problem is, when i sort with acc_id at record 2400 there is a missing record in table A. And sas shows all reords to be different after that record.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Are you sure your sorts worked? You can check your log to determine that. It does not make sense to me that if you sorted the table A by ACC_ID that record number 2400 should have a missing value, unless the 2399 record before it had missing values. I reach this conclusion because (at least on ASCII machines), missing values are always the smallest possible value, regardless of whether it is a string or numeric variable. I am real rusty on EBCDIC so if you are on a Z/OS machine I could be wrong.
If the sorts did work and you have 2399 missing values in each file, then the compare is pointing to a difference in your files.
Can you post some of the output from proc compare?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
IK don't believe he's saying his sort didn't work, what he's saying is the proc compare is throwing an error for every reord after number 2,400.
That's because BOTH datasets have the first 2,400 records in common, however his dataset #2 is missing record 2,401 that exists in Dataset number 1.
Therefore, every record after that point is now incorrect because SAS is trying to compare record 2,401 in dataset A to record 2,401 in dataset B, however datatset B's record 2,401 is NOT equal to the same one in dataset A.
Correct me if I'm wrong, but i'm pretty sure this is what you mean because i've had it happen to me before.
However, by using the ID variable on your Key, this should not happen. Proc Compare is supposed to get an output like this...
Number of Observations in Common | 9 |
Number of Observations in WORK.person but not in WORK.person1 | 1 |
Total Number of Observations Read from WORK.person1 | 9 |
Total Number of Observations Read from WORK.person | 10 |
Number of Observations with Some Compared Variables Unequal | 0 |
Number of Observations with All Compared Variables Equal: | 9 |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Anotherdream: you are exactly right in understanding my issue.
But, I did not get the proc compare report like you showed it. Unless i misread it. I'll confirm it again. I do remember getting a list of all the records that are unequal and my output was so huge that it printed max 50 records.
I was only expecting the records that were not existing in my other table. Is there a way I can do this, since i need it as my validation report.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So does your output look something like the following. I created a 10 record dataset and made a copy of it. In the copy I deleted the fifth record and changed the values of the non-id variable on the last five variables.
The COMPARE Procedure
Comparison of WORK.X with WORK.Y
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.X 15AUG13:17:47:52 15AUG13:17:47:52 2 10
WORK.Y 15AUG13:18:10:07 15AUG13:18:10:07 2 9
Variables Summary
Number of Variables in Common: 2.
Number of ID Variables: 1.
Observation Summary
Observation Base Compare ID
First Obs 1 1 i=1
First Unequal 6 5 i=6
Last Unequal 10 9 i=10
Last Obs 10 9 i=10
Number of Observations in Common: 9.
Number of Observations in WORK.X but not in WORK.Y: 1.
Total Number of Observations Read from WORK.X: 10.
Total Number of Observations Read from WORK.Y: 9.
Number of Observations with Some Compared Variables Unequal: 5.
Number of Observations with All Compared Variables Equal: 4.
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 0.
Number of Variables Compared with Some Observations Unequal: 1.
Total Number of Values which Compare Unequal: 5.
Maximum Difference: 10.
Comparison of WORK.X with WORK.Y
(Method=EXACT)
All Variables Compared have Unequal Values
Variable Type Len Ndif MaxDif
j NUM 8 5 10.000
Value Comparison Results for Variables
_________________________________________________________
|| Base Compare
i || j j Diff. % Diff
_______ || _________ _________ _________ _________
||
6 || 120.0000 130.0000 10.0000 8.3333
7 || 140.0000 150.0000 10.0000 7.1429
8 || 160.0000 170.0000 10.0000 6.2500
9 || 180.0000 190.0000 10.0000 5.5556
10 || 200.0000 210.0000 10.0000 5.0000
_________________________________________________________
Take a look at the output dataset options from proc compare and see if there is an output datset which you could post process to make a more compact report. The raw output from proc compare can be a little overwhelming when there are many data discrepencies.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you all for your input. It worked now.
My ID variable had duplicate records. Hence I was not getting desired compare output. I added another variable that makes the ID unique, that worked.