BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
helloSAS
Obsidian | Level 7

how can I use proc compare for datasets that have different number of records?

Example: table A has 5000 records; table B has 5025 records.I want to spit out the 25 records that do not exist in table A and also any other differences.

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
LarryWorley
Fluorite | Level 6

So does your output look something like the following.  I created a 10 record dataset and made a copy of it.  In the copy I deleted the fifth record  and changed the values of the non-id variable on the last five variables.

                                                                 The COMPARE Procedure
                                                                Comparison of WORK.X with WORK.Y
                                                                         (Method=EXACT)

                                                                       Data Set Summary

                                                   Dataset           Created          Modified  NVar    NObs

                                                   WORK.X   15AUG13:17:47:52  15AUG13:17:47:52     2      10
                                                   WORK.Y   15AUG13:18:10:07  15AUG13:18:10:07     2       9


                                                                       Variables Summary

                                                             Number of Variables in Common: 2.
                                                             Number of ID Variables: 1.


                                                                      Observation Summary

                                                            Observation      Base  Compare  ID

                                                            First Obs           1        1  i=1
                                                            First Unequal       6        5  i=6
                                                            Last  Unequal      10        9  i=10
                                                            Last  Obs          10        9  i=10

                                                Number of Observations in Common: 9.
                                                Number of Observations in WORK.X but not in WORK.Y: 1.
                                                Total Number of Observations Read from WORK.X: 10.
                                                Total Number of Observations Read from WORK.Y: 9.

                                                Number of Observations with Some Compared Variables Unequal: 5.
                                                Number of Observations with All Compared Variables Equal: 4.


                                                                   Values Comparison Summary

                                                Number of Variables Compared with All Observations Equal: 0.
                                                Number of Variables Compared with Some Observations Unequal: 1.
                                                Total Number of Values which Compare Unequal: 5.
                                                Maximum Difference: 10.


                                                            Comparison of WORK.X with WORK.Y
                                                                         (Method=EXACT)

                                                           All Variables Compared have Unequal Values

                                                              Variable  Type  Len  Ndif   MaxDif

                                                              j         NUM     8     5   10.000

                                                             Value Comparison Results for Variables

                                                   _________________________________________________________
                                                             ||       Base    Compare
                                                          i  ||          j          j      Diff.     % Diff
                                                    _______  ||  _________  _________  _________  _________
                                                             ||
                                                          6  ||   120.0000   130.0000    10.0000     8.3333
                                                          7  ||   140.0000   150.0000    10.0000     7.1429
                                                          8  ||   160.0000   170.0000    10.0000     6.2500
                                                          9  ||   180.0000   190.0000    10.0000     5.5556
                                                         10  ||   200.0000   210.0000    10.0000     5.0000
                                                   _________________________________________________________

Take a look at the output dataset options from proc compare and see if there is an output datset which you could post process to make a more compact report.  The raw output from proc compare can be a little overwhelming when there are many data discrepencies.

View solution in original post

8 REPLIES 8
SASKiwi
PROC Star

Proc Compare works in two ways:

1) Compares record for record between two tables (1 with 1, 2 with 2 etc)

2) Compares records using key variables common in both tables that uniquely identify which record in the first table is to be compared with which record in the second table.

So do your tables A and B contain one of more key variables that will uniquely identify a record?

If so use proc compare with an ID statement using these variables.


LarryWorley
Fluorite | Level 6

I like to use ID variables for this.  If you have a set of columns which uniquely identifies each record in both data sets, then ID statement lets proc compare handle the missing records appropriately.  So for example with two datasets with unique personids on each record and where the second data set contains 25 records not found on the first, this will workL

proc compare data=one

                    compare=two

                   ;

  id personid ;

run .

Note this assumes both data sets are sorted by personid.

Another option is to use a where clause on the one with more records, but you still have to worry about sort order.

proc compare data=one

                    compare=two (where = ( not new record condition))

                   ;

run .

HTH

Larry

helloSAS
Obsidian | Level 7


I tried with the Id statement like the below code. But it does not work.

proc sort data=prep7;by acc_id;run;

proc sort data=prep7o;by acc_id;run;

proc compare base=prep7 compare=prep7o; id acc_id; run;

The problem is, when i sort with acc_id at record 2400 there is a missing record in table A. And sas shows all reords to be different after that record.

LarryWorley
Fluorite | Level 6

Are you sure your sorts worked?  You can check your log to determine that.  It does not make sense to me that if you sorted the table A by ACC_ID that record number 2400 should have a missing value, unless the 2399 record before it had missing values.  I reach this conclusion because (at least on ASCII machines), missing values are always the smallest possible value, regardless of whether it is a string or numeric variable.    I am real rusty on EBCDIC so if you are on a Z/OS machine I could be wrong.Smiley Happy

If the sorts did work and you have 2399 missing values in each file, then the compare is pointing to a difference in your files.

Can you post some of the output from proc compare?

Anotherdream
Quartz | Level 8

IK don't believe he's  saying his sort didn't work, what he's saying is the proc compare is throwing an error for every reord after number 2,400.

That's because BOTH datasets have the first 2,400 records in  common, however his dataset #2 is missing record 2,401 that exists in Dataset number 1.

Therefore, every record after that point is now incorrect because SAS is trying to compare record 2,401 in dataset A to record 2,401 in dataset B, however datatset B's record 2,401 is NOT equal to the same one in dataset A.

Correct me if I'm wrong, but i'm pretty sure this is what you mean because i've had it happen to me before.

However, by using the ID variable on your Key, this should not happen. Proc Compare is supposed to get an output like this...


Number of Observations in Common 9
Number of Observations in WORK.person but not in
WORK.person1
1
Total Number of Observations Read from WORK.person1 9
Total Number of Observations Read from WORK.person 10
Number of Observations with Some Compared Variables
Unequal
0
Number of Observations with All Compared Variables
Equal:
9
helloSAS
Obsidian | Level 7

@Anotherdream: you are exactly right in understanding my issue.

But, I did not get the proc compare report like you showed it. Unless i misread it. I'll confirm it again. I do remember getting a list of all the records that are unequal and my output was so huge that it printed max 50 records.

I was only expecting the records that were not existing in my other table. Is there a way I can do this, since i need it as my validation report.

LarryWorley
Fluorite | Level 6

So does your output look something like the following.  I created a 10 record dataset and made a copy of it.  In the copy I deleted the fifth record  and changed the values of the non-id variable on the last five variables.

                                                                 The COMPARE Procedure
                                                                Comparison of WORK.X with WORK.Y
                                                                         (Method=EXACT)

                                                                       Data Set Summary

                                                   Dataset           Created          Modified  NVar    NObs

                                                   WORK.X   15AUG13:17:47:52  15AUG13:17:47:52     2      10
                                                   WORK.Y   15AUG13:18:10:07  15AUG13:18:10:07     2       9


                                                                       Variables Summary

                                                             Number of Variables in Common: 2.
                                                             Number of ID Variables: 1.


                                                                      Observation Summary

                                                            Observation      Base  Compare  ID

                                                            First Obs           1        1  i=1
                                                            First Unequal       6        5  i=6
                                                            Last  Unequal      10        9  i=10
                                                            Last  Obs          10        9  i=10

                                                Number of Observations in Common: 9.
                                                Number of Observations in WORK.X but not in WORK.Y: 1.
                                                Total Number of Observations Read from WORK.X: 10.
                                                Total Number of Observations Read from WORK.Y: 9.

                                                Number of Observations with Some Compared Variables Unequal: 5.
                                                Number of Observations with All Compared Variables Equal: 4.


                                                                   Values Comparison Summary

                                                Number of Variables Compared with All Observations Equal: 0.
                                                Number of Variables Compared with Some Observations Unequal: 1.
                                                Total Number of Values which Compare Unequal: 5.
                                                Maximum Difference: 10.


                                                            Comparison of WORK.X with WORK.Y
                                                                         (Method=EXACT)

                                                           All Variables Compared have Unequal Values

                                                              Variable  Type  Len  Ndif   MaxDif

                                                              j         NUM     8     5   10.000

                                                             Value Comparison Results for Variables

                                                   _________________________________________________________
                                                             ||       Base    Compare
                                                          i  ||          j          j      Diff.     % Diff
                                                    _______  ||  _________  _________  _________  _________
                                                             ||
                                                          6  ||   120.0000   130.0000    10.0000     8.3333
                                                          7  ||   140.0000   150.0000    10.0000     7.1429
                                                          8  ||   160.0000   170.0000    10.0000     6.2500
                                                          9  ||   180.0000   190.0000    10.0000     5.5556
                                                         10  ||   200.0000   210.0000    10.0000     5.0000
                                                   _________________________________________________________

Take a look at the output dataset options from proc compare and see if there is an output datset which you could post process to make a more compact report.  The raw output from proc compare can be a little overwhelming when there are many data discrepencies.

helloSAS
Obsidian | Level 7

Thank you all for your input. It worked now.

My ID variable had duplicate records. Hence I was not getting desired compare output. I added another variable that makes the ID unique, that worked.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1334 views
  • 3 likes
  • 4 in conversation