Solved
PROC Star
Posts: 633

# How to find the Mismatched records?

I ran the following code to find the records which are available in A and not in B. I've 5918065 records in A dataset and 5902253 records in B dataset and I want to see the missing 15812 records. After running this code, I see only a 12180 records in 'Non_Matching_Feb_2017' dataset instead of 15812. Could someone of you help me with the code to see the missing 15812 records in Dataset B?

``````Data Matching_Feb_2017 Non_Matching_Feb_2017;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

run;
``````

Log:

``````NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.
``````

Accepted Solutions
Solution
‎03-05-2018 12:33 AM
Super User
Posts: 6,921

## Re: How to find the Mismatched records?

Actually, the 12,180 number is correct.  But other issues require attention.

First, POLICY_NO is not unique.  It is not unique in the data set A, and not unique in data set B.  That's what this line is telling you:

NOTE: MERGE statement has more than one data set with repeats of BY values.

Maybe that means you have to fix the data first.  Maybe that means you should be merging by more than one variable instead of just by POLICY_NO.  Maybe that means you need to adjust your expectations about the number of observations.  Ironically, however, the 12,180 number is correct.  There are 12,180 observations where the POLICY_NO appears in A but not in B.  That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix).  You would need to inspect the output data set to determine that.

Take this example:  POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B.  There would be 1,000 more observations in data set A than in data set B.  But only one POLICY_NO that is mismatched.

All Replies
Contributor
Posts: 27

## Re: How to find the Mismatched records?

did you check to see if you have records in b that are not in a?

PROC Star
Posts: 633

PROC Star
Posts: 624

## Re: How to find the Mismatched records?

Hi,

If Table A has 5,918,065 and Table B has 5,902,253 records why are you expecting that all records from table B are present in Table A.

In your condition your are saying that get all the records if Policy_no is present in Table A and Not present in Table B.

Thanks,
Suryakiran
PROC Star
Posts: 633

## Re: How to find the Mismatched records?

I'm not excepting all records from table B should be present in Table A. I would like to see the records which are available in A dataset and not in B dataset irrespective of any condition. Could you please help with the SQL to achieve this?

Contributor
Posts: 27

## Re: How to find the Mismatched records?

proc sql;

create table abc as

select distinct * from a

where policy_no not in

(select distinct policy_no

from b);

quit;

PROC Star
Posts: 624

## Re: How to find the Mismatched records?

Did you check for duplicate values in the data-sets?

Thanks,
Suryakiran
PROC Star
Posts: 633

## Re: How to find the Mismatched records?

Yes, there are duplicates but still it is not matching the number what I
expect. I'm not certain whether changing the by variable will work here.
PROC Star
Posts: 624

## Re: How to find the Mismatched records?

[ Edited ]

Change the BY variable so that you have all unique values in both the tables. If you don't get what you expected then you may have duplicates through all the by vars.

check id proc sort nodupkey removes any duplicates in both the tables for the BY variables.

Why are you worried about the counts, the process your trying is giving you the right output and you may not be right about the count your expecting because you are including the duplicate records.

Thanks,
Suryakiran
Solution
‎03-05-2018 12:33 AM
Super User
Posts: 6,921

## Re: How to find the Mismatched records?

Actually, the 12,180 number is correct.  But other issues require attention.

First, POLICY_NO is not unique.  It is not unique in the data set A, and not unique in data set B.  That's what this line is telling you:

NOTE: MERGE statement has more than one data set with repeats of BY values.

Maybe that means you have to fix the data first.  Maybe that means you should be merging by more than one variable instead of just by POLICY_NO.  Maybe that means you need to adjust your expectations about the number of observations.  Ironically, however, the 12,180 number is correct.  There are 12,180 observations where the POLICY_NO appears in A but not in B.  That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix).  You would need to inspect the output data set to determine that.

Take this example:  POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B.  There would be 1,000 more observations in data set A than in data set B.  But only one POLICY_NO that is mismatched.

PROC Star
Posts: 633

## Re: How to find the Mismatched records?

In order to get the number what I expect, can I merge it with all the
variables?
Super User
Posts: 23,980

## Re: How to find the Mismatched records?

``````Data Matching_Feb_2017 Non_Matching_Feb_2017 CHECK_DATA;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;if b and not a then output check_data;

run;``````

Post the log from the code above.

Second - this type of message usually means your join needs another variable. So which SET of variables uniquely identify each record, it's clearly not just Policy_NO. If it's insurance policy there's usually a start or end date or renewal number that can be used as well to join because my policy number doesn't change every year, but it has a new entry in the data every year.

PROC Star
Posts: 633

## Re: How to find the Mismatched records?

Here you go:

``````NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.
NOTE: The data set WORK.CHECK_DATA has 0 observations and 34 variables.``````
PROC Star
Posts: 633

## Re: How to find the Mismatched records?

With the below code, I'm getting the desired number in the output. Now could you please help me how can I say to the layman why 15812 records are not matching in 'Non_Matching_Feb_2017' dataset? Since I'm new to the banking and insurance domain I find difficult to convey my output verbally to the layman.

``````Data Matching_Feb_2017 Non_Matching_Feb_2017;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

/*if b and not a then output Non_Matching_Dec_2016;*/

run;
``````

Contributor
Posts: 27

## Re: How to find the Mismatched records?

during Feb 15812 records where not found with a policy number and a premium begin date, This affected 12180 policies.