Solved: Re: How to find the Mismatched records?

Babloo · Posted 03-01-2018 10:24 AM

I ran the following code to find the records which are available in A and not in B. I've 5918065 records in A dataset and 5902253 records in B dataset and I want to see the missing 15812 records. After running this code, I see only a 12180 records in 'Non_Matching_Feb_2017' dataset instead of 15812. Could someone of you help me with the code to see the missing 15812 records in Dataset B?

Data Matching_Feb_2017 Non_Matching_Feb_2017;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

run;

Log:

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.

Astounding · Posted 03-01-2018 11:05 AM

Actually, the 12,180 number is correct. But other issues require attention.

First, POLICY_NO is not unique. It is not unique in the data set A, and not unique in data set B. That's what this line is telling you:

NOTE: MERGE statement has more than one data set with repeats of BY values.

Maybe that means you have to fix the data first. Maybe that means you should be merging by more than one variable instead of just by POLICY_NO. Maybe that means you need to adjust your expectations about the number of observations. Ironically, however, the 12,180 number is correct. There are 12,180 observations where the POLICY_NO appears in A but not in B. That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix). You would need to inspect the output data set to determine that.

Take this example: POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B. There would be 1,000 more observations in data set A than in data set B. But only one POLICY_NO that is mismatched.

View solution in original post

VDD · Posted 03-01-2018 10:31 AM

did you check to see if you have records in b that are not in a?

Babloo · Posted 03-01-2018 10:46 AM

Done already. it's 0 observations.

SuryaKiran · Posted 03-01-2018 10:40 AM

Hi,

If Table A has 5,918,065 and Table B has 5,902,253 records why are you expecting that all records from table B are present in Table A.

In your condition your are saying that get all the records if Policy_no is present in Table A and Not present in Table B.

Thanks,
Suryakiran

Babloo · Posted 03-01-2018 10:48 AM

I'm not excepting all records from table B should be present in Table A. I would like to see the records which are available in A dataset and not in B dataset irrespective of any condition. Could you please help with the SQL to achieve this?

VDD · Posted 03-01-2018 10:56 AM

proc sql;

create table abc as

select distinct * from a

where policy_no not in

(select distinct policy_no

from b);

quit;

SuryaKiran · Posted 03-01-2018 10:57 AM

Did you check for duplicate values in the data-sets?

Thanks,
Suryakiran

Babloo · Posted 03-01-2018 11:07 AM

Yes, there are duplicates but still it is not matching the number what I
expect. I'm not certain whether changing the by variable will work here.

SuryaKiran · Posted 03-01-2018 11:11 AM

Change the BY variable so that you have all unique values in both the tables. If you don't get what you expected then you may have duplicates through all the by vars.

check id proc sort nodupkey removes any duplicates in both the tables for the BY variables.

Why are you worried about the counts, the process your trying is giving you the right output and you may not be right about the count your expecting because you are including the duplicate records.

Thanks,
Suryakiran

Astounding · Posted 03-01-2018 11:05 AM

Actually, the 12,180 number is correct. But other issues require attention.

First, POLICY_NO is not unique. It is not unique in the data set A, and not unique in data set B. That's what this line is telling you:

NOTE: MERGE statement has more than one data set with repeats of BY values.

Maybe that means you have to fix the data first. Maybe that means you should be merging by more than one variable instead of just by POLICY_NO. Maybe that means you need to adjust your expectations about the number of observations. Ironically, however, the 12,180 number is correct. There are 12,180 observations where the POLICY_NO appears in A but not in B. That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix). You would need to inspect the output data set to determine that.

Take this example: POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B. There would be 1,000 more observations in data set A than in data set B. But only one POLICY_NO that is mismatched.

Babloo · Posted 03-01-2018 11:17 AM

In order to get the number what I expect, can I merge it with all the
variables?

Reeza · Posted 03-01-2018 11:19 AM

Data Matching_Feb_2017 Non_Matching_Feb_2017 CHECK_DATA;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

if b and not a then output check_data;

run;

Post the log from the code above.

Second - this type of message usually means your join needs another variable. So which SET of variables uniquely identify each record, it's clearly not just Policy_NO. If it's insurance policy there's usually a start or end date or renewal number that can be used as well to join because my policy number doesn't change every year, but it has a new entry in the data every year.

Babloo · Posted 03-01-2018 11:28 AM

Here you go:

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.
NOTE: The data set WORK.CHECK_DATA has 0 observations and 34 variables.

Babloo · Posted 03-01-2018 11:52 AM

With the below code, I'm getting the desired number in the output. Now could you please help me how can I say to the layman why 15812 records are not matching in 'Non_Matching_Feb_2017' dataset? Since I'm new to the banking and insurance domain I find difficult to convey my output verbally to the layman.

Data Matching_Feb_2017 Non_Matching_Feb_2017;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No PREMIUM_BEG_DT;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

/*if b and not a then output Non_Matching_Dec_2016;*/

run;

VDD · Posted 03-01-2018 11:58 AM

during Feb 15812 records where not found with a policy number and a premium begin date, This affected 12180 policies.
think this states your meaning.

SAS Training: Just a Click Away