BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Babloo
Rhodochrosite | Level 12

I ran the following code to find the records which are available in A and not in B. I've 5918065 records in A dataset and 5902253 records in B dataset and I want to see the missing 15812 records. After running this code, I see only a 12180 records in 'Non_Matching_Feb_2017' dataset instead of 15812. Could someone of you help me with the code to see the missing 15812 records in Dataset B? 

  

Data Matching_Feb_2017 Non_Matching_Feb_2017;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

run;

 

Log:

  

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Actually, the 12,180 number is correct.  But other issues require attention.

 

First, POLICY_NO is not unique.  It is not unique in the data set A, and not unique in data set B.  That's what this line is telling you:

 

NOTE: MERGE statement has more than one data set with repeats of BY values.

 

Maybe that means you have to fix the data first.  Maybe that means you should be merging by more than one variable instead of just by POLICY_NO.  Maybe that means you need to adjust your expectations about the number of observations.  Ironically, however, the 12,180 number is correct.  There are 12,180 observations where the POLICY_NO appears in A but not in B.  That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix).  You would need to inspect the output data set to determine that.

 

Take this example:  POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B.  There would be 1,000 more observations in data set A than in data set B.  But only one POLICY_NO that is mismatched.

 

 

View solution in original post

19 REPLIES 19
VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

did you check to see if you have records in b that are not in a?

 

Babloo
Rhodochrosite | Level 12

Done already. it's 0 observations.

SuryaKiran
Meteorite | Level 14

 

Hi,

 

If Table A has 5,918,065 and Table B has 5,902,253 records why are you expecting that all records from table B are present in Table A.

In your condition your are saying that get all the records if Policy_no is present in Table A and Not present in Table B. 

Thanks,
Suryakiran
Babloo
Rhodochrosite | Level 12

I'm not excepting all records from table B should be present in Table A. I would like to see the records which are available in A dataset and not in B dataset irrespective of any condition. Could you please help with the SQL to achieve this?

VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

proc sql;

create table abc as

select distinct * from a

where policy_no not in

(select distinct policy_no

from b);

quit;

SuryaKiran
Meteorite | Level 14

Did you check for duplicate values in the data-sets?

Thanks,
Suryakiran
Babloo
Rhodochrosite | Level 12
Yes, there are duplicates but still it is not matching the number what I
expect. I'm not certain whether changing the by variable will work here.
SuryaKiran
Meteorite | Level 14

Change the BY variable so that you have all unique values in both the tables. If you don't get what you expected then you may have duplicates through all the by vars.

 

check id proc sort nodupkey removes any duplicates in both the tables for the BY variables.

 

Why are you worried about the counts, the process your trying is giving you the right output and you may not be right about the count your expecting because you are including the duplicate records.

 

Thanks,
Suryakiran
Astounding
PROC Star

Actually, the 12,180 number is correct.  But other issues require attention.

 

First, POLICY_NO is not unique.  It is not unique in the data set A, and not unique in data set B.  That's what this line is telling you:

 

NOTE: MERGE statement has more than one data set with repeats of BY values.

 

Maybe that means you have to fix the data first.  Maybe that means you should be merging by more than one variable instead of just by POLICY_NO.  Maybe that means you need to adjust your expectations about the number of observations.  Ironically, however, the 12,180 number is correct.  There are 12,180 observations where the POLICY_NO appears in A but not in B.  That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix).  You would need to inspect the output data set to determine that.

 

Take this example:  POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B.  There would be 1,000 more observations in data set A than in data set B.  But only one POLICY_NO that is mismatched.

 

 

Babloo
Rhodochrosite | Level 12
In order to get the number what I expect, can I merge it with all the
variables?
Reeza
Super User
Data Matching_Feb_2017 Non_Matching_Feb_2017 CHECK_DATA;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

if b and not a then output check_data; run;

 

Post the log from the code above. 

 

Second - this type of message usually means your join needs another variable. So which SET of variables uniquely identify each record, it's clearly not just Policy_NO. If it's insurance policy there's usually a start or end date or renewal number that can be used as well to join because my policy number doesn't change every year, but it has a new entry in the data every year. 

 

Babloo
Rhodochrosite | Level 12

Here you go:

 

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.
NOTE: The data set WORK.CHECK_DATA has 0 observations and 34 variables.
Babloo
Rhodochrosite | Level 12

With the below code, I'm getting the desired number in the output. Now could you please help me how can I say to the layman why 15812 records are not matching in 'Non_Matching_Feb_2017' dataset? Since I'm new to the banking and insurance domain I find difficult to convey my output verbally to the layman.

 

 

Data Matching_Feb_2017 Non_Matching_Feb_2017;

merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);

by POLICY_No PREMIUM_BEG_DT;

if a and b then output Matching_Feb_2017;

if a and not b then output Non_Matching_Feb_2017;

/*if b and not a then output Non_Matching_Dec_2016;*/

run;

 

VDD
Ammonite | Level 13 VDD
Ammonite | Level 13
during Feb 15812 records where not found with a policy number and a premium begin date, This affected 12180 policies.
think this states your meaning.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 19 replies
  • 9753 views
  • 7 likes
  • 6 in conversation