I ran the following code to find the records which are available in A and not in B. I've 5918065 records in A dataset and 5902253 records in B dataset and I want to see the missing 15812 records. After running this code, I see only a 12180 records in 'Non_Matching_Feb_2017' dataset instead of 15812. Could someone of you help me with the code to see the missing 15812 records in Dataset B?
Data Matching_Feb_2017 Non_Matching_Feb_2017;
merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);
by POLICY_No;
if a and b then output Matching_Feb_2017;
if a and not b then output Non_Matching_Feb_2017;
run;
Log:
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.
Actually, the 12,180 number is correct. But other issues require attention.
First, POLICY_NO is not unique. It is not unique in the data set A, and not unique in data set B. That's what this line is telling you:
NOTE: MERGE statement has more than one data set with repeats of BY values.
Maybe that means you have to fix the data first. Maybe that means you should be merging by more than one variable instead of just by POLICY_NO. Maybe that means you need to adjust your expectations about the number of observations. Ironically, however, the 12,180 number is correct. There are 12,180 observations where the POLICY_NO appears in A but not in B. That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix). You would need to inspect the output data set to determine that.
Take this example: POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B. There would be 1,000 more observations in data set A than in data set B. But only one POLICY_NO that is mismatched.
did you check to see if you have records in b that are not in a?
Done already. it's 0 observations.
Hi,
If Table A has 5,918,065 and Table B has 5,902,253 records why are you expecting that all records from table B are present in Table A.
In your condition your are saying that get all the records if Policy_no is present in Table A and Not present in Table B.
I'm not excepting all records from table B should be present in Table A. I would like to see the records which are available in A dataset and not in B dataset irrespective of any condition. Could you please help with the SQL to achieve this?
proc sql;
create table abc as
select distinct * from a
where policy_no not in
(select distinct policy_no
from b);
quit;
Did you check for duplicate values in the data-sets?
Change the BY variable so that you have all unique values in both the tables. If you don't get what you expected then you may have duplicates through all the by vars.
check id proc sort nodupkey removes any duplicates in both the tables for the BY variables.
Why are you worried about the counts, the process your trying is giving you the right output and you may not be right about the count your expecting because you are including the duplicate records.
Actually, the 12,180 number is correct. But other issues require attention.
First, POLICY_NO is not unique. It is not unique in the data set A, and not unique in data set B. That's what this line is telling you:
NOTE: MERGE statement has more than one data set with repeats of BY values.
Maybe that means you have to fix the data first. Maybe that means you should be merging by more than one variable instead of just by POLICY_NO. Maybe that means you need to adjust your expectations about the number of observations. Ironically, however, the 12,180 number is correct. There are 12,180 observations where the POLICY_NO appears in A but not in B. That might represent 12,180 unique policy numbers, or it might represent 1 POLICY_NO that appears 12,180 times in the data set A (or some "in between" mix). You would need to inspect the output data set to determine that.
Take this example: POLICY_NO "ABC" appears 1,000 times in data set A, but not at all in data set B. There would be 1,000 more observations in data set A than in data set B. But only one POLICY_NO that is mismatched.
Data Matching_Feb_2017 Non_Matching_Feb_2017 CHECK_DATA;
merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);
by POLICY_No;
if a and b then output Matching_Feb_2017;
if a and not b then output Non_Matching_Feb_2017;
if b and not a then output check_data;
run;
Post the log from the code above.
Second - this type of message usually means your join needs another variable. So which SET of variables uniquely identify each record, it's clearly not just Policy_NO. If it's insurance policy there's usually a start or end date or renewal number that can be used as well to join because my policy number doesn't change every year, but it has a new entry in the data every year.
Here you go:
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 5918065 observations read from the data set WORK.FEB_NBRS_WITHOUT_DT.
NOTE: There were 5902253 observations read from the data set WORK.SOURCE_33_COUNT.
NOTE: The data set WORK.MATCHING_FEB_2017 has 5905885 observations and 34 variables.
NOTE: The data set WORK.NON_MATCHING_FEB_2017 has 12180 observations and 34 variables.
NOTE: The data set WORK.CHECK_DATA has 0 observations and 34 variables.
With the below code, I'm getting the desired number in the output. Now could you please help me how can I say to the layman why 15812 records are not matching in 'Non_Matching_Feb_2017' dataset? Since I'm new to the banking and insurance domain I find difficult to convey my output verbally to the layman.
Data Matching_Feb_2017 Non_Matching_Feb_2017;
merge Feb_nbrs_without_dt(in=a) source_33_count(in=b);
by POLICY_No PREMIUM_BEG_DT;
if a and b then output Matching_Feb_2017;
if a and not b then output Non_Matching_Feb_2017;
/*if b and not a then output Non_Matching_Dec_2016;*/
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.