For my work I've been asked to check the data quality of older legacy records against another separate database as there has been concerns on the quality of information when those were brought into the current system. I've currently mostly done match merging based on selected identifiers to provide the team with records that are consistent between the two datasets and separate out records that are failing on one of that list but I've been asked if it's possible to identify where the mismatch in records is occurring as some look like they should be merging properly but aren't. I do not know if it's possible for sas to output it's decision making process during the match but if so that would be very helpful.
Can you show the code you are using for the match merging?
If you are using a data step with a MERGE, it's pretty straight forward to output the records that don't match.
I'm not sure what you mean by "why" or the "decision-making process." If you've got a simple match-merge, the decision making process is, do they or do they not match on all the specified BY variables.
A small example of data you are comparing might also help.
data merged;
set x y;
by a b c d e f;
if in1 and in2;
run;
This is the basic structure of the code I'm running. outputting the ones that don't match isn't the issue. I'm trying to see if I can get SAS to show where things are not matching for the records versus being able to tell that the unmatched records are failing at a, b, c, d, e, or f but not knowing without physically checking which one is the fault point. I'm giving proc compare a try as recommended below but it's tended to crash the program as the files are several hundred thousand records each.
Whoop that is meant to be merge. Thank you for noting that.
I suggest you make a little example with, say 5-10 records in work.x and 5-10 records in work.y, with just three by-variables a b c.
Then think about how to operationalize the "fault point".
You would need a rule, like "if a record from work.x doesn't match to a record in work.y by A B C but it does match to a record by A B, then the fault point is C."
First, keep your non-matches:
data non-match;
merge
x (in=in1)
y (in=in2)
;
by a b c d e f;
if in1 ne in2;
run;
Then start by finding multiples when grouping along subsets of key variables:
proc sql;
create table try_b as
select *, count(*) as count
from non-match
group by a,c,d,e,f
having count > 1
;
quit;
See if this leads you somewhere.
I like @Kurt_Bremser 's approach for getting sense of your data, and starting to think about why records don't match.
I think another way to approach this would be as a fuzzy-merging problem. So you would take the mismatches from x and the mismatches from y, and then try to join them by fuzzy-merging by a,b,c,d,e,f. If you go to lexjansen.com and look up fuzzy merge or fuzzy join, you should get plenty of ideas.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.