BookmarkSubscribeRSS Feed
jmarnaez
Fluorite | Level 6

For my work I've been asked to check the data quality of older legacy records against another separate database as there has been concerns on the quality of information when those were brought into the current system. I've currently mostly done match merging based on selected identifiers to provide the team with records that are consistent between the two datasets and separate out records that are failing on one of that list but I've been asked if it's possible to identify where the mismatch in records is occurring as some look like they should be merging properly but aren't. I do not know if it's possible for sas to output it's decision making process during the match but if so that would be very helpful.

8 REPLIES 8
Quentin
Super User

Can you show the code you are using for the match merging?

 

If you are using a data step with a MERGE, it's pretty straight forward to output the records that don't match.  

 

I'm not sure what you mean by "why" or the "decision-making process."  If you've got a simple match-merge, the decision making process is, do they or do they not match on all the specified BY variables.

 

A small example of data you are comparing might also help.

jmarnaez
Fluorite | Level 6

data merged;

set x y;

by a b c d e f;

if in1 and in2;

run;

 

This is the basic structure of the code I'm running. outputting the ones that don't match isn't the issue. I'm trying to see if I can get SAS to show where things are not matching for the records versus being able to tell that the unmatched records are failing at a, b, c, d, e, or f but not knowing without physically checking which one is the fault point. I'm giving proc compare a try as recommended below but it's tended to crash the program as the files are several hundred thousand records each.

Reeza
Super User
Several hundred thousand records should not be an issue, unless you're using Academics on Demand.

You're not really merging here with a SET statement, you're interleaving. Are you sure you have the correct code?
What happens if you change SET to MERGE instead?

jmarnaez
Fluorite | Level 6

Whoop that is meant to be merge. Thank you for noting that.

Quentin
Super User

I suggest you make a little example with, say 5-10 records in work.x and 5-10 records in work.y, with just three by-variables a b c.

 

Then think about how to operationalize the "fault point".  

 

You would need a rule, like "if a record from work.x doesn't match to a record in work.y by A B C but it does match to a record by A B, then the fault point is C."

Kurt_Bremser
Super User

First, keep your non-matches:

data non-match;
merge
  x (in=in1)
  y (in=in2)
;
by a b c d e f;
if in1 ne in2;
run;

Then start by finding multiples when grouping along subsets of key variables:

proc sql;
create table try_b as
  select *, count(*) as count
  from non-match
  group by a,c,d,e,f
  having count > 1
;
quit;

See if this leads you somewhere.

Quentin
Super User

I like @Kurt_Bremser 's approach for getting  sense of your data, and starting to think about why records don't match.

 

I think another way to approach this would be as a fuzzy-merging problem.  So you would take the mismatches from x and the mismatches from y, and then try to join them by fuzzy-merging by a,b,c,d,e,f.  If you go to lexjansen.com and look up fuzzy merge or fuzzy join, you should get plenty of ideas.

Reeza
Super User
PROC COMPARE may be what you're looking for with a BY statement.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 3221 views
  • 3 likes
  • 4 in conversation