One of the core features of any master data management solution is the ability to identify matching data in disparate data systems and resolve slight variations into a “best record” that is suitable for use by other users or applications. This best record is a business rule-derived combination of all the data elements from the different source systems rolled into a single authoritative representation.
A key concept in SAS Master Data Management (MDM) is the match cluster. A match cluster is made up of one or more records from contributing source systems and the derived best record. In other words, all records that match each other based on the conditions defined by the data steward end up in a cluster together. So if SAS MDM can find similar data across different systems and combine it into a consistent single view, what’s the problem?
While the goal is to be able to automatically and fully resolve all matching data into completely reliable match clusters, in practice this can be difficult to do. There will always be cases where one or more records grouped together look very similar to each other but are in fact unique. This is called over-matching and ideally it should be addressed by data stewards through a defined process that identifies and then corrects the specious matches.
The key to finding issues like these in SAS MDM is to create an over-match report that scans all clusters individually for data values that appear to be exceptions. For example, consider the following:
To look for and correct these kinds of over-matched records in the SAS MDM system, data stewards can create an MDM Tool that identifies the problem match clusters. They can then use standard Recluster functionality to fix them. MDM Tools can expose, through SAS MDM web application components, any data flow or job logic that can be authored as a batch job or real-time service in Data Management Studio. To design and use an MDM Tool for over-match correction, a data steward would do the following:
To begin, the data steward will design a real-time service in Data Management Studio that will perform the over-match processing. In this example, the data steward will build the tool for use with the standard sample INDIVIDUAL entity type. The new real-time service for an over-match report for an INDIVIDUAL entity type might implement this kind of data flow logic:
When complete, the real-time service will look something like this:
Now that the real-time data service is ready, use Data Management Studio to upload the service into the Real-Time Data Services/sasmdm directory on Data Management Server. Once the service is available on the server, the next step is to define the tool in your SAS MDM web application environment.
For the tool to be accessible in other parts of SAS MDM, either log out of the system and log back in or use the refresh action in your browser. This will refresh MDM metadata in all SAS MDM web application components.
Now that the new tool is in place, it can be invoked in the Master Data Management component of SAS MDM.
After the tool has been run successfully, a new tab with a table full of possible over-matched clusters appears in the Data Management Console.
The interesting columns in this table are the MDM_ENTITY_CLUSTER_ID column and the CONFIDENCE column. Records with the same MDM_ENTITY_CLUSTER_ID are in the same match cluster. The CONFIDENCE column has a value between 0 and 100. The closer it is to 100, the closer the match among the records in the match cluster. The job logic in the real-time service dictates that any cluster with a potential bad match will appear in the report. The lower the CONFIDENCE value, the more likely it is that invalid matches have appeared in the given cluster.
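To make the reading of these two columns concrete, here is a minimal Python sketch of how a data steward might order report rows for review outside the tool. It is not SAS MDM code. The column names come from the report described above, while the sample rows, the FULL_NAME column, the function name and the choice to score a cluster by its lowest CONFIDENCE value are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical report rows. Only the two column names discussed in the
# article are taken from the report; the values are invented.
report_rows = [
    {"MDM_ENTITY_CLUSTER_ID": 101, "CONFIDENCE": 92, "FULL_NAME": "Ann Lee"},
    {"MDM_ENTITY_CLUSTER_ID": 101, "CONFIDENCE": 92, "FULL_NAME": "Anne Lee"},
    {"MDM_ENTITY_CLUSTER_ID": 102, "CONFIDENCE": 41, "FULL_NAME": "Raj Patel"},
    {"MDM_ENTITY_CLUSTER_ID": 102, "CONFIDENCE": 41, "FULL_NAME": "Raj Patel"},
    {"MDM_ENTITY_CLUSTER_ID": 103, "CONFIDENCE": 67, "FULL_NAME": "Mia Cruz"},
    {"MDM_ENTITY_CLUSTER_ID": 103, "CONFIDENCE": 67, "FULL_NAME": "Mia Cruz"},
]


def clusters_by_confidence(rows):
    """Return cluster IDs ordered from lowest to highest confidence."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["MDM_ENTITY_CLUSTER_ID"]].append(row)
    # Assumption: a cluster is scored by the lowest CONFIDENCE among its rows.
    scored = [
        (min(member["CONFIDENCE"] for member in members), cluster_id)
        for cluster_id, members in grouped.items()
    ]
    return [cluster_id for _, cluster_id in sorted(scored)]


review_order = clusters_by_confidence(report_rows)
print(review_order)
```

Reviewing the lowest-confidence clusters first puts the most likely over-matches at the top of the steward's queue.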
In the screenshot provided, all of the clusters in this portion of the results have been flagged because the BIRTH_DATE information indicates that the individuals thought to be the same are actually different people living at the same location. It's also possible that intentional fraudulent activity is going on or that incorrect information is coming from one or more of the source systems. In any of these cases, the data steward has been alerted to a suspicious issue that needs to be addressed.
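The kind of check that flags these clusters can be sketched in a few lines of Python. This is only an illustration of the idea, not the data flow of the real-time service built in Data Management Studio. The function name, the sample records and the rule that a missing BIRTH_DATE is ignored are assumptions.

```python
from collections import defaultdict


def flag_birth_date_conflicts(records):
    """Return IDs of clusters holding more than one distinct BIRTH_DATE."""
    dates_by_cluster = defaultdict(set)
    for record in records:
        birth_date = record.get("BIRTH_DATE")
        # Assumption: a missing birth date is not evidence of a conflict.
        if birth_date:
            dates_by_cluster[record["MDM_ENTITY_CLUSTER_ID"]].add(birth_date)
    return sorted(
        cluster_id
        for cluster_id, dates in dates_by_cluster.items()
        if len(dates) > 1
    )


# Hypothetical contributing records; the values are invented.
sample_records = [
    {"MDM_ENTITY_CLUSTER_ID": 200, "BIRTH_DATE": "1961-03-14"},
    {"MDM_ENTITY_CLUSTER_ID": 200, "BIRTH_DATE": "1988-11-02"},
    {"MDM_ENTITY_CLUSTER_ID": 201, "BIRTH_DATE": "1975-07-30"},
    {"MDM_ENTITY_CLUSTER_ID": 201, "BIRTH_DATE": None},
]

suspect_clusters = flag_birth_date_conflicts(sample_records)
print(suspect_clusters)
```

In this sample, cluster 200 is reported because its two records disagree on the birth date, while cluster 201 is not because only one date is present.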
Now that possible mismatches have been found, the next step is to correct the problem. This assumes we are dealing with legitimate data and not invalid or intentionally incorrect data, circumstances that would call for further investigation. Using standard SAS MDM functionality, data stewards can pull apart mismatched data and move it to other, more appropriate clusters or to new unique clusters, depending on the data.
By right-clicking on the report results and selecting Open Entity, data stewards can access the entire match cluster for further review.
Once in this view, data stewards can explore the record and determine if this match cluster really represents two unique individuals instead of one. If the data steward deems this to be the case, she can access the Recluster feature from the Action menu item. This brings the data steward to a screen where the original cluster is on the top and a new or target cluster is on the bottom.
In this example, the data steward has decided that the cluster really represents two distinct individuals, so she has chosen to move one of the contributing records to a brand new cluster. The move creates a new best record, and there are now two individuals recognized as master data in our system. In this way, a data steward can review and revise incorrectly matched records in an ongoing fashion as new data enters the SAS MDM system.
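The effect of that move can be illustrated with a small Python sketch. It models clusters as plain lists of records and is not how the Recluster feature is implemented. The RECORD_ID field, the function names, the sample data and the survivorship rule (take the first non-empty value for each field) are all hypothetical. In SAS MDM the best record is derived from business rules defined for the entity type.

```python
def best_record(records, fields):
    """Hypothetical survivorship rule: first non-empty value per field."""
    return {
        field: next((r[field] for r in records if r.get(field)), None)
        for field in fields
    }


def recluster(clusters, source_id, record_id, new_id):
    """Move one contributing record into a brand new cluster.

    Returns a new mapping and leaves the input mapping unchanged.
    """
    if new_id in clusters:
        raise ValueError(f"cluster {new_id} already exists")
    source = clusters[source_id]
    moved = [r for r in source if r["RECORD_ID"] == record_id]
    if not moved:
        raise KeyError(f"record {record_id} not found in cluster {source_id}")
    updated = dict(clusters)
    updated[source_id] = [r for r in source if r["RECORD_ID"] != record_id]
    updated[new_id] = moved
    return updated


# Hypothetical over-matched cluster: two people with the same name.
clusters = {
    500: [
        {"RECORD_ID": "crm-1", "FULL_NAME": "John Smith", "BIRTH_DATE": "1955-06-01"},
        {"RECORD_ID": "erp-7", "FULL_NAME": "John Smith", "BIRTH_DATE": "1984-09-23"},
    ]
}
fields = ["FULL_NAME", "BIRTH_DATE"]

updated = recluster(clusters, source_id=500, record_id="erp-7", new_id=501)
for cluster_id, members in updated.items():
    print(cluster_id, best_record(members, fields))
```

After the move, each cluster yields its own best record, which mirrors the two individuals now recognized as master data.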
We have explored just one example of how MDM Tools can be used. Many data-centric MDM business requirements can be met by encapsulating data workflows as MDM Tools built in SAS Data Management Platform. Data stewards may choose to build an entire set of MDM Tools to help find and correct issues in their master data repository.
What other tools would be useful additions to a data steward’s MDM toolbox?