Re: SAS Data Job - Matching Addresses and Evaluatiing the End Result

Cyclist · Posted 06-07-2016 05:15 AM

Could you please have a look at the data job I created.
The goal is to match adressess from CONFIG_CONTRACTXML table with the address master table (AT_address_export).

(I am completely new to SAS and am using it for around 10 day. So, the help and expert feedback would be very appreciated).

In particular, I would like to know is the definition in Data Union well made. The lowest level in the address hierarchy is the 'Door' level. In the node 'Data Union' I joined the 2 data sources to the lowest level (door). Is that correct? or should I go to 'House Number'?

Are Matching codes well defined?

(a) in regard to the definition of Heuse_Number, Stairway and Door is it correct to take 'Address' as definition?

(b) when doing the matching, is it suggested to go to the lowest level (door)? Or maybe House Number would be better?

Are clusters well/reasonable defined?

Surviving record as well?

I would also need the help with interpreting the final result: Contract_IDs in the CONFIG_CONTRACTXML table can show up multiple times (they are not unique). What do I need to do to make a statement about how SAS improved the data quality with this job? Help me, help you sell SAS Software 🙂 I am using SAS Data Mgmt Studio 2.7

And this is a crutial point, because the output I see is a table which contains the number of rows equal to the sum of the number of rows of two data sources. The big question is: if the adressmatch before was around 65%, to which extent did SAS help improve the address matching? How can I evaluate that based on the end result (the one table on MS SQL Server)? What is the percentage increase in matching after using SAS software?

Looking forward to more detailed feedback.

Regards!

UPDATE: 8th, June, 2016

I have updated the Data Union node, where I made the join to the lowest address level - door (see the picture in the attachment). I do not understand why I only hava NULL values in the final result. Can anybody explain this please?

RonAgresta · Posted 06-15-2016 12:30 PM

To answer a few of your questions:

If your output has the same number of rows as the input tables combined, it's likely you do not have your matching rules set up correctly (or you have no duplicates at all - possible, but based on this post, it sounds like you are expecting matches). Loosen up your clustering rules and/or decrease sensitivity and examine the results. Does it look reasonable? If you are overmatching, tighten up the match criteria based on what you see in the results.
There is a "Cluster Diff" node that can be used to compare one set of match results to another. Use this to test matching rules sets against each other to determine if your clustering rules/sensitivities are getting you closer to desired results.
In the surviving record identification node there are options to "keep duplicate records" and "generate distinct surviving records." The former will let you collect records in a clutser of matches, rather than outputting only one survivor that represents the entire match cluster. The latter will create a completely new record to represent the best record based on the SRI rules you design. This can be useful when your requirements dictate that all orginal data must be preservered in the system. Using one or both of these options will have an effect on the number of rows you end up with.
I can't help you determine percentage increase in matching after you run your match process. I don't know what the initial value is, how you created it, or what it's measuring. In the abstract, you would want to count the total number of input rows and compare that to only the number of surviving/best record rows that come out of the process. Assuming that the matching results are reasonable, you could then determine the number of duplicates that the process helped identify and resolve.

Ron

Cyclist · Posted 06-16-2016 04:33 AM

Hi Ron,

Thank you for your reply.

Here is another question: I am matching City_Name, Postal_CD, Street_name, Building_number, Door_Number. The fact ist, that only 10% of my data has door numbers. but I had to define 2 clusters:

First Cluster: City_Name, Postal_CD, Street_name, Building_number, Door_Number
Second Cluster: City_Name, Postal_CD, Street_name, Building_number

The problem is, if I define ONLY first one, the software will match only 10% of my data. On the other hand, if there is ONLY the second one, I get very large clusters (for example, when there are no defined door_numbers for a large building or door number is unknown. I am talking about clusters of size 7000 rows).

So, is there any possibility to tell the software to do the following

(a) when the door_number does not exist, than match with the other table which has only one record in which "door_number='unknown'"
(b) when the door_number exists, than simply match with the door_number and try to find only one match.

Regards!

RonAgresta · Posted 06-16-2016 08:10 AM

You have two cluster rules, one of which completely supersedes the other. When used together, your second rule will always define the clusters members because it is less specific than the first. The way clustering works in the application, it doesn't stop if one of the conditions is met, it will always process both.

It's curious that you don't have an indivdual's name (or business name) in your matching rules. Your first rule potentially matches all residents sharing an address while the second would match all residents of a building. Name, gender, phone number, or some other uniquely identifying information used in match rules would help your results.

You could use the Cluster Analysis and Sub-Clustering nodes after the main clustering step (using cluster rule 2) to further score the results in each cluster, using door_number as an element in one of the evaluation rules.

Ron

Cyclist · Posted 06-16-2016 08:41 AM

Hi Ron,

out of 161.000 records in the table which data quality I try to improve by matching the address data with the adress master, there are 31.000 records which contain first and last name. All other records have NULL in those fields.

here are all fields in the CONTRACTS table:

,[CONTRACT_ID]
,[FIRSTNAME]
,[LASTNAME]
,[CITY]
,[STREET]
,[POSTAL_CD]
,[HOUSENUMBER]
,[STAIRWAY]
,[FLOOR]
,[DOOR]
,[SOURCE]

Fields in Address Master:

[APARTMENT_ID]
,[PRIMARY_APARTMENT_ID]
,[BUILDING_ID]
,[PRIMARY_BUILDING_ID]
,[CPR_HAUS_ID]
,[CPR_LOCATION_ID]
,[FLAT_LOCATION_FLAT_ID]
,[PAC]
,[CITY_NAME]
,[STREET_NAME]
,[BUILDINGNUMBER],
[ENTRANCE]
,[DOOR_NUMBER]
,[ZIP_CODE]
,[ADDRESS_TYPE]
,[BUILDINGCOORDINATEX]
,[BUILDINGCOORDINATEY]
,[COUNTRYCODEA2]

Finaly, what is your suggestion for defining cluster rules? If I understood you right, the way I defined them, is not optimal (as I told you my clusters have approx. 7000 rows)

Regards!

SAS Data Job - Matching Addresses and Evaluatiing the End Result