06-07-2016 05:15 AM - edited 06-08-2016 01:37 AM
Could you please have a look at the data job I created.
The goal is to match adressess from CONFIG_CONTRACTXML table with the address master table (AT_address_export).
(I am completely new to SAS and am using it for around 10 day. So, the help and expert feedback would be very appreciated).
In particular, I would like to know is the definition in Data Union well made. The lowest level in the address hierarchy is the 'Door' level. In the node 'Data Union' I joined the 2 data sources to the lowest level (door). Is that correct? or should I go to 'House Number'?
Are Matching codes well defined?
(a) in regard to the definition of Heuse_Number, Stairway and Door is it correct to take 'Address' as definition?
(b) when doing the matching, is it suggested to go to the lowest level (door)? Or maybe House Number would be better?
Are clusters well/reasonable defined?
Surviving record as well?
I would also need the help with interpreting the final result: Contract_IDs in the CONFIG_CONTRACTXML table can show up multiple times (they are not unique). What do I need to do to make a statement about how SAS improved the data quality with this job? Help me, help you sell SAS Software I am using SAS Data Mgmt Studio 2.7
And this is a crutial point, because the output I see is a table which contains the number of rows equal to the sum of the number of rows of two data sources. The big question is: if the adressmatch before was around 65%, to which extent did SAS help improve the address matching? How can I evaluate that based on the end result (the one table on MS SQL Server)? What is the percentage increase in matching after using SAS software?
Looking forward to more detailed feedback.
UPDATE: 8th, June, 2016
I have updated the Data Union node, where I made the join to the lowest address level - door (see the picture in the attachment). I do not understand why I only hava NULL values in the final result. Can anybody explain this please?
06-15-2016 12:30 PM
To answer a few of your questions:
06-16-2016 04:33 AM
Thank you for your reply.
Here is another question: I am matching City_Name, Postal_CD, Street_name, Building_number, Door_Number. The fact ist, that only 10% of my data has door numbers. but I had to define 2 clusters:
The problem is, if I define ONLY first one, the software will match only 10% of my data. On the other hand, if there is ONLY the second one, I get very large clusters (for example, when there are no defined door_numbers for a large building or door number is unknown. I am talking about clusters of size 7000 rows).
So, is there any possibility to tell the software to do the following
06-16-2016 08:10 AM
You have two cluster rules, one of which completely supersedes the other. When used together, your second rule will always define the clusters members because it is less specific than the first. The way clustering works in the application, it doesn't stop if one of the conditions is met, it will always process both.
It's curious that you don't have an indivdual's name (or business name) in your matching rules. Your first rule potentially matches all residents sharing an address while the second would match all residents of a building. Name, gender, phone number, or some other uniquely identifying information used in match rules would help your results.
You could use the Cluster Analysis and Sub-Clustering nodes after the main clustering step (using cluster rule 2) to further score the results in each cluster, using door_number as an element in one of the evaluation rules.
06-16-2016 08:41 AM - edited 06-16-2016 02:39 PM
out of 161.000 records in the table which data quality I try to improve by matching the address data with the adress master, there are 31.000 records which contain first and last name. All other records have NULL in those fields.
here are all fields in the CONTRACTS table:
Fields in Address Master:
Finaly, what is your suggestion for defining cluster rules? If I understood you right, the way I defined them, is not optimal (as I told you my clusters have approx. 7000 rows)