From Beginning to the End: Match Data and Extract the Result

Cyclist · Posted 06-01-2016 05:34 AM

Which building blocks and, which order and in which way need to be used to both match data from different (two) data sources and to extract the result with the correct data? Can you please describe in detail the process from the point of retreiving the data to the point of preparing the end result for further usage (my end result should be in tables on MS SQL Server).
I am having difficulties with understanding how SAS Data Mgmt Studo (DataFlux) works, and I cannot find online any documentation with good and straightforward expamples.

RonAgresta · Posted 06-02-2016 08:14 AM

Hi,

I'm going to assume the "result with the correct data" you are looking for is the combined "best view" of the matched data from both sources. Here goes:

Set up connections to your source and target data through the "Data" riser on the left in DM Studio.
In DM Studio, create a new data job
Use a data source node (for database tables) or other data input nodes to connect to the data in each source
Use a Data Union node to bring the data together in one path. Try to align the fields correctly where for example, name data from both sources ends up in the same column. Be sure to add the extra columns from one side or the other where columns don't exist on the other side.
If you are looking for an approximate match ("fuzzy"), use the Match Codes node on each columns that should be part of your matching rules. These match codes can be used as input to the clustering node described below.
Use a Clustering node to create match rules using a combination of match codes where you want fuzziness in the match and non-match code columns where you want exact matches. Choose the sort output option.
Once you have match clusters (these are groupings of records that appear, based on your match rules, to be similar enough to be considered the same data), use a Surviving Record Identification node to build rules to compose a "best record," which is a combination of the values you want from the matched rows, into a single row to represent all the data that has been matched in each cluster. The end result of this node should be single "best records" that represent unique data in your original two sources.
Select a Data Target (Insert) node to choose your SQL Server database as the target and write the results there.

If you haven't tracked down good examples on the SAS support site, one good alternative source is the online proceedings from past SAS Global Forums.

Ron

Cyclist · Posted 06-02-2016 08:20 AM

Hi Ron,

many thanks for your reply.

I have managed to do everything you said up to clustering part (inclusive).

Now I am at the point where my clusters have between few hunderts and few thousand rows.

I would really appreciate if you could give me the detailed explanation with a simple examle of the next step - using Surviving Record Identification node.

Regards!

d.

Cyclist · Posted 06-02-2016 08:55 AM

In additon to my previous question, I would also like to ask for the guidelines in matching the adresses from Contract_Config table to AddressMaster table.

The table Contract_Config contains different types of adresses (4 types) with the corresponding fields (see the attachment). Each of those 4 types of addresses has to be matched with the adressMaster table (which contains only one set of fields for address definition - seet the attachment).

I am at the stage where I have (a) defined the data connections, and (b) loaded the data with the data source nodes. How do I need to proceed, which nodes I need to use, and how can I join 4 different types of addresses from one table with adressmaster?

NOTE: the date in the contract config table is not clean. Meaning for example, LINE_LINE_STREET field contains not only information about the street, but also the information about the house/door number. Does that mean parsing has to be applied before Data Union step?

EDIT_1: I have also attached the screenshot of the data job. Is this the correct approach?

I have also attached the screenshot of the node Surviving Record Identification. Can you please help me with the definition?

Looking forward to your reply.

Best regards,

d.

RonAgresta · Posted 06-02-2016 11:48 AM

Read the documentation on cross-field clustering rules. That's the approach you will want to use to attempt to match addresses across columns.
Look into using a Parsing node before the Union step to break apart data into smaller tokens so you can align things better. If a parse definition does not exist for your scenario, you may have to use an Expression node and some regular expression code to help you out. Or, if you really want to get into advanced topics, you can use the Customize component and create custom QKB parse definitions to break apart the address data to meet your needs.
For the Surviving Record Identification node, if you add no rules, then one record from the cluster will be used to represent the entire cluster somewhat randomly. What you should really be doing is determining a) what record do you want to select as the "base" best record (record level rules) and b) determining what field values you want to end up in the best record (field level rules). The documentation for DM Studio describes this in more detail.

From Beginning to the End: Match Data and Extract the Result

Re: From Beginning to the End: March Data and Extract the Result

Re: From Beginning to the End: March Data and Extract the Result

Re: From Beginning to the End: March Data and Extract the Result

Re: From Beginning to the End: March Data and Extract the Result