Re: Dataflux 2.8 Clustering Node - Clustering Conditions, Looking for ...

beamer108 · Posted 07-11-2022 03:16 PM

Hi

I've read the 'Help' documentation on how to use the clustering conditions, but I'm still unclear on how to use them.

I have data that contains many variables, such as key id, key list, file name, name, first name, last name, dob, address, phone number, email, etc. New incoming data needs to be matched and clustered with existing data if any of these conditions are met. Key id would be the ideal field to match on, but unfortunately the data we receive comes from many sources, and the Key id is sometimes unique only within its own data source, not across all sources. For that reason we have to match on the other fields that come in.

Is there a hierarchy to how the matching is done in a cluster node - meaning if a match is found on the first condition does that mean the rest of the conditions are not considered?

What exactly does it mean for a 'cross match'?

This is what I have for my first clustering node:

I feel like my matches have become redundant - but because I don't truly understand how this cluster node works I add any match I can think of.

One other example I have a question about: we have Key ID as a field, but we also have Key List as a field in case a person has more than one Key ID; all of that person's Key IDs are added to the list. So in one condition I have Key ID + First Name + File Name, then a cross match to Key ID + Key List + First Name + File Name, cross match Key ID + Last Name + File Name, cross match Key ID + Key List + Last Name + File Name.

Can someone explain to me in simple terms what exactly I am matching on for that example?

If there is documentation that explains the Clustering Node and how it's used in more detail, with examples other than what appears in the Help selection within Dataflux, please let me know and I'll reference that to learn more. Otherwise I'd appreciate some guidance on how I should be using this cluster node more effectively.

Thank you

VincentRejany · Posted 07-11-2022 05:18 PM

Hi

Is there a hierarchy to how the matching is done in a cluster node - meaning if a match is found on the first condition does that mean the rest of the conditions are not considered?
--> No, all matching rules are evaluated
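
To illustrate what "all matching rules are evaluated" implies, here is a minimal Python sketch, not DataFlux's actual implementation. It assumes that each condition is a set of fields that must all be equal, that a match on any one condition is enough to link two records, that links are transitive, and that blank values never match. The field names and sample records are hypothetical.

```python
def cluster(records, conditions):
    """Return one cluster id per record.

    Two records are linked when they agree on ALL fields of ANY condition.
    Every condition is evaluated; none short-circuits the others.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller index as the root so ids are stable.
            parent[max(ra, rb)] = min(ra, rb)

    for fields in conditions:
        first_seen = {}
        for idx, rec in enumerate(records):
            key = tuple(rec.get(f) for f in fields)
            # Assumption: a blank field never produces a match.
            if any(v in (None, "") for v in key):
                continue
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    return [find(i) for i in range(len(records))]


# Hypothetical sample data.
records = [
    {"key_id": "100", "file_name": "crm", "email": "a@x.com"},
    {"key_id": "100", "file_name": "crm", "email": "b@x.com"},
    {"key_id": "999", "file_name": "erp", "email": "b@x.com"},
    {"key_id": "555", "file_name": "erp", "email": "c@x.com"},
]
conditions = [("key_id", "file_name"), ("email",)]
cluster_ids = cluster(records, conditions)
print(cluster_ids)
```

In this sketch the first two records are linked by the first condition and the second and third by the email condition, so all three end up in one cluster even though the first and third share no field directly.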

What exactly does it mean for a 'cross match'?
A typical scenario is when you have EMAIL1 and EMAIL2 columns. With cross matching, EMAIL1 values are matched against each other, as are EMAIL2 values, and in addition EMAIL1 values are also matched against EMAIL2 values.
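
The EMAIL1/EMAIL2 description can be sketched in Python as follows. This is only an illustration of the idea, not the node's internal logic; the function names, the sample records, and the choice to ignore blank values are assumptions.

```python
def same_field_match(rec_a, rec_b, fields):
    """Without cross matching: EMAIL1 vs EMAIL1, EMAIL2 vs EMAIL2 only."""
    return any(
        bool(rec_a.get(f)) and rec_a.get(f) == rec_b.get(f)
        for f in fields
    )


def cross_match(rec_a, rec_b, fields):
    """With cross matching: any listed field of A may equal any listed field of B."""
    values_a = {rec_a[f] for f in fields if rec_a.get(f)}
    values_b = {rec_b[f] for f in fields if rec_b.get(f)}
    return bool(values_a & values_b)


# Hypothetical records: the shared address sits in different columns.
a = {"EMAIL1": "jo@x.com", "EMAIL2": ""}
b = {"EMAIL1": "other@x.com", "EMAIL2": "jo@x.com"}
email_fields = ("EMAIL1", "EMAIL2")

print(same_field_match(a, b, email_fields))  # False
print(cross_match(a, b, email_fields))       # True
```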

Regarding your scenario, I would suggest you add a source identifier (e.g. CRM -> 1, Call Center -> 2, ERP -> 3), so you can match on Source System ID/Key.
A cross match on Key and Key list could make sense.
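
As a rough Python sketch of what matching on source plus a Key / Key List cross match is meant to achieve: two records from the same source are treated as the same entity when any key known for one is also known for the other. Everything specific here is an assumption: the field names, the comma-separated format of the key list, and the step that splits the list into individual keys. The thread does not confirm that the node splits a list field by itself, so the list may need to be parsed into separate columns first.

```python
def known_keys(rec):
    """Collect every key attached to a record: its key_id plus its key_list entries."""
    keys = set()
    if rec.get("key_id"):
        keys.add(rec["key_id"].strip())
    # Assumption: key_list is a comma-separated string such as "100,200".
    for part in (rec.get("key_list") or "").split(","):
        if part.strip():
            keys.add(part.strip())
    return keys


def source_key_match(rec_a, rec_b):
    """Match only within the same source, on any shared key."""
    if not rec_a.get("file_name") or rec_a.get("file_name") != rec_b.get("file_name"):
        return False
    return bool(known_keys(rec_a) & known_keys(rec_b))


# Hypothetical records.
existing = {"file_name": "crm", "key_id": "100", "key_list": "100,200"}
incoming = {"file_name": "crm", "key_id": "200", "key_list": ""}
other_source = {"file_name": "erp", "key_id": "200", "key_list": ""}

print(source_key_match(existing, incoming))      # True: 200 is in existing's key list
print(source_key_match(existing, other_source))  # False: same key, different source
```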

beamer108 · Posted 07-12-2022 12:13 PM

Thank you

We actually do have a source identifier, labeled 'File Name', and we look for records where the File Name and Key ID match those of a new record. Is that what you mean?

VincentRejany · Posted 07-12-2022 12:59 PM

Looking at your rules, I think you can keep only two (rules 1 and 3). Rules 2 and 4 are redundant, because any records that match on R2 or R4 already match on R1 or R3.
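
One way to check this kind of redundancy is sketched below in Python. It assumes the four rules are the ones listed in the original question, in that order, and that each rule requires exact agreement on all of its fields. Under those assumptions, a rule whose field set strictly contains another rule's field set can never link a pair that the smaller rule does not already link. The field names are hypothetical stand-ins.

```python
def redundant_rules(rules):
    """Return (rule, covering_rule) pairs where rule is stricter than covering_rule.

    A rule that requires a strict superset of another rule's fields
    matches only pairs the other rule already matches.
    """
    redundant = []
    for name, fields in rules.items():
        for other_name, other_fields in rules.items():
            if other_name != name and set(other_fields) < set(fields):
                redundant.append((name, other_name))
                break
    return redundant


# Assumed mapping of rule numbers to the conditions described in the question.
rules = {
    "R1": {"key_id", "first_name", "file_name"},
    "R2": {"key_id", "key_list", "first_name", "file_name"},
    "R3": {"key_id", "last_name", "file_name"},
    "R4": {"key_id", "key_list", "last_name", "file_name"},
}

print(redundant_rules(rules))  # [('R2', 'R1'), ('R4', 'R3')]
```

This reasoning applies to plain "all fields equal" conditions; it would need revisiting if a rule used cross matching or fuzzy match codes that change which pairs qualify.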

beamer108 · Posted 07-12-2022 01:10 PM

Thank you.
