Solved: Re: Question about Cluster Diff node

george7899 · Posted 08-18-2019 04:02 PM

I am having a hard time understanding the following:

If a record in the left table does not have a matching record in the right table, it is marked as "deleted". If a record in the right table does not have a matching record in the left, it is marked as "added".

1. What does it mean for a record to be marked as "deleted" or "added"? Does the Cluster Diff report get a new field holding a "deleted" or "added" flag? If so, what is the purpose of this?

2. It looks like DataFlux doesn't let me specify a match definition. So how does DataFlux determine whether a record in the left table has a matching record in the right table, and vice versa?

3. The diff types Combine, Divide, Same and Network are typically shown in a Cluster Diff report. Apart from indicating whether the records are in the same cluster in the left and right tables under the same Diff Set, what else do they tell us?

Thanks!

RonAgresta · Posted 08-20-2019 07:13 AM

The Cluster Diff node operates on two sets of matched ("clustered") records. Normally, this node is used to test changes to clusters from one run to the next. You may be testing matching rules and want to see how the different rules impact the membership of records in clusters, or you may use the same match rules from one run to the next, but your input set of records changed. Usually the change is additional records in the second run, as if an operational system in your organization is gaining more customers, orders, etc. over time. The key to all of this is the cluster identifiers that, when identical, indicate that per your matching criteria, the records are the same (or almost the same based on fuzzy matching). The second important part is that each record has a stable and unique row identifier.

1. Deleted means that a record that was part of a cluster in the first set (or first run) is no longer present in that cluster in the second run. If a record is new to a cluster in the second run, it is marked as added.

2. You specify match criteria before the Cluster Diff node, not in it. Usually a job takes input data, uses a Match Code node to generate match codes for the fields that should be fuzzy matched, and then uses a Clustering node to bring potential matches together. The Clustering node is where you specify match criteria, and it is this last step that generates the cluster IDs indicating which records are the same. The cluster IDs produced by the Clustering node are the input to the Cluster Diff node.

3. Combine means that from one run to the next, two or more clusters collapsed into one cluster. If, for example, you specified an OR match condition in your matching rules, a match rule could unite two or more clusters from one run to the next through a satisfied OR condition. Same means the cluster contents are identical from one run to the next. Network means that from one run to the next, the contents of a cluster were changed by records both moving in and moving out. You will not normally see this when using the same match conditions from one run to the next, but you could see it if you change match conditions altogether (which is more common in the scenario described above, where you are testing matching rules).
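To make the idea concrete, here is a rough Python sketch of how such a comparison could be reasoned about. It is not DataFlux code and not necessarily how the Cluster Diff node is implemented. The assumptions are mine: each run is a dict mapping a stable row ID to a cluster ID, records are paired between runs by row ID, clusters that share records across runs are grouped into one diff set, and Divide is taken to be the reverse of Combine. The function names `record_flags` and `diff_sets` are made up for illustration.

```python
from collections import defaultdict


def record_flags(left, right):
    """Flag row IDs that appear in only one of the two runs."""
    return {
        "deleted": sorted(set(left) - set(right)),  # in left run only
        "added": sorted(set(right) - set(left)),    # in right run only
    }


def diff_sets(left, right):
    """Group clusters that share records across runs and label each group.

    Only records present in both runs are used for grouping; added and
    deleted records are reported separately by record_flags().
    """
    common = set(left) & set(right)

    # Link each left cluster to every right cluster it shares a record with.
    links = defaultdict(set)
    for row in common:
        l_node, r_node = ("L", left[row]), ("R", right[row])
        links[l_node].add(r_node)
        links[r_node].add(l_node)

    seen, results = set(), []
    for start in sorted(links):
        if start in seen:
            continue
        # Walk the connected group of clusters reachable from this one.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(links[node] - component)
        seen |= component

        n_left = sum(1 for side, _ in component if side == "L")
        n_right = len(component) - n_left
        if n_left == 1 and n_right == 1:
            kind = "same"
        elif n_right == 1:
            kind = "combine"   # several left clusters -> one right cluster
        elif n_left == 1:
            kind = "divide"    # assumed: one left cluster -> several right
        else:
            kind = "network"   # records moved both in and out
        results.append((kind, sorted(component)))
    return results


# Hypothetical data: row ID -> cluster ID for each run.
left_run = {1: 10, 2: 10, 3: 11, 4: 12, 5: 13, 6: 13, 7: 14}
right_run = {1: 20, 2: 20, 3: 20, 4: 21, 5: 22, 6: 23, 8: 24}

flags = record_flags(left_run, right_run)
kinds = [kind for kind, _ in diff_sets(left_run, right_run)]
```

With this sample data, row 7 is flagged as deleted and row 8 as added, left clusters 10 and 11 combine into right cluster 20, cluster 12 stays the same as cluster 21, and cluster 13 divides into clusters 22 and 23. The sketch also shows why the stable row ID matters: without it there is no way to tell which record in the second run corresponds to which record in the first.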

Ron

george7899 · Posted 08-23-2019 01:13 AM

Thanks Ron. Very excellent answers. All clear now.
