george7899
Fluorite | Level 6

I am having a hard time understanding the following: 

 

If a record in the left table does not have a matching record in the right table, it is marked as "deleted". If a record in the right table does not have a matching record in the left, it is marked as "added".

 

1. What does it mean for a record to be marked as "deleted" or "added"? Does it mean a new field is added to the Cluster Diff report with a "deleted" or "added" flag? If so, what is the purpose of this?

 

2. It looks like DataFlux doesn't let me specify a match definition. So how does DataFlux determine whether a record in the left table has a matching record in the right table, and vice versa?

 

3. The diff types Combine, Divide, Same and Network are typically shown in a Cluster Diff report. Apart from indicating whether the records are in the same cluster in the left and right tables under the same Diff Set, what else do they tell us?

 

Thanks!

Accepted Solutions
RonAgresta
SAS Employee

The Cluster Diff node operates on two sets of matched ("clustered") records. Normally, this node is used to test changes to clusters from one run to the next. You may be testing matching rules and want to see how the different rules impact the membership of records in clusters, or you may use the same match rules from one run to the next, but your input set of records changed. Usually the change is additional records in the second run, as if an operational system in your organization is gaining more customers, orders, etc. over time. The key to all of this is the cluster identifiers that, when identical, indicate that per your matching criteria, the records are the same (or almost the same based on fuzzy matching). The second important part is that each record has a stable and unique row identifier.

 

  1. Deleted means that a record that was part of a cluster in the first set (or first run) is no longer present in that cluster in the second run. If a record is new to a cluster in the second run, it is marked as added.
  2. You specify match criteria before using the Cluster Diff node. Usually a job takes input data, uses a Match Code node to generate match codes for the records that ought to be fuzzy matched, and then uses a Clustering node to actually bring potential matches together. The Clustering node is where you specify match criteria, and it is this last step that generates the cluster IDs indicating which records are the same. The cluster IDs produced by the Clustering node are the input to the Cluster Diff node.
  3. Combine means that from one run to the next, two or more clusters collapsed into one cluster. For example, if you specified an OR match condition in your matching rules, a satisfied OR condition could unite two or more clusters from one run to the next. Same means the cluster contents are identical from one run to the next. Network means that from one run to the next, the contents of a cluster were changed both by records moving in and by records moving out. You will not normally see Network when using the same match conditions from one run to the next, but you could see it if you change match conditions altogether (which would be more common in the scenario described above where you are testing matching rules).
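
To make the three points above concrete, here is a minimal Python sketch of the idea. It is not DataFlux code and not how the Cluster Diff node is implemented; it only illustrates comparing two runs by row identifier and cluster ID. The function name `cluster_diff`, the dict-based inputs, and the way diff sets are grouped are all assumptions made for illustration. The "divide" label is assumed here to be the mirror image of Combine (one cluster splitting into several), which the answer above does not spell out, and "network" is approximated as several clusters on both sides exchanging records.

```python
from collections import defaultdict

def cluster_diff(left, right):
    """Compare two clustering runs.

    left / right: dict mapping a stable row ID -> cluster ID for each run.
    Returns (added, deleted, diff_sets). Hypothetical helper, not a DataFlux API.
    """
    # Rows only in the left run are "deleted"; rows only in the right are "added".
    deleted = sorted(set(left) - set(right))
    added = sorted(set(right) - set(left))
    shared = set(left) & set(right)

    # Link each left cluster to the right clusters it shares rows with.
    l2r, r2l = defaultdict(set), defaultdict(set)
    for row in shared:
        l2r[left[row]].add(right[row])
        r2l[right[row]].add(left[row])

    diff_sets, seen = [], set()
    for start in sorted(l2r):
        if start in seen:
            continue
        # Walk the links to collect every cluster connected to this one.
        lefts, rights, stack = set(), set(), [("L", start)]
        while stack:
            side, cid = stack.pop()
            group = lefts if side == "L" else rights
            if cid in group:
                continue
            group.add(cid)
            links = l2r[cid] if side == "L" else r2l[cid]
            stack.extend(("R" if side == "L" else "L", c) for c in links)
        seen |= lefts

        if len(lefts) == 1 and len(rights) == 1:
            kind = "same"
        elif len(rights) == 1:
            kind = "combine"   # several left clusters collapsed into one
        elif len(lefts) == 1:
            kind = "divide"    # assumed: one left cluster split into several
        else:
            kind = "network"   # records moved both in and out
        diff_sets.append((kind, sorted(lefts), sorted(rights)))
    return added, deleted, diff_sets
```

For example, with `left = {1: "A", 2: "A", 3: "B", 4: "C", 5: "C", 6: "D"}` and `right = {1: "X", 2: "X", 3: "X", 4: "Y", 5: "Z", 7: "W"}`, row 6 is reported as deleted, row 7 as added, clusters A and B form a "combine" diff set with X, and cluster C forms a "divide" diff set with Y and Z. Note that this sketch classifies diff sets using only the rows present in both runs.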

Ron

george7899
Fluorite | Level 6
Thanks Ron. Very excellent answers. All clear now.
