Match Codes, Clustering, Surviving Record Identification in Viya 3.4… There’s a CAS Action for that

6 Likes

Do you remember the commercial about the iPhone 3G and its wonderful slogan “There’s an app for that”?

When I teach SAS Viya during workshops, I face lots of interesting questions regarding CAS. And one of my favorite answers is “there’s a CAS action for that”. Yes, there’s a CAS action for just about… anything.

There are hundreds of CAS actions, probably close to 900!

When it comes to data quality operations, one can start to find helpful CAS actions to perform common tasks. Let’s review some of them using the common DQ trio: creating match codes, clustering data and identify surviving record.

Creating matching codes

In Viya 3.4, there is a “match” CAS action, part of the new entityRes CAS action set. I thought initially that this CAS action would have helped me create match codes for some variables. But that's not the case.

The “match” CAS action is aimed to perform basic clustering based on the actual value of some fields (as opposed to match code values), sort of by-group processing. It will help on the second step but not on the match code creation.

So, today, the only option to create match codes on a CAS table is to use the data step with the Data Quality functions (DQ functions documentation). And you remember, there’s a CAS action for that.

Here is an example (I voluntarily chose to show the use of the dataStep.runCode CAS action; keep in mind that when you run a standard data step in CAS from a SAS client, it also runs a “hidden” CAS action behind the scenes called dataStep.runBinary😞

proc cas ;                                               
   datastep.runCode submit /
      code="data dm.customers_dq(copies=0 replace=yes) ;
               set dm.customers ;
               length nameSTD addressSTD nameMC addressMC varchar(100) ;
               nameSTD=dqStandardize(name,'Name','ENUSA');
               addressSTD=dqStandardize(address,'Address','ENUSA');
               nameMC=dqMatch(nameSTD,'Name',85,'ENUSA') ;   
               addressMC=dqMatch(addressSTD,'Address',85,'ENUSA') ;   
            run ;" ;                             
quit ;

Clustering

OK, I have my match codes. I want to cluster my customers based on their match codes for the name and address fields. How could I do that?

I can certainly use the “match” CAS action (documented here), as shown below:

proc cas ;
   entityRes.match   /
      clusterId="ClusterId"
      columns={"id","name","address","updatedate","nameMC","addressMC"}
      inTable={caslib="dm" name="customers_dq"}
      matchRules={{rule={{columns={"nameMC","addressMC"}}}}}
      outTable={caslib="dm" name="customers_dq_c",replace=true} ;
quit ;

Here, the CAS action:

reads the customers_dq table that contains the match codes (inTable= option)
selects a couple of columns, including the columns that will be used in my clustering condition (columns= option)
clusters rows based on the nameMC and addressMC columns (matchRules= option)
affects a unique identifier to each cluster or group of related rows to the ClusterId column (clusterId= option) – this identifier is of type CHAR where the value is a 24 character ASCII string, rather than an integer
creates an output table called customers_dq_c (outTable= option)

That does the job pretty well:

But I’d like to introduce a new CAS action (in Viya 3.4) that is really useful and powerful regarding by-groups management. It is called “groupByInfo”.

groupByInfo helps you:

identify by-groups by assigning a numeric and consecutive group id to the by-groups
get additional metadata on the by-groups, like the group frequencies, the position of each record within the by-group
remove duplicate records
“prototype data”, i.e. reduce the number of records inside a by-group while keeping cardinality

It does not require performing computations on some numeric columns. Moreover, you can easily add fields to your output without requiring them to be part of the by-groups.

Let’s have a look at this CAS action:

proc cas ;
   simple.groupByInfo /
      table={caslib="dm",name="customers_dq",groupBy={"nameMC","addressMC"}}
      copyVars={"id","name","address","updatedate"}
      casOut={caslib="dm",name="customers_dq_grpBy",replace=true}
      includeDuplicates=true
      generatedColumns={"FREQUENCY","GROUPID","POSITION"}
      details=true ;
quit ;

In this CAS action:

We specify the source table and the by-group variables (table= option)
We add some more variables from the source table to the target table (copyVars= option) – those variables do not contribute to the by-groups
We define the target table (casOut= option)
We want to keep duplicates in the output table (includeDuplicates= option)
We want some additional and important metadata (generatedColumns= option):
- FREQUENCY: number of rows in the by-group
- GROUPID: numeric and consecutive unique identifier of the by-group
- POSITION: position of the record within the by-group
Write some messages in the log, such as the number of groups on each CAS worker (details= option)

This CAS action has many more options. Here is a sample output of this code:

The first 3 columns are very interesting to do further data manipulation. This CAS action also provides several options to control the output of the table. For example:

Only output clusters that have more (or less) than N records (_frequency_)
Only outputs records that have a specific position, or between 2 positions (_position_)
Choose which position you want to keep, in case you want to remove duplicates

You can also do that by using basic filters. It’s very flexible.

But what is very interesting is how this CAS action works behind the scenes:

No data is moved across the workers; there is no on-the-fly partitioning
There is no sorting
The copyVars list does not affect performance; you can have many variables defined here, it won’t penalize your performance
Output data distribution is equivalent to the input data distribution; each CAS worker writes its own data blocks locally; if the input table is partitioned, the output table keeps the same organization

How is it possible?

There are 2 passes of the data, the first one for computing the by-groups information at the thread level, the second for setting the final numbers and writing the output table
Between the 2 passes, information from each CAS worker thread is sent to the controller that performs the global computations before sending this back to the workers for the second pass

Very clever.

This CAS action goes beyond simple DQ clustering capabilities. It can help in many situations where you need to know the by-group structure, remove duplicates, build dimension tables, sample data, etc.

It does have limitations like the maximum number of threads it can use, or some performance issues with high-cardinality by-groups.

I encourage you to look at this CAS action’s documentation.

Surviving Record Identification

As I mentioned earlier, we can take advantage of the groupByInfo CAS action to remove duplicates using simple rules, in addition to the clustering phase (in one single call):

proc cas ;
   simple.groupByInfo /
      table={caslib="dm",name="customers_dq",groupBy={"nameMC","addressMC"}}
      copyVars={"id","name","address","updatedate"}
      casOut={caslib="dm",name="customers_dq_grpBy",replace=true}
      position=3
      minFrequency=2
      generatedColumns={"FREQUENCY","GROUPID","POSITION"}
      details=true ;
quit ;

Here:

I only work with multi-record clusters (minFrequency=2)
I only want to output records with _position_=3 (if the cluster has less than 3 records, the record with _position_=1 will be automatically output)

If you want to have more control in the rules used to identify the survivor record, you can revert back to the data step CAS action:

proc cas ;                                               
   datastep.runCode submit /
      code="data dm.multiobsclusters(copies=0 replace=yes) ;
               set dm.customers_dq_grpby(drop=nameMC addressMC where=(_frequency_>=2)) ;
               keep_flag=0 ;
               by _groupid_ updatedate ;
               if last._groupid_ then keep_flag=1 ;
            run ;" ;                             
quit ;

Here:

I read the multi-record clusters from the output of my clustering phase output table
I flag the records I want to keep, rather than remove duplicates
I want to keep the records with the latest update date

Obviously, you can create much more complex rules.

That’s enough for today. Here is a link to the documentation:

Viya 3.4 CAS Actions documentation

And don’t forget: there’s a CAS action for just about… anything.