Do you remember the commercial about the iPhone 3G and its wonderful slogan “There’s an app for that”?
When I teach SAS Viya during workshops, I face lots of interesting questions regarding CAS. And one of my favorite answers is “there’s a CAS action for that”. Yes, there’s a CAS action for just about… anything.
There are hundreds of CAS actions, probably close to 900!
When it comes to data quality operations, one can start to find helpful CAS actions to perform common tasks. Let’s review some of them using the common DQ trio: creating match codes, clustering data and identify surviving record.
In Viya 3.4, there is a “match” CAS action, part of the new entityRes CAS action set. I thought initially that this CAS action would have helped me create match codes for some variables. But that's not the case.
The “match” CAS action is aimed to perform basic clustering based on the actual value of some fields (as opposed to match code values), sort of by-group processing. It will help on the second step but not on the match code creation.
So, today, the only option to create match codes on a CAS table is to use the data step with the Data Quality functions (DQ functions documentation). And you remember, there’s a CAS action for that.
Here is an example (I voluntarily chose to show the use of the dataStep.runCode CAS action; keep in mind that when you run a standard data step in CAS from a SAS client, it also runs a “hidden” CAS action behind the scenes called dataStep.runBinary😞
proc cas ;
datastep.runCode submit /
code="data dm.customers_dq(copies=0 replace=yes) ;
set dm.customers ;
length nameSTD addressSTD nameMC addressMC varchar(100) ;
nameSTD=dqStandardize(name,'Name','ENUSA');
addressSTD=dqStandardize(address,'Address','ENUSA');
nameMC=dqMatch(nameSTD,'Name',85,'ENUSA') ;
addressMC=dqMatch(addressSTD,'Address',85,'ENUSA') ;
run ;" ;
quit ;
OK, I have my match codes. I want to cluster my customers based on their match codes for the name and address fields. How could I do that?
I can certainly use the “match” CAS action (documented here), as shown below:
proc cas ;
entityRes.match /
clusterId="ClusterId"
columns={"id","name","address","updatedate","nameMC","addressMC"}
inTable={caslib="dm" name="customers_dq"}
matchRules={{rule={{columns={"nameMC","addressMC"}}}}}
outTable={caslib="dm" name="customers_dq_c",replace=true} ;
quit ;
Here, the CAS action:
That does the job pretty well:
But I’d like to introduce a new CAS action (in Viya 3.4) that is really useful and powerful regarding by-groups management. It is called “groupByInfo”.
groupByInfo helps you:
It does not require performing computations on some numeric columns. Moreover, you can easily add fields to your output without requiring them to be part of the by-groups.
Let’s have a look at this CAS action:
proc cas ;
simple.groupByInfo /
table={caslib="dm",name="customers_dq",groupBy={"nameMC","addressMC"}}
copyVars={"id","name","address","updatedate"}
casOut={caslib="dm",name="customers_dq_grpBy",replace=true}
includeDuplicates=true
generatedColumns={"FREQUENCY","GROUPID","POSITION"}
details=true ;
quit ;
In this CAS action:
This CAS action has many more options. Here is a sample output of this code:
The first 3 columns are very interesting to do further data manipulation. This CAS action also provides several options to control the output of the table. For example:
You can also do that by using basic filters. It’s very flexible.
But what is very interesting is how this CAS action works behind the scenes:
How is it possible?
Very clever.
This CAS action goes beyond simple DQ clustering capabilities. It can help in many situations where you need to know the by-group structure, remove duplicates, build dimension tables, sample data, etc.
It does have limitations like the maximum number of threads it can use, or some performance issues with high-cardinality by-groups.
I encourage you to look at this CAS action’s documentation.
As I mentioned earlier, we can take advantage of the groupByInfo CAS action to remove duplicates using simple rules, in addition to the clustering phase (in one single call):
proc cas ;
simple.groupByInfo /
table={caslib="dm",name="customers_dq",groupBy={"nameMC","addressMC"}}
copyVars={"id","name","address","updatedate"}
casOut={caslib="dm",name="customers_dq_grpBy",replace=true}
position=3
minFrequency=2
generatedColumns={"FREQUENCY","GROUPID","POSITION"}
details=true ;
quit ;
Here:
If you want to have more control in the rules used to identify the survivor record, you can revert back to the data step CAS action:
proc cas ;
datastep.runCode submit /
code="data dm.multiobsclusters(copies=0 replace=yes) ;
set dm.customers_dq_grpby(drop=nameMC addressMC where=(_frequency_>=2)) ;
keep_flag=0 ;
by _groupid_ updatedate ;
if last._groupid_ then keep_flag=1 ;
run ;" ;
quit ;
Here:
Obviously, you can create much more complex rules.
That’s enough for today. Here is a link to the documentation:
And don’t forget: there’s a CAS action for just about… anything.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.