Custom Task Tuesday: Segmentation Modeling (All in One Analytic Task)

This post is another in a series of posts leading up to SAS Global Forum 2019. My colleagues, Elliot Inman and Ryan West, and I wrote a paper titled Kustomizing Your SAS Viya Engine Using SAS Studio Custom Tasks and D3.js. Custom Task Tuesday readers will get to preview the tasks associated with the paper, before the paper comes out! Once the paper does come out, I will add a link to it here.

This is the second task and article related to our paper. Check out the post for the first task if you're interested.

Vehicle Data

In our paper, we used data about cars collected by the United States Environmental Protection Agency (EPA). The EPA regularly tests new vehicles for fuel efficiency and emissions. The data can be downloaded as a CSV from FuelEconomy.gov. The data include vehicle make / model / year and detailed miles-per-gallon (MPG) and emissions test results. Check out the data dictionary here for more details.

The SAS data set we used in our analysis is available for download on the Task Tuesday GitHub. It includes only the variables that we used, along with variable labels.

All-in-One Modeling Task

There are several built-in tasks in SAS Studio that each run a different supervised or unsupervised learning model. For example, you can open the SAS Viya Unsupervised Learning task “Clustering” and select your dataset and input variables and run that model. Then, you can open the “Decision Tree” task, select your data set and input variables and run that model. Then, you can open the “Forest” task and… you get the idea.

If you have a set of analytic processes that you want to run over and over on the same data set, you can combine them into one task and make those selections only once. This week’s task is an example of such an all-in-one analytic modeling task. The example shows results for the cars data from the EPA, but any data could be used.

Here’s what the task looks like:

[Screenshot: seg model.png]

This task runs a clustering analysis, runs a decision tree to get the variable importance, and runs a forest model to see how well the cluster ID predicts a chosen target metric. The process is repeated iteratively, increasing the maximum number of clusters each time. The complicated part of the task is the SAS code; the task interface itself is quite simple (no dependencies, just role selectors, numsteppers, text boxes, and a check box).
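To make the loop concrete, here is a rough sketch of what one pass per cluster count could look like. It is not the code from the task (download the task for that). The macro name, its parameters, and the output table names are hypothetical, and it assumes an active CAS session, interval inputs and target, and a decision tree that models the cluster ID from the input variables.

```sas
/* Sketch only: %seg_sketch, its parameters, and the clus_/varimp_ table  */
/* names are hypothetical. Assumes an active CAS session and that &data   */
/* is a CAS table containing the interval variables in &inputs and &target. */
%macro seg_sketch(data=, inputs=, target=, mink=2, maxk=4);
   %local k;
   %do k = &mink %to &maxk;

      /* 1. Cluster the inputs, allowing at most &k clusters.            */
      /*    The scored table carries the cluster ID in _CLUSTER_ID_.     */
      proc kclus data=&data maxclusters=&k;
         input &inputs;
         score out=casuser.clus_&k copyvars=(&inputs &target);
      run;

      /* 2. Decision tree on the cluster ID (an assumption about what    */
      /*    is modeled) to capture variable importance.                  */
      ods output VariableImportance=work.varimp_&k;
      proc treesplit data=casuser.clus_&k;
         class _cluster_id_;
         model _cluster_id_ = &inputs;
      run;

      /* 3. Forest using only the cluster ID to predict the target. */
      proc forest data=casuser.clus_&k;
         input _cluster_id_ / level=nominal;
         target &target / level=interval;
      run;

   %end;
%mend seg_sketch;
```

A call would pass a CAS table, a space-separated list of input variables, and a target variable. The actual task also stacks the results from each run into the output tables listed below, which this sketch leaves out.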

Running this task produces six data sets:

CASUSER.Clusters: Output of all clustering runs in a wide format
CASUSER.Sample: Sample (from CASUSER.Clusters) of 1000 observations from each clustering run
CASUSER.ClusterTall: Output of all clustering runs transformed to be tall format
CASUSER.VarImportance: Output of Decision Tree variable importance for all clustering runs
CASUSER.ForestClus: Output of all forest runs using cluster ID to predict the target variable
CASUSER.ForestVars: Output of forest run using input variables to predict the target variable

Promote to VA Checkbox

The part of this task that I want to highlight, because it will be useful to other task authors, is the “Promote to VA” checkbox. When it is checked, the task copies all of the output tables to the PUBLIC caslib, which makes them available for use in Visual Analytics reports. The “promote=yes” data set option is added so that each table persists beyond the current CAS session. For a deeper explanation of CAS table promotion, see this paper by Mike Drutar: Just Enough SAS® Cloud Analytic Services: CAS Actions for SAS® Visual Analytics Report Developers.
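For comparison, a similar drop-and-promote step can be sketched with PROC CASUTIL. This is an alternative illustration, not what the task does (the task uses a DATA step with promote=yes). It assumes an active CAS session with write access to the PUBLIC caslib, and it reuses the CASUSER.Clusters table and the clusterwide name from this task.

```sas
/* Sketch only: an alternative to the DATA step approach, not the task's code. */
/* Assumes an active CAS session with write access to the PUBLIC caslib.       */
proc casutil;
   /* Drop any earlier global copy; QUIET suppresses the error if none exists. */
   droptable casdata="clusterwide" incaslib="public" quiet;

   /* Promote the session table to global scope in PUBLIC as clusterwide. */
   promote casdata="clusters" incaslib="casuser"
           outcaslib="public" casout="clusterwide";
quit;
```

In both approaches the existing global table is dropped first, because a promoted table cannot simply be overwritten by a new one with the same name.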

The Metadata code for the VA checkbox is here:

<Option name="GROUPVA" inputType="string">VISUAL ANALYTICS PROMOTION</Option>
<Option name="labelVA" inputType="string">Promoting the output data sets to the PUBLIC 
caslib will make them available for use in Visual Analytics. </Option>
<Option name="chkVA" defaultValue="0" inputType="checkbox">Promote to VA</Option>

The UI code for the VA checkbox is here:

<Group option="GROUPVA" open="true">
    <OptionItem option="labelVA"/>
    <OptionItem option="chkVA"/>
</Group>

And finally, the Code Template portion for the VA checkbox is here:

#if ($chkVA == 1)

proc datasets lib=public; delete clusterwide; run;
data public.clusterwide (promote=yes);
       set casuser.clusters;
run;

proc datasets lib=public; delete sample; run;
data public.sample (promote=yes);
       set casuser.sample;
run;

proc datasets lib=public; delete clustertall; run;
data public.clustertall (promote=yes);
       set casuser.clustertall;
run;

proc datasets lib=public; delete varimportance; run;
data public.varimportance (promote=yes);
       set casuser.varimportance;
run;

proc datasets lib=public; delete forestclus; run;
data public.forestclus (promote=yes);
       set casuser.forestclus;
run;

proc datasets lib=public; delete forestvars; run;
data public.forestvars (promote=yes);
       set casuser.forestvars;
run;

#end
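The six delete-and-promote blocks follow the same pattern, so they could be collapsed into a small macro. The sketch below is one way to do that, not part of the downloadable task. The macro name %promote_to_public and its src= and dest= parameters are hypothetical, and it assumes the PUBLIC and CASUSER librefs are already assigned, as in the template above.

```sas
/* Sketch only: %promote_to_public is a hypothetical helper, not task code. */
%macro promote_to_public(src=, dest=);
   /* Remove any earlier copy so the promoted table can be recreated. */
   proc datasets lib=public nolist;
      delete &dest;
   quit;

   /* Copy from CASUSER and promote so the table outlives this CAS session. */
   data public.&dest (promote=yes);
      set casuser.&src;
   run;
%mend promote_to_public;

/* Same six tables as the template above. */
%promote_to_public(src=clusters,      dest=clusterwide);
%promote_to_public(src=sample,        dest=sample);
%promote_to_public(src=clustertall,   dest=clustertall);
%promote_to_public(src=varimportance, dest=varimportance);
%promote_to_public(src=forestclus,    dest=forestclus);
%promote_to_public(src=forestvars,    dest=forestvars);
```

In a task, the macro definition and the six calls would still sit inside the #if ($chkVA == 1) ... #end block so that nothing is promoted unless the box is checked.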

Download the task from the Custom Task Tuesday GitHub to view all of the code. Can any of your tasks make use of the “Promote to VA” checkbox?

Take Me to GitHub!

Join the Conversation on Twitter

Use the hashtag #CustomTaskTuesday and tweet @OliviaJWright with your Custom Task comments and questions!
