
Data-Driven Analytics in SAS Viya – Clustering


In today’s post I will continue to show you how easy it is to perform data-driven analytics in SAS Viya. This is the second in a series of posts that use the statistics and machine learning objects in SAS Visual Analytics to address real-world business problems. Using SAS Viya, we will continue to focus on the first two parts of the AI and Analytics lifecycle: managing data and developing models. Since we covered many highlights of managing data in the previous post, today we will move on to developing models.

 

01_AR_Lifecycle.png


 

While there are many types of machine learning (including semi-supervised and reinforcement learning), we will focus on the supervised and unsupervised methods that are currently available in SAS Visual Analytics. The main difference between the two sets of methods is the presence or absence of a known output, or target, variable.

 

Unsupervised classification does not rely on having a target variable in the data (also known as unlabeled data); it focuses only on finding structures or patterns within the input data. Clustering and dimension reduction are two prime tasks accomplished through unsupervised learning models. Supervised classification, on the other hand, involves learning from a dataset where the target variable, or output, is known. The goal of supervised classification is to learn the relationship between the input and output variables. We typically feed a target along with several input (or descriptor) variables into a model that can then perform classification or prediction. With supervised models, the target variable can be a class variable (e.g., a binary outcome such as loan default, yes or no) or a continuous number (e.g., dollars spent). Common supervised learning algorithms include regressions, decision trees, and neural networks.
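To make the distinction concrete, here is a minimal Python sketch using scikit-learn (purely an illustration, not anything running inside SAS Viya): the supervised model is handed both the inputs and a labeled target, while the unsupervised model sees the inputs alone. The data here is synthetic and the column count is arbitrary.

```python
# Conceptual contrast between supervised and unsupervised learning,
# shown with scikit-learn on synthetic data (not the SAS implementation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                          # three numeric inputs
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # a known binary target

# Supervised: the model learns the mapping from the inputs X to the labeled target y.
clf = LogisticRegression().fit(X, y)
print("Predicted classes:", clf.predict(X[:5]))

# Unsupervised: no target is supplied; the model looks for structure in X alone.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Assigned clusters:", km.labels_[:5])
```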

 

Let’s begin our journey into the model building phase of the analytics life cycle by examining the unsupervised method of clustering.

 

02_AR_GenricClustering.png

 

Three common uses of clustering are finding groups with similar characteristics, building product recommendation systems, and detecting anomalies. Marketing strategies often involve finding clusters of customers with different product affinities or product usage patterns. Clustering groups with similar characteristics allows marketers to label those clusters and potentially find new sales. Products often fall into clusters of items that are frequently purchased together; we’ve all seen this when making online purchases and suggested products are offered for our “check-out basket.” Finally, clustering for anomaly detection makes it easy to identify which records fall outside of all identifiable clusters (see the sketch following the figure below). Those outliers could represent financial fraud, disease, or any other type of anomaly.

 

03_AR_Anomaly-300x175.png
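As a rough illustration of the anomaly-detection idea (again with scikit-learn rather than SAS), the sketch below fits k-means on synthetic data and flags the records that sit unusually far from every cluster centroid. The data, the planted outlier, and the threshold are all made up for the example.

```python
# Clustering-based anomaly detection sketch: flag records whose distance
# to the nearest k-means centroid is unusually large. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(300, 2)) for c in (0, 4)])
X = np.vstack([X, [[10.0, 10.0]]])                    # one obvious planted outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.min(km.transform(X), axis=1)    # distance to the nearest center

threshold = np.percentile(dist_to_centroid, 99.5)     # arbitrary cut-off for the demo
print("Flagged rows:", np.where(dist_to_centroid > threshold)[0])
```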

 

In our clustering example, we are data scientists attempting to use a variable annuity data table named develop_final to better understand our customers. I’m going to continue to use the same data that I introduced in the previous post. The develop_final table contains just over 32,000 banking customers, with input variables that reflect both demographic information and product usage captured over a three-month period. The target variable, Ins, is binary: a value of 1 indicates the customer purchased a variable annuity product and a 0 indicates they did not. Please note that I have already performed some data clean-up (including binning, transforming, and imputation) and variable selection (using variance explained), so we are ready to perform model building. If you’re interested in seeing some of those techniques performed on the develop table, please see Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio.
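If you’d like to poke at a table like this outside of SAS, a quick pandas sketch might look like the following. It assumes a hypothetical CSV export named develop_final.csv and guesses at column names (Ins, Age, CreditScore, HomeValue), so treat it only as a pattern, not as the actual table layout.

```python
# A quick look at the table outside of SAS, assuming a hypothetical CSV export
# of DEVELOP_FINAL with columns named Ins, Age, CreditScore, and HomeValue.
import pandas as pd

develop = pd.read_csv("develop_final.csv")               # hypothetical export path

print(develop.shape)                                     # expect roughly 32,000 rows
print(develop["Ins"].value_counts(normalize=True))       # share of annuity purchasers
print(develop[["Age", "CreditScore", "HomeValue"]].describe())
```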

 

04_AR_DevelopColumnsValues.png

 

From SAS Drive, we open the Visual Analytics web application by clicking the Applications menu icon and selecting Explore and Visualize. From the Explore and Visualize window, we click New report. In the left-hand Data pane, select Add data, find and select the DEVELOP_FINAL table, and then select Add.

 

05_AR_DataPane.png

 

With the data already cleaned and prepped for model building, we are ready to create our clusters. On the left, we’ll change from the Data pane to the Objects pane by selecting Objects. From this pane we can scroll down until we find the Statistics objects. From there we can either double-click the Cluster object or drag and drop it onto the first page of the report.

 

06_AR_SelectClusterObject.png

 

Before we assign data to roles for this cluster object, note that by default 5 clusters will be created from the selected data, using the standard deviation to standardize the measure variables to a similar scale. Standardization is very important when clustering because inputs with much larger scales will dominate the distance calculations and bias the results. I’ve found that using the range for standardization tends to give better separation in the two-dimensional cluster output that we will examine shortly, so select Range under Standardization in the right-hand Options pane. Even though the default of 5 clusters might be fine for our analysis, I’m going to select Automatic (Aligned box criterion) under Number of clusters. This is a great option when we have no idea how many clusters are appropriate for our data. The aligned box criterion (ABC) method estimates the number of clusters based on the principal components of the input data.
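To see why the standardization choice matters, here is a small Python sketch using scikit-learn scalers as stand-ins for the two options: StandardScaler divides by the standard deviation, while MinMaxScaler rescales by the range. The variable names and value ranges are assumptions chosen to mimic Age and Home Value.

```python
# Sketch of the two standardization choices discussed above, using scikit-learn
# scalers as stand-ins: StandardScaler (standard deviation) vs. MinMaxScaler (range).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)
age        = rng.uniform(16, 94, 1000)                  # values in the tens
home_value = rng.uniform(50_000, 900_000, 1000)         # values in the hundreds of thousands
X = np.column_stack([age, home_value])

# Without standardization, HomeValue dominates Euclidean distances because
# its raw spread is roughly 10,000 times larger than Age's.
print("Raw standard deviations:", X.std(axis=0))

X_std   = StandardScaler().fit_transform(X)   # standard-deviation standardization
X_range = MinMaxScaler().fit_transform(X)     # range standardization (0 to 1)
print("Std-scaled spreads:  ", np.ptp(X_std, axis=0))
print("Range-scaled spreads:", np.ptp(X_range, axis=0))
```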

 

07_AR_ClusterOptions.png

 

In the center of the cluster object on the page, select Assign data and Add: Age, Credit Score, and Home Value. Then select Apply and then Close. In the Options pane on the right, scroll down and select ABC Statistics under Model Display -> General -> Displayed visuals. This will allow us to view the calculations that resulted in the selection of 3 clusters for our data. Also, under these General options, change the Plot layout from Fit (default) to Stack. This places each piece of output on a separate tab and maximizes the real estate available on the page for easier viewing.

 

08_AR_ClusterDiagram.png

 

The cluster diagram is a two-dimensional projection of the clusters onto a grid of cells, where each cell plots one pairing of the inputs. For example, the lower-left cell contains a crossing of Age and Home Value. These projections help us spot cluster similarities and differences as we view each pairing of inputs. Each cluster is assigned a unique color and cluster ID. If we examine the crossing of Age and Home Value, we can make a few observations about the clusters and the data. Cluster 1 (the large blue cluster) consists of the more expensive homes across all age ranges. Clusters 2 and 3 (yellow and purple, respectively) consist of the less expensive homes over two different age ranges: Cluster 2 covers ages from roughly 40 to 85, while Cluster 3 covers roughly 20 to 55. Even though each cluster is unique, it is not unexpected to see overlap in the cluster diagram; remember, this chart is a two-dimensional view of a three-dimensional solution. Next, let’s examine the output provided for the aligned box criterion by selecting ABC Statistics.
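As an aside, if you want to approximate the kind of pairwise view we just walked through outside of SAS Visual Analytics, a minimal matplotlib and scikit-learn sketch on synthetic stand-in data might look like the following. The distributions and column names are assumptions, not the develop_final data.

```python
# A rough stand-in for the cluster diagram: fit k-means on three inputs and
# plot each pair of inputs colored by cluster ID. Data here is synthetic.
from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(7)
data = {
    "Age":         rng.uniform(16, 94, 2000),
    "CreditScore": rng.normal(680, 60, 2000),
    "HomeValue":   rng.lognormal(12, 0.5, 2000),
}
cols = list(data)
X = np.column_stack([data[c] for c in cols])

# Range-standardize before clustering, mirroring the Range option chosen above.
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(
    MinMaxScaler().fit_transform(X))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (i, j) in zip(axes, combinations(range(3), 2)):
    ax.scatter(X[:, i], X[:, j], c=labels, s=5, cmap="viridis")
    ax.set_xlabel(cols[i])
    ax.set_ylabel(cols[j])
plt.tight_layout()
plt.show()
```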

 

09_AR_ABCStatistics.png

 

We can see that by default this calculation considers from 2 to 6 total clusters; we could change that in the Options pane if necessary. The default estimation criterion for ABC is known as the “Global peak value.” It’s clear that 3 clusters gave the maximum Gap value for our data. Finally, let’s examine the Parallel Coordinates plot.
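Before we do, a quick aside: SAS’s aligned box criterion is its own method, but to get a feel for this style of search you can compute a simple gap-statistic-like quantity over k = 2 to 6 and pick the peak. The sketch below does that with scikit-learn on synthetic data; it is a conceptual stand-in, not the ABC calculation.

```python
# Conceptual stand-in for choosing the number of clusters: a gap-statistic-style
# search over k = 2..6 that compares within-cluster dispersion in the data
# against uniform reference data. Not SAS's aligned box criterion.
import numpy as np
from sklearn.cluster import KMeans

def within_cluster_dispersion(X, k, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.inertia_  # sum of squared distances to the nearest centroid

def gap_values(X, k_range=range(2, 7), n_ref=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = {}
    for k in k_range:
        log_wk = np.log(within_cluster_dispersion(X, k, seed))
        # Reference data drawn uniformly from the bounding box of the inputs.
        ref_logs = [
            np.log(within_cluster_dispersion(rng.uniform(lo, hi, X.shape), k, seed))
            for _ in range(n_ref)
        ]
        gaps[k] = np.mean(ref_logs) - log_wk
    return gaps

rng = np.random.default_rng(3)
# Synthetic data with three well-separated groups, just to exercise the search.
X = np.vstack([rng.normal(c, 0.5, size=(300, 3)) for c in (0, 3, 6)])
gaps = gap_values(X)
print(gaps)
print("Estimated number of clusters:", max(gaps, key=gaps.get))
```

On data with three well-separated groups the gap should peak at 3; real data is rarely this clean, which is why an automatic criterion like ABC is so convenient.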

 

10_AR_ParallelCoordinatesPlot.png

 

On the far-left side of the Parallel Coordinates plot we see the same three clusters that we saw on the cluster diagram, with the same cluster IDs and colors. Along the top of the grey, binned columns we see the three inputs that were used to create the clusters: Age, Credit Score, and Home Value. Each of the columns has been divided into 10 bins that cover the data range. For example, ages range from 16 to 94, so each bin covers approximately 8 years. The header at the top shows that this plot contains 847 polylines. Follow the polylines from left to right to determine which range of values pertains to each cluster; the thickness of a line indicates the relative number of observations in that bin. For example, if we follow a couple of the thickest polylines for Cluster ID 3 (the purple cluster) from left to right, we see that a large number of customers in Cluster 3 are younger, have middle-of-the-road credit scores, and have lower home values.
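A rough equivalent of this view can be produced outside of SAS with pandas’ built-in parallel_coordinates plot. The sketch below clusters synthetic, range-standardized stand-in data and plots a sample of polylines colored by cluster ID; it omits the 10-bin summarization that SAS Visual Analytics performs.

```python
# Sketch of a parallel coordinates view of cluster profiles, using pandas'
# built-in plot on synthetic, range-standardized stand-in data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(11)
df = pd.DataFrame({
    "Age":         rng.uniform(16, 94, 1000),
    "CreditScore": rng.normal(680, 60, 1000),
    "HomeValue":   rng.lognormal(12, 0.5, 1000),
})
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
scaled["Cluster ID"] = KMeans(n_clusters=3, n_init=10,
                              random_state=11).fit_predict(scaled) + 1

# Plot only a sample of rows so the individual polylines stay readable.
parallel_coordinates(scaled.sample(300, random_state=11), "Cluster ID",
                     colormap="viridis", alpha=0.4)
plt.show()
```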

 

Before we finish up this post, let’s not forget to open up the Details Table for some juicy tidbits of information about our cluster analysis. Select Maximize in the upper-right corner of the cluster object.

 

11_AR_CentroidsTab.png

 

The details tables of the model objects available in SAS Visual Analytics contain a treasure trove of often-overlooked information. On the Centroids tab we can find the centroid definition for each cluster. You can see these values reflected back in the first cluster diagram that we examined: simply mouse over the large X in the middle of each cluster and the same numbers will appear. Select the Model Information tab.
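As a quick aside, here is a hedged scikit-learn sketch of the same idea as the Centroids tab: fit k-means on range-standardized synthetic data, then express the cluster centers back in their original units so they are easy to interpret. The data and column names are assumptions.

```python
# Retrieving cluster centroids, analogous to the Centroids tab: fit k-means on
# range-standardized data, then invert the scaling to report the centers in
# the original units. Synthetic stand-in data.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "Age":         rng.uniform(16, 94, 1000),
    "CreditScore": rng.normal(680, 60, 1000),
    "HomeValue":   rng.lognormal(12, 0.5, 1000),
})

scaler = MinMaxScaler()
km = KMeans(n_clusters=3, n_init=10, random_state=5).fit(scaler.fit_transform(df))

# cluster_centers_ live in the standardized space; invert the scaling so the
# centroids read in years, score points, and dollars.
centroids = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                         columns=df.columns,
                         index=[f"Cluster {i + 1}" for i in range(3)])
print(centroids.round(1))
```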

 

12_AR_ModelInformationTab.png

 

On this tab we can see that the k-means clustering algorithm was used in the background. The k-means algorithm is a very popular clustering algorithm that lets you specify how many clusters (k, an integer) should be created. It begins by randomly selecting k initial cluster centroids, known as seeds, then iteratively assigns each observation to its nearest centroid and updates the centroids, minimizing the within-cluster variance until convergence occurs. Advantages of k-means clustering are that it is efficient and scales well to large data; disadvantages are that it is sensitive to both the initial centroid selection and outliers. The cluster object in SAS Visual Analytics uses the k-means algorithm when all the inputs are interval. If all inputs are categorical, the k-modes algorithm is used, and if the inputs are a mix of the two, the k-prototypes algorithm is used.
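To make those steps concrete, here is a minimal NumPy sketch of the k-means (Lloyd’s) iteration just described: random seeds, assignment to the nearest centroid, centroid updates, and a convergence check. It is a teaching illustration, not the implementation that SAS Visual Analytics runs.

```python
# Minimal sketch of the k-means (Lloyd's) algorithm described above;
# a teaching illustration, not the SAS Visual Analytics implementation.
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k observations as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each observation to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each centroid to the mean of its assigned observations
        #    (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer move (convergence).
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(200, 2)) for c in (0, 3, 6)])
labels, centroids = kmeans(X, k=3)
print(centroids)
```

Production implementations add safeguards such as multiple random restarts and smarter seeding, which this sketch omits for brevity.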

 

I hope you’ve enjoyed our journey into the developing models portion of the AI and Analytics lifecycle. For now, clustering is the only unsupervised method that I plan to cover in this series. In my next post, I plan to introduce supervised analysis. If you would like to learn more about clustering, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Applied Clustering Techniques. See you next time and never stop learning!

 

 

Find more articles from SAS Global Enablement and Learning here.

