In a previous post, I summarized the tasks and procedures available in SAS Visual Data Mining and Machine Learning. In this post, I'll dive into the unsupervised learning category, which currently hosts several tasks: Kmeans, Kmodes, and Kprototypes Clustering, Outlier Detection, and a few variants of Principal Component Analysis.
About Unsupervised Learning
In unsupervised learning there are no known labels (outcomes), only attributes (inputs). Examples include clustering, association, and segmentation. The algorithm looks for high-density regions in multidimensional space whose members are similar to each other, and identifies structures in the data that separate these regions.
Business cases where clustering is used include customer segmentation, text topic detection, and recommendations. Business cases where outlier detection is used include fraud detection, insider threat, and cybersecurity.
Kmeans, Kmodes, and Kprototypes Clustering (PROC KCLUS)
Clustering techniques divide your data into distinct groups based on features of the data. Marketers may be interested in dividing their customer base into groups for targeted marketing campaigns, and politicians might be interested in finding groups of voters so that they can target each group more effectively with their election campaigns. In the hypothetical example illustrated in the graph above, four attributes were considered, and two of them are graphed: the feature HatesToPress1ForEnglish on the horizontal axis and the feature HasEverWornBirkenstocks on the vertical axis.
PROC KCLUS uses the k-means algorithm for clustering interval input variables, the k-modes algorithm for clustering nominal input variables, and the k-prototypes algorithm for clustering mixed input variables. Interval inputs are numeric; examples include height, weight, and temperature. Nominal input variables are categories; examples include gender, automobile make and model, and job category.
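To make the idea of clustering mixed inputs concrete, here is a minimal Python sketch of a textbook k-prototypes dissimilarity: squared Euclidean distance on the interval inputs plus a weighted count of mismatches on the nominal inputs. The function name, the `gamma` weight, and the simple-matching rule are assumptions for illustration only; PROC KCLUS may use different distance measures and defaults.

```python
def mixed_dissimilarity(a_interval, a_nominal, b_interval, b_nominal, gamma=1.0):
    """Textbook k-prototypes dissimilarity between two observations.

    Interval part: squared Euclidean distance (as in k-means).
    Nominal part: number of mismatching categories (as in k-modes).
    gamma is a hypothetical weight that balances the two parts.
    """
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_interval, b_interval))
    mismatches = sum(1 for x, y in zip(a_nominal, b_nominal) if x != y)
    return numeric_part + gamma * mismatches


# Two made-up observations: two interval inputs and two nominal inputs each.
d = mixed_dissimilarity([1.0, 2.0], ["red", "suv"],
                        [4.0, 6.0], ["red", "sedan"], gamma=0.5)
print(d)
```

With `gamma=0`, the nominal inputs are ignored and the measure reduces to the k-means case; with no interval inputs, it reduces to the k-modes mismatch count.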
PROC KCLUS first chooses initial cluster centroids at random. It then computes the distance from each point to each centroid, assigns each point to its nearest centroid, and re-estimates the centroids by least squares (L2) estimation. It repeats these steps iteratively to find the best clusters, i.e., those clusters that minimize within-cluster variability and maximize between-cluster variability. Each iteration reduces the least squares criterion for the Euclidean distance until convergence or until the maximum number of iterations is reached. The observations are divided into clusters such that every observation belongs to one and only one cluster. You can use the aligned box criterion (ABC) method to estimate the number of clusters.
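The iterative procedure described above can be sketched in a few lines of plain Python. This is the generic k-means (Lloyd) loop, not the PROC KCLUS implementation: the random initialization from the data points, the `seed` parameter, and the stopping rule (assignments no longer change) are all assumptions made for the illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Generic k-means sketch: returns (labels, centroids, within-cluster SS)."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct observations at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no observation changed cluster
        labels = new_labels
        # Update step: the least squares estimate of a centroid is the mean.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    wcss = sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Two well-separated, made-up groups of observations.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids, wcss = kmeans(data, k=2)
print(labels, centroids, wcss)
```

Every observation receives exactly one label, which mirrors the "one and only one cluster" property described above.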
PROC KCLUS runs in CAS, and so it reads and writes data in distributed form, and performs clustering and scoring in parallel by making full use of multicore computers or distributed computing environments.
Principal Component Analysis (the PCA, RPCA, and MWPCA procedures)
Principal component analysis is a multivariate technique for examining relationships among several quantitative variables. It helps to reduce the curse of dimensionality by creating new composite variables (principal components) that capture as much information as possible from the original inputs.
PROC PCA performs principal component analysis in SAS Viya. It can calculate principal components in three ways, including eigenvalue decomposition and the NIPALS (nonlinear iterative partial least squares) method.
Eigenvalue decomposition is more efficient when you want to calculate all principal components, whereas the NIPALS method is faster if you want to extract only the first few principal components.
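To show why an iterative method suits the "first few components" case, here is a minimal pure-Python sketch of the textbook NIPALS iteration: it extracts one component at a time and then deflates the data, so it can stop after as many components as you need. The function name, tolerances, and starting column are assumptions for illustration; this is not the PROC PCA implementation.

```python
def nipals_pca(rows, n_components=1, max_iter=500, tol=1e-12):
    """Textbook NIPALS sketch: returns up to n_components unit loading vectors."""
    n, m = len(rows), len(rows[0])
    # Center each column so the components describe variation around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]
    loadings = []
    for _component in range(n_components):
        col_ss = [sum(x[j] ** 2 for x in X) for j in range(m)]
        if max(col_ss) < 1e-12:
            break  # nothing left to explain
        # Start the scores from the column with the largest sum of squares.
        start = col_ss.index(max(col_ss))
        t = [x[start] for x in X]
        for _iteration in range(max_iter):
            tt = sum(v * v for v in t)
            # Loadings: regress the columns of X on the scores, then normalize.
            p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = sum(v * v for v in p) ** 0.5
            p = [v / norm for v in p]
            # Scores: project the rows of X onto the loadings.
            t_new = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t, t_new))
            t = t_new
            if change < tol:
                break
        loadings.append(p)
        # Deflate: remove the extracted component before finding the next one.
        X = [[X[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
    return loadings


# Made-up data in which the second input is roughly twice the first.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
first_loading = nipals_pca(rows, n_components=1)[0]
print(first_loading)
```

Because each component costs only a few passes over the data, asking for one or two components avoids the work of a full decomposition.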
PROC RPCA implements robust principal component analysis (RPCA) in SAS Viya. PROC RPCA decomposes the input matrix into a sum of two matrices: a low-rank matrix and a sparse matrix. Robustness in RPCA comes from the property that the principal components are computed from observations after removing the outliers—that is, from the low-rank matrix.
PROC MWPCA implements moving windows robust principal component analysis (MWPCA) in SAS Viya. You can use this procedure to capture changes in principal components over time by using sliding windows, and you can perform RPCA on each window.
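The sliding-window idea can be illustrated with a small Python sketch. To keep it short, it applies ordinary (non-robust) PCA to two-dimensional data in each window, using the closed-form angle of the first principal axis, so it is a simplification of what PROC MWPCA does. The function names, window size, and step are assumptions.

```python
import math


def sliding_windows(rows, size, step):
    """Yield consecutive windows of `size` time-ordered rows, `step` apart."""
    for start in range(0, len(rows) - size + 1, step):
        yield rows[start:start + size]


def principal_angle_degrees(window):
    """Angle of the first principal axis of 2-D data, in degrees.

    Uses the closed form 0.5 * atan2(2*sxy, sxx - syy) for a 2x2
    covariance matrix (plain PCA, not robust PCA).
    """
    n = len(window)
    mx = sum(x for x, _ in window) / n
    my = sum(y for _, y in window) / n
    sxx = sum((x - mx) ** 2 for x, _ in window)
    syy = sum((y - my) ** 2 for _, y in window)
    sxy = sum((x - mx) * (y - my) for x, y in window)
    return math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy))


# Made-up series: the relationship between the inputs flips halfway through.
series = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 3), (6, 2), (7, 1)]
angles = [principal_angle_degrees(w)
          for w in sliding_windows(series, size=4, step=4)]
print(angles)
```

A single PCA over the whole series would blur the two regimes together; computing the components per window exposes the change in direction over time.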
Outlier Detection (PROC SVDD)
PROC SVDD implements the support vector data description (SVDD) algorithm to perform outlier detection in SAS Viya. An SVDD model is obtained by building a minimum-radius hypersphere around the one-class training data (that is, no target is used). You can choose between a stochastic subset solver, which quickly gives you an approximate solution, and an active-set solver, which provides a more accurate solution.
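As a rough illustration of the hypersphere idea, the Python sketch below handles the simplest special case: no kernel and no slack, where SVDD reduces to finding the minimum enclosing ball of the training data. It uses a simple iterative approximation (repeatedly nudging the center toward the farthest point) that is swapped in for brevity; it is neither of the PROC SVDD solvers, and the function names and iteration count are assumptions.

```python
import math


def fit_enclosing_ball(points, n_iter=1000):
    """Approximate the minimum enclosing ball of the training points.

    Simplified stand-in for SVDD training (no kernel, no slack):
    at step i, move the center a fraction 1/(i+1) toward the farthest point.
    """
    center = list(points[0])
    for i in range(1, n_iter + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, farthest)]
    # The radius covers every training point, so none is flagged at training.
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def is_outlier(point, center, radius):
    """Score a new observation: outside the hypersphere means outlier."""
    return math.dist(point, center) > radius


# Made-up one-class training data (no target variable).
training = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
center, radius = fit_enclosing_ball(training)
print(center, radius)
print(is_outlier((1.2, 0.9), center, radius))
print(is_outlier((5, 5), center, radius))
```

Scoring needs only the center and the radius, which is why a fitted description of the "normal" data is cheap to apply to new observations.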
I hope that this has been helpful. In future posts, I'll describe the tasks that make up the Supervised Learning models -- there are many more of those to cover!