Your Toolbox for Unsupervised Machine Learning in SAS Viya

Unsupervised learning methods are a class of machine learning algorithms that work with data that have no historical labels or target variable. The system is not told the "right answer"; the algorithm must figure out what it is being shown. The goal of unsupervised learning is to explore the data, uncover hidden patterns, and find structures and relationships within it.

Unsupervised learning methods are crucial in the field of machine learning, particularly for handling high-dimensional data, and are applied in various areas such as visualization, feature extraction, feature selection, data reduction, anomaly detection, association rules mining and more.

SAS Viya provides a rich suite of unsupervised machine learning techniques, available in both point-and-click and programming modes. These unsupervised learning methods provide powerful tools for understanding and utilizing high-dimensional data, enabling insights and applications across various domains.

Here's a tabular summary of unsupervised learning methods in SAS Viya:

Application / Task	Method / Technique	SAS Procedure	Model Studio Node	CAS Action	Product
Visualizing High Dimensional Data	t-distributed Stochastic Neighbor Embedding	PROC TSNE	Data Exploration node	tSne Action	SAS Machine Learning
Feature Extraction	Principal Component Analysis	PROC PCA	Feature Extraction node	eig Action, itergs Action, nipals Action, randompca Action	SAS Visual Statistics
	Singular Value Decomposition	PROC RPCA	Feature Extraction node	robustpca Action	SAS Machine Learning
	Robust Principal Component Analysis	PROC RPCA	Feature Extraction node	robustpca Action	SAS Machine Learning
	Moving Window Principal Component Analysis	PROC MWPCA	- - -	mwpca Action	SAS Machine Learning
	Autoencoding Neural Networks	PROC NNET	Feature Extraction node	annTrain Action	SAS Machine Learning
	Non-Negative Matrix Factorization	PROC NMF	- - -	nmf Action	SAS Visual Statistics
Feature Selection	Unsupervised Selection Method	PROC VARREDUCE	Variable Selection node	unsuper Action	SAS Visual Statistics
Feature Selection	Variable Clustering	PROC GVARCLUS	Variable Clustering node	gvarcluster Action	SAS Machine Learning
Data Reduction using Cluster Analysis	Partitive Clustering Using k-Means	PROC KCLUS	Clustering node	kClus Action	SAS Visual Statistics
	Model-Based Clustering	PROC MBC	- - -	mbcFit Action	SAS Visual Statistics
	Nonparametric Bayesian Gaussian Mixture Model	PROC GMM	- - -	gmm Action	SAS Machine Learning
Anomaly Detection	Support Vector Data Description	PROC SVDD	Anomaly Detection node	svddTrain Action	SAS Machine Learning
Anomaly Detection	Isolation Forest	PROC FOREST	- - -	forestTrain Action	SAS Visual Analytics
Association Rules Mining	Market Basket Analysis	PROC MBANALYSIS	- - -	mbanalysis Action	SAS Machine Learning

Let's briefly discuss each one.

Visualizing High Dimensional Data

Unsupervised learning methods are instrumental in reducing the dimensionality of high-dimensional data for visualization, allowing us to interpret complex datasets.

t-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for dimensionality reduction that is particularly well suited to visualizing high-dimensional data sets. It maps the data to two- or three-dimensional points while preserving the pairwise distances between closely neighboring observations and relaxing the distances between non-neighboring observations. For further details, check out this post: "Visualizing High-Dimensional Data with t-SNE" on sas.com.

Feature Extraction

Feature extraction involves creating new features from raw data to improve model performance, with unsupervised methods helping to identify and extract important features.

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components, with the goal of reducing the dimensionality of the data while retaining as much as possible of the variation present in the dataset. To learn how to perform principal component analysis using PROC PCA in SAS Viya, watch this video: "Principal Component Analysis Using the PCA Procedure in SAS Viya" on sas.com.
Singular Value Decomposition (SVD) is a matrix decomposition method that breaks down a matrix into three separate matrices, projecting high-dimensional document and term spaces into a lower-dimension space, and is often used in text mining to transform a term-by-document frequency matrix into a data set suitable for data mining purposes. For further details, check out this blog post: "The Singular Value Decomposition: A Fundamental Technique in Multivariate Data Analysis" on sas.com.
Robust Principal Component Analysis (RPCA) is a matrix decomposition algorithm that decomposes an input matrix into a low-rank matrix and a sparse matrix, which can be used for feature extraction and anomaly detection, respectively, and is particularly useful in handling data that might be noisy or contain outliers.
Moving Window Principal Component Analysis (MWPCA) implements principal component analysis over sliding windows of observations, allowing for the capture of changes in principal components over time and the detection of relative changes in parts of a system compared to the overall system.
Autoencoders are neural networks used for efficient coding, feature extraction, and nonlinear principal component analysis. An autoencoder seeks to model its inputs through an architecture consisting of an input layer, hidden layers (encoding layers), and an output layer (decoding layer) that is a duplicate of the input layer. For more information, check out this SAS tutorial: "Unsupervised Learning Example: Autoencoders" on youtube.com.
Nonnegative Matrix Factorization (NMF) is a dimension reduction technique that approximately decomposes a nonnegative data matrix into two low-rank nonnegative factor matrices, often used for feature identification and extraction in various fields such as image processing, text mining, bioinformatics, and spectral data analysis.
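To make the idea behind principal components concrete, here is a minimal sketch in plain Python (standard library only). It is not how PROC PCA or the CAS actions are implemented; it simply finds the first principal component of a tiny made-up data set by power iteration on the sample covariance matrix. The function name `first_principal_component` and the sample data are assumptions chosen for illustration.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, variance along it, share of total variance)."""
    n = len(rows)
    d = len(rows[0])
    # Center each variable at its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (divisor n - 1).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue, eigenvalue / total_variance

# Illustrative data: two perfectly correlated variables (y = 2x).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance, share = first_principal_component(data)
print(direction, variance, share)
```

Because the two variables in this toy example are perfectly correlated, a single component carries all of the variation, which is the situation in which replacing many inputs with a few components loses the least information.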

Feature Selection

Feature selection focuses on choosing a subset of relevant features for model construction, enhancing interpretability and performance.

Unsupervised Selection Method identifies a set of input variables that jointly explain the maximum amount of data variance, without considering the target variable. Unlike PCA, this method reduces dimensionality by selecting a subset of the original variables, thus preserving model interpretation. To learn how to perform unsupervised variable reduction using the VARREDUCE procedure in SAS Viya, watch this video: "Unsupervised Variable Reduction Using the VARREDUCE Procedure in SAS Viya" on sas.com.
Variable Clustering groups input variables that are highly correlated, helping to reduce redundancy and collinearity in the data. The method performs graphical LASSO modeling and creates tables of edge and vertex information that define an undirected graph. In this way, the relationships among all the variables are expressed through an undirected graphical model. For more information, check out this SAS tutorial: "Feature Selection Using Graphical Lasso" on youtube.com.

Data Reduction Using Cluster Analysis

Data reduction aims to decrease the volume of data while preserving its informational content, facilitating efficient storage and processing.

Partitive Clustering partitions observations into clusters so that observations in the same cluster are similar and observations in different clusters are dissimilar. It uses k-means, k-modes, and k-prototypes for interval, nominal, and mixed input variables, respectively, and provides a technique called the aligned box criterion for estimating the number of clusters in the data table. To learn how to perform k-means clustering and segmentation in SAS Viya using PROC KCLUS, watch this video: "Unsupervised Segmentation Using the KCLUS Procedure in SAS Viya" on sas.com.
Model-Based Clustering (MBC), such as the Gaussian mixture model, is a soft clustering method that assumes each data point is generated from a mixture of normal distributions. It requires you to specify the number of clusters in advance and fits one Gaussian component per cluster. For further details, check out these posts: "Model-Based Clustering (Part 1): Exploring Its Significance" and "Model-Based Clustering (Part 2): A Detailed Look at the MBC Procedure".
Nonparametric Bayesian Gaussian Mixture Model is a probabilistic model that assumes all data points are generated from a mixture of Gaussian distributions. It generalizes k-means clustering by including information about the data's covariance and the centers of the latent Gaussians, and so provides soft clustering. It also allows a more flexible, data-driven determination of the number of mixture components, accommodating varying complexity in the underlying data distribution. For further details, check out this post: "Model-Based Clustering (Part 3): The GMM Procedure Demystified".
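As a rough illustration of the partitive (k-means) idea described above, the following plain-Python sketch alternates between assigning each observation to its nearest centroid and recomputing each centroid as the mean of its members. It is a teaching sketch, not the PROC KCLUS algorithm; the function names, the starting centroids, and the sample points are all assumptions made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Basic k-means (Lloyd's algorithm) from user-supplied starting centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: sq_dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two visibly separated groups of interval inputs.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)     # cluster index per observation
print(centroids)  # final cluster centers
```

Each observation here receives exactly one cluster label (hard clustering), which is the contrast with the soft, probability-based assignments of the mixture-model methods.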

Anomaly Detection

Anomaly detection identifies unusual patterns that deviate from expected behavior.

Support Vector Data Description (SVDD) is a one-class classification technique that identifies a minimum-radius hypersphere around the training data, providing a geometric description of the data, and is particularly useful in applications where data for one class is abundant but scarce or missing for other classes, enabling outlier detection.
Isolation Forest is an anomaly detection method that identifies outliers by constructing a forest of decision trees, where anomalous observations are likely to have a shorter path from the root node to the leaf node than non-anomalous observations. For more information, see this SAS paper: "Detecting Fraud and Other Anomalies Using Isolation Forests".
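The path-length intuition behind isolation forests can be sketched in a few lines of plain Python. This is a simplified illustration, not the PROC FOREST implementation: instead of building and storing trees on subsamples, it simulates, for one observation at a time, the sequence of random splits that a tree would apply until that observation is alone. The names `path_length` and `average_path_length`, the depth limit, and the data are assumptions for the example.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Count random splits until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only split on variables that still vary within this partition.
    splittable = [d for d in range(len(point))
                  if min(r[d] for r in data) < max(r[d] for r in data)]
    if not splittable:
        return depth
    dim = rng.choice(splittable)
    lo = min(r[dim] for r in data)
    hi = max(r[dim] for r in data)
    split = rng.uniform(lo, hi)
    # Keep only the observations on the same side of the split as `point`.
    same_side = [r for r in data if (r[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average the simulated path length over many random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Illustrative data: a tight 5 x 5 grid of normal points plus one distant point.
normal = [(x / 10, y / 10) for x in range(5) for y in range(5)]
outlier = (5.0, 5.0)
data = normal + [outlier]

outlier_avg = average_path_length(outlier, data)
inlier_avg = average_path_length((0.2, 0.2), data)
print(outlier_avg, inlier_avg)
```

The distant point is usually separated by the first or second split, while the point in the middle of the grid needs several, so a shorter average path is a signal of anomaly.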

Association Rules Mining

Association rules mining uncovers interesting relationships or associations between variables in large datasets.

Market Basket Analysis (MBA) is a technique used in data mining that uses association rule mining to discover the co-occurrence relationships among a set of items, typically used to analyze customer purchasing patterns by identifying the items they buy together. For further details, check out this blog post: "Visualizing the Results of a Market Basket Analysis in SAS Viya" on sas.com.
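Association rules are usually judged by three standard measures: support, confidence, and lift. The plain-Python sketch below computes them for a single rule on a handful of made-up transactions. It illustrates the measures only and does not reflect how PROC MBANALYSIS searches for rules; the function name `rule_metrics` and the baskets are assumptions for the example.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    a = set(antecedent)
    c = set(consequent)
    count_a = sum(1 for t in transactions if a <= t)           # baskets with antecedent
    count_c = sum(1 for t in transactions if c <= t)           # baskets with consequent
    count_both = sum(1 for t in transactions if (a | c) <= t)  # baskets with both
    support = count_both / n              # how often the items occur together
    confidence = count_both / count_a     # P(consequent | antecedent)
    lift = confidence / (count_c / n)     # confidence relative to the baseline rate
    return support, confidence, lift

# Illustrative baskets (each transaction is a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
support, confidence, lift = rule_metrics(transactions, {"bread"}, {"milk"})
print(support, confidence, lift)
```

A lift above 1 suggests that the items appear together more often than they would if purchased independently; a lift below 1, as in this toy example, suggests the opposite. The sketch assumes the antecedent and consequent each occur in at least one transaction.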

Unsupervised learning plays a pivotal role in the machine learning field by enabling the extraction of valuable insights and patterns from unlabeled data. It facilitates data exploration, dimensionality reduction, clustering, anomaly detection, and feature learning, all of which are critical tasks in various domains such as finance, healthcare, and image recognition.

SAS boasts a rich suite of unsupervised learning methods, offering a comprehensive range of techniques and tools to tackle complex data analysis tasks effectively. From clustering algorithms like k-means to dimensionality reduction techniques such as PCA, SAS provides a robust framework for researchers and practitioners to explore, analyze, and derive meaningful insights from their data.

SAS Communities Library