SAS Communities Library


Your Toolbox for Unsupervised Machine Learning in SAS Viya

Started ‎12-15-2024 by
Modified ‎12-15-2024 by

Unsupervised learning methods are machine learning algorithms that work with data that have no historical labels or target variable. The system is not told the "right answer"; the algorithm must figure out on its own what is being shown. The goal of unsupervised learning is to explore the data, uncover hidden patterns, and find structures and relationships within it.

 

Unsupervised learning methods are crucial in the field of machine learning, particularly for handling high-dimensional data, and are applied in various areas such as visualization, feature extraction, feature selection, data reduction, anomaly detection, association rules mining and more.

 



 

 


SAS Viya provides a rich suite of unsupervised machine learning techniques, available in both point-and-click and programming modes. These unsupervised learning methods provide powerful tools for understanding and utilizing high-dimensional data, enabling insights and applications across various domains.

 

Here's a tabular summary of unsupervised learning methods in SAS Viya:

 

| Application / Task | Method / Technique | SAS Procedure | Model Studio Node | CAS Action | Product |
|---|---|---|---|---|---|
| Visualizing High Dimensional Data | t-distributed Stochastic Neighbor Embedding | PROC TSNE | Data Exploration node | tSne Action | SAS Machine Learning |
| Feature Extraction | Principal Component Analysis | PROC PCA | Feature Extraction node | eig Action, itergs Action, nipals Action, randompca Action | SAS Visual Statistics |
| Feature Extraction | Singular Value Decomposition | PROC RPCA | Feature Extraction node | robustpca Action | SAS Machine Learning |
| Feature Extraction | Robust Principal Component Analysis | PROC RPCA | Feature Extraction node | robustpca Action | SAS Machine Learning |
| Feature Extraction | Moving Window Principal Component Analysis | PROC MWPCA | - | mwpca Action | SAS Machine Learning |
| Feature Extraction | Autoencoding Neural Networks | PROC NNET | Feature Extraction node | annTrain Action | SAS Machine Learning |
| Feature Extraction | Non-Negative Matrix Factorization | PROC NMF | - | nmf Action | SAS Visual Statistics |
| Feature Selection | Unsupervised Selection Method | PROC VARREDUCE | Variable Selection node | unsuper Action | SAS Visual Statistics |
| Feature Selection | Variable Clustering | PROC GVARCLUS | Variable Clustering node | gvarcluster Action | SAS Machine Learning |
| Data Reduction Using Cluster Analysis | Partitive Clustering Using k-Means | PROC KCLUS | Clustering node | kClus Action | SAS Visual Statistics |
| Data Reduction Using Cluster Analysis | Model-Based Clustering | PROC MBC | - | mbcFit Action | SAS Visual Statistics |
| Data Reduction Using Cluster Analysis | Nonparametric Bayesian Gaussian Mixture Model | PROC GMM | - | gmm Action | SAS Machine Learning |
| Anomaly Detection | Support Vector Data Description | PROC SVDD | Anomaly Detection node | svddTrain Action | SAS Machine Learning |
| Anomaly Detection | Isolation Forest | PROC FOREST | - | forestTrain Action | SAS Visual Analytics |
| Association Rules Mining | Market Basket Analysis | PROC MBANALYSIS | - | mbanalysis Action | SAS Machine Learning |

 

Let's briefly discuss each one.

 

Visualizing High Dimensional Data

 

Unsupervised learning methods are instrumental in reducing the dimensionality of high-dimensional data for visualization, allowing us to interpret complex datasets.

 

  • t-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for dimensionality reduction that is particularly well-suited for visualizing high-dimensional data sets. It transforms them into two- or three-dimensional points while preserving the pairwise distances between closely neighboring observations and relaxing distances for non-neighboring observations. For further details, check out this post: "Visualizing High-Dimensional Data with t-SNE" on sas.com.

 

Feature Extraction

 

Feature extraction involves creating new features from raw data to improve model performance, with unsupervised methods helping to identify and extract important features.

 

  • Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components, with the goal of reducing the dimensionality of the data while retaining as much as possible of the variation present in the dataset. To learn how to perform principal component analysis using PROC PCA in SAS Viya, watch this video: "Principal Component Analysis Using the PCA Procedure in SAS Viya" on sas.com.
  • Singular Value Decomposition (SVD) is a matrix decomposition method that breaks down a matrix into three separate matrices, projecting high-dimensional document and term spaces into a lower-dimension space, and is often used in text mining to transform a term-by-document frequency matrix into a data set suitable for data mining purposes. For further details, check out this blog post: "The Singular Value Decomposition: A Fundamental Technique in Multivariate Data Analysis" on sas.com.
  • Robust Principal Component Analysis (RPCA) is a matrix decomposition algorithm that decomposes an input matrix into a low-rank matrix and a sparse matrix, which can be used for feature extraction and anomaly detection, respectively, and is particularly useful in handling data that might be noisy or contain outliers.
  • Moving Window Principal Component Analysis (MWPCA) implements principal component analysis over sliding windows of observations, allowing for the capture of changes in principal components over time and the detection of relative changes in parts of a system compared to the overall system.
  • Autoencoders are neural networks used for efficient coding, feature extraction, and nonlinear principal component analysis. An autoencoder seeks to model its own inputs through an architecture consisting of an input layer, hidden layers (encoding layers), and an output layer (decoding layer) that is a duplicate of the input layer. For more information, check out this SAS tutorial: "Unsupervised Learning Example: Autoencoders" on youtube.com.
  • Nonnegative Matrix Factorization (NMF) is a dimension reduction technique that approximately decomposes a nonnegative data matrix into two low-rank nonnegative factor matrices, often used for feature identification and extraction in various fields such as image processing, text mining, bioinformatics, and spectral data analysis.
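
To make the idea behind PCA concrete, here is a minimal sketch in plain Python (not SAS code, and not a description of how PROC PCA works internally). It centers a tiny, made-up two-variable data set, builds the covariance matrix, and uses power iteration to find the first principal component. The data values and all function names are assumptions chosen for illustration.

```python
import math

def covariance_matrix(rows):
    """Return (sample covariance matrix, centered rows) for a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Power iteration: largest eigenvalue of cov and its unit eigenvector."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v

# Hypothetical data: two strongly correlated variables.
data = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.2], [8.0, 7.8], [10.0, 10.0]]

cov, centered = covariance_matrix(data)
eigenvalue, direction = first_principal_component(cov)

# Share of total variance captured by the first component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Principal component scores: projection of each centered row on the direction.
scores = [sum(c[j] * direction[j] for j in range(len(direction)))
          for c in centered]

print(f"direction={direction}, explained variance ratio={explained:.4f}")
```

Because the two variables move almost in lockstep, a single component captures nearly all of the variation, which is exactly the situation in which replacing the original variables with a few components loses little information.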

 

Feature Selection

 

Feature selection focuses on choosing a subset of relevant features for model construction, enhancing interpretability and performance.

 

  • Unsupervised Selection Method identifies a set of input variables that jointly explain the maximum amount of data variance, without considering the target variable. Unlike PCA, this method reduces dimensionality by selecting a subset of the original variables, thus preserving model interpretability. To learn how to perform unsupervised variable reduction using the VARREDUCE procedure in SAS Viya, watch this video: "Unsupervised Variable Reduction Using the VARREDUCE Procedure in SAS Viya" on sas.com.
  • Variable Clustering groups input variables that are highly correlated, helping to reduce redundancy and collinearity in the data. This method involves performing graphical LASSO modeling, creating tables that include edge and vertex information, defining an undirected graph. In this way, the relationships among all the variables are expressed through an undirected graphical model. For more information check out this SAS tutorial "Feature Selection Using Graphical Lasso" on youtube.com.
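
The core idea of grouping redundant variables can be sketched in a few lines of plain Python. Note that this is a deliberately simplified stand-in: it links variables whose absolute Pearson correlation exceeds a threshold and reports the connected groups. It is not the graphical lasso used by the variable clustering method described above, and the variable names, data values, and 0.9 threshold are all assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_groups(columns, threshold=0.9):
    """Group variables connected by |correlation| >= threshold."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(b)] = find(a)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())

# Hypothetical inputs: two height measurements carry the same information.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "income": [52, 31, 75, 40, 58],
}

groups = correlated_groups(columns)
print(groups)
```

Keeping one representative variable per group is one simple way to reduce redundancy and collinearity while still working with the original, interpretable inputs.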

 

Data Reduction Using Cluster Analysis

 

Data reduction aims to decrease the volume of data while preserving its informational content, facilitating efficient storage and processing.

 

  • Partitive Clustering partitions observations into clusters so that observations in the same cluster are similar and observations in different clusters are dissimilar. It uses methods like k-means, k-modes, and k-prototypes for interval, nominal, and mixed input variables, respectively. PROC KCLUS also provides a technique called the aligned box criterion for estimating the number of clusters in the data table. To learn how to perform k-means clustering and segmentation in SAS Viya using PROC KCLUS, watch this video: "Unsupervised Segmentation Using the KCLUS Procedure in SAS Viya" on sas.com.
  • Model-Based Clustering (MBC), such as the Gaussian mixture model, is a soft clustering method that assumes each data point is generated from a mixture of normal distributions. You specify the number of clusters in advance, and the method fits that same fixed number of Gaussian components. For further details, check out these posts: "Model-Based Clustering (Part 1): Exploring Its Significance" and "Model-Based Clustering (Part 2): A Detailed Look at the MBC Procedure".
  • The Nonparametric Bayesian Gaussian Mixture Model is a probabilistic model that assumes all data points are generated from a mixture of Gaussian distributions. It generalizes k-means clustering to include information about the data's covariance as well as the centers of the latent Gaussians. It provides soft clustering and allows a more flexible, data-driven determination of the number of mixture components, accommodating varying complexity in the underlying data distribution. For further details, check out this post: "Model-Based Clustering (Part 3): The GMM Procedure Demystified".
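
As a rough illustration of partitive clustering, the following plain Python sketch implements the basic k-means loop (assign each point to its nearest center, then recompute the centers) on a tiny made-up data set. It is not SAS code and makes no claim about how PROC KCLUS is implemented; the points, the starting centers, and the function names are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centers, max_iterations=100):
    """Basic k-means: alternate assignment and center update until stable."""
    centers = [list(c) for c in initial_centers]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest center for each point.
        new_assignments = [
            min(range(len(centers)),
                key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the solution is stable
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centers, labels = k_means(points, initial_centers=[points[0], points[3]])
print(labels, centers)
```

Unlike the mixture-model approaches above, this loop produces a hard assignment: each observation belongs to exactly one cluster rather than receiving a membership probability for each component.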

 

Anomaly Detection

 

Anomaly detection identifies unusual patterns that deviate from expected behavior.

 

  • Support Vector Data Description (SVDD) is a one-class classification technique that identifies a minimum-radius hypersphere around the training data, providing a geometric description of the data, and is particularly useful in applications where data for one class is abundant but scarce or missing for other classes, enabling outlier detection.
  • Isolation Forest is an anomaly detection method that identifies outliers by constructing a forest of decision trees, where anomalous observations are likely to have a shorter path from the root node to the leaf node than non-anomalous observations. For more information, see this SAS paper: "Detecting Fraud and Other Anomalies Using Isolation Forests".
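
The "shorter path" intuition behind isolation forests can be sketched in plain Python. The example below is a simplified illustration, not SAS code and not the PROC FOREST implementation: each tree uses all rows instead of a random subsample, and the raw average path length is reported without the usual normalization into an anomaly score. The data, seed, tree count, and depth limit are assumptions.

```python
import random

def build_tree(rows, depth, max_depth, rng):
    """Grow one isolation tree using a random feature and random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return None  # leaf: nothing left to isolate or depth limit reached
    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:
        return None  # all values identical on this feature, cannot split
    threshold = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return (feature, threshold,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(row, tree):
    """Number of splits traversed before the row reaches a leaf."""
    depth = 0
    while tree is not None:
        feature, threshold, left, right = tree
        tree = left if row[feature] < threshold else right
        depth += 1
    return depth

def average_path_length(row, forest):
    return sum(path_length(row, tree) for tree in forest) / len(forest)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical data: 30 points in the unit square plus one distant point.
inliers = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(30)]
outlier = (10.0, 10.0)
rows = inliers + [outlier]

forest = [build_tree(rows, 0, 10, rng) for _ in range(100)]

outlier_path = average_path_length(outlier, forest)
inlier_paths = [average_path_length(p, forest) for p in inliers]
print(f"outlier: {outlier_path:.2f}, shortest inlier: {min(inlier_paths):.2f}")
```

Because the distant point is usually separated from everything else by the very first random split, its average path is much shorter than that of any point inside the dense region.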

 

Association Rules Mining

 

Association rules mining uncovers interesting relationships or associations between variables in large datasets.

 

  • Market Basket Analysis (MBA) is a data mining technique that uses association rule mining to discover co-occurrence relationships among a set of items. It is typically used to analyze customer purchasing patterns by identifying the items they buy together. For further details, check out this blog post: "Visualizing the Results of a Market Basket Analysis in SAS Viya" on sas.com.
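
Association rules are usually judged by three standard measures: support, confidence, and lift. The plain Python sketch below computes them for a single rule on a handful of made-up transactions. It is an illustration of the measures only, not SAS code and not the output format of PROC MBANALYSIS; the transactions and function names are assumptions.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return {"support": both, "confidence": confidence, "lift": lift}

metrics = rule_metrics({"diapers"}, {"beer"}, transactions)
print(metrics)
```

For the rule diapers -> beer, three of the five baskets contain both items (support 0.6), three of the four diaper baskets also contain beer (confidence 0.75), and the lift of about 1.25 indicates that beer appears more often alongside diapers than it does overall.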

 

Unsupervised learning plays a pivotal role in the machine learning field by enabling the extraction of valuable insights and patterns from unlabeled data. It facilitates data exploration, dimensionality reduction, clustering, anomaly detection, and feature learning, all of which are critical tasks in various domains such as finance, healthcare, and image recognition.

 

SAS boasts a rich suite of unsupervised learning methods, offering a comprehensive range of techniques and tools to tackle complex data analysis tasks effectively. From clustering algorithms like k-means to dimensionality reduction techniques such as PCA, SAS provides a robust framework for researchers and practitioners to explore, analyze, and derive meaningful insights from their data.

 

 

Find more articles from SAS Global Enablement and Learning here.
