Non-negative Matrix Factorization (NMF) is a linear algebra technique that decomposes a non-negative matrix into two lower-rank non-negative matrices, capturing underlying patterns in data and enabling feature engineering. It is a powerful dimensionality reduction and feature extraction method widely used in machine learning and data analysis.
NMF aims to approximately represent a nonnegative data matrix as the product of two low-rank nonnegative factor matrices. Given a dataset with n observations and p variables, NMF seeks to express X ≈ WH, where X is the n×p nonnegative data matrix, and W and H are low-rank nonnegative factor matrices of dimensions n×r and r×p, respectively. Here r is the rank of the factorization. The matrix W is usually called the features (or basis) matrix, while H is commonly referred to as the weights (or coefficients) matrix.
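The dimensions above can be made concrete with a minimal NumPy sketch using made-up numbers (n = 6, p = 4, r = 2 are arbitrary choices for illustration): the product of two nonnegative factors is itself a nonnegative n×p matrix.

```python
import numpy as np

# Hypothetical sizes: n = 6 observations, p = 4 variables, rank r = 2.
rng = np.random.default_rng(0)
W = rng.random((6, 2))   # n x r features (basis) matrix, nonnegative
H = rng.random((2, 4))   # r x p weights (coefficients) matrix, nonnegative

X_approx = W @ H         # n x p reconstruction of the data matrix
print(X_approx.shape)          # (6, 4)
print((X_approx >= 0).all())   # True: nonnegative factors give a nonnegative product
```

Note that only n×r + r×p numbers are stored in the factors, versus n×p in the full matrix, which is where the compression comes from when r is small.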
Where Can NMF be Applied, and What Types of Data Is It Most Suitable For?
NMF is applied to non-negative data because it inherently enforces a parts-based, additive representation. Unlike other factorization methods that allow negative values and rely on cancellation effects, NMF decomposes data into non-negative components, making it particularly useful for applications where negative values have no meaningful interpretation.
Some common data types and their corresponding applications include text data (term-document matrices for topic modeling), user-item ratings (recommender systems), audio spectrograms (source separation), and gene expression data (bioinformatics).
Beyond these industry-specific applications, NMF is also valuable for clustering, as its factorized components can be interpreted as cluster representations. This makes it useful for detecting outliers and anomalies in various datasets, such as fraud detection in financial transactions.
A Feature Engineering Technique
NMF is a feature engineering technique commonly used in machine learning and data analysis. It supports dimensionality reduction (each observation is summarized by r latent components instead of p original variables), feature identification (each row of H shows which original variables define a component), and feature extraction (the rows of W serve as new, compact features for downstream models).
How do we Control Approximation Accuracy?
In NMF, the rank (r) is the number of latent features, or components, used to approximate the original matrix. You must specify the rank, which must be a positive integer.

The choice of r determines the complexity and quality of the approximation: it sets how many latent components define the two low-rank nonnegative factor matrices. A small rank yields a more compressed representation but may lose important details; a large rank provides a more faithful approximation but may lead to overfitting. Choosing the optimal rank is crucial and is often done empirically or with techniques such as cross-validation, the elbow method, or domain knowledge.
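The elbow method mentioned above can be sketched in a few lines of Python using scikit-learn's NMF implementation (an illustration only, not PROC NMF; the data here is random and purely hypothetical): fit several ranks and watch where the reconstruction error stops improving quickly.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy nonnegative data; in practice X would be term counts, ratings, etc.
rng = np.random.default_rng(1)
X = rng.random((100, 20))

# Elbow-method sketch: error typically drops sharply at first, then flattens;
# the "elbow" of the curve suggests a reasonable rank.
errs = []
for r in [1, 2, 4, 8]:
    model = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=1)
    model.fit(X)
    errs.append(model.reconstruction_err_)
    print(f"r={r}: reconstruction error = {errs[-1]:.3f}")
```

Plotting `errs` against the candidate ranks and looking for the bend in the curve is the usual way to read off the elbow.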
An Optimization Algorithm
NMF is essentially an optimization algorithm because it seeks two lower-rank nonnegative matrices whose product best approximates the original matrix. This involves minimizing a reconstruction error, typically measured with a loss function such as the Frobenius norm or the Kullback-Leibler divergence, through iterative updates of the factor matrices. SAS Viya implements two optimization methods for performing NMF: Alternating Proximal Gradient (APG) and Compressed Alternating Proximal Gradient (CAPG) using Random Projections.
How is Data Prepared for the NMF Procedure?
The NMF procedure performs nonnegative matrix factorization in SAS Viya and requires input data to follow a specific structure. Sparse matrices typically contain a large number of zero values. To save memory, such data is commonly stored in the COO (Coordinate List) format, which records only the non-zero entries using triplets of the form (row, column, value). This compact representation significantly reduces storage requirements compared to storing the full matrix. For example, in text data, rows may represent terms, columns represent documents, and the values are term counts. Similarly, in recommender systems, rows represent users, columns represent items (e.g., movies), and values represent ratings.
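The COO layout described above is easy to see with SciPy's sparse-matrix support; the term/document IDs and counts below are purely hypothetical.

```python
from scipy.sparse import coo_matrix

# Hypothetical term-document counts as (row, column, value) triplets.
rows = [0, 0, 2, 3]    # term IDs
cols = [1, 3, 0, 2]    # document IDs
vals = [5, 2, 1, 4]    # counts

X = coo_matrix((vals, (rows, cols)), shape=(4, 4))
print(X.nnz)           # only the 4 nonzero entries are stored,
                       # not all 16 cells of the 4x4 matrix
```

The storage saving grows with sparsity: a matrix that is 99% zeros needs only about 1% of the triplets that a dense grid would store as cells.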
However, the NMF procedure does not support input in COO format. It requires input as a dense matrix, where all values (including zeros) are stored in a contiguous two-dimensional array. In this grid-like representation, values such as counts or ratings are stored across multiple columns within each row.
Therefore, data stored in COO format must first be converted to dense format before applying NMF. This conversion can be done in several ways: with a SAS DATA step for small datasets, with the PYTHON procedure to run Python code within SAS, with the FEDSQL procedure, which supports high-performance, ANSI SQL:1999-compliant queries across diverse data sources, and more.
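As one example of the COO-to-dense conversion, a pandas pivot (the kind of code that could run inside the PYTHON procedure) turns long-form triplets into the grid-like layout the NMF procedure expects; the users, items, and ratings below are hypothetical.

```python
import pandas as pd

# Hypothetical ratings in COO-style "long" form: one (user, item, rating) per row.
long_df = pd.DataFrame({
    "user":   ["u1", "u1", "u2"],
    "item":   ["m1", "m2", "m1"],
    "rating": [4, 5, 3],
})

# Pivot to a dense user-by-item grid; unobserved pairs are filled with 0.
dense = long_df.pivot(index="user", columns="item", values="rating").fillna(0)
print(dense)
```

The row index and column labels of the pivoted table play the role of the row and column metadata discussed next, so it is worth saving them alongside the dense values.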
Additionally, from an implementation perspective, working solely with IDs can be challenging. It is often useful to maintain row and column labels, which can be stored separately in row metadata and column metadata files to aid interpretation and downstream analysis.
The upcoming posts in this series — Nonnegative Matrix Factorization (Part 2): Discovering Topics from Documents and Nonnegative Matrix Factorization (Part 3): Making Recommendations Using Matrix Completion — will demonstrate the implementation of PROC NMF for topic modeling on text data and for building a recommender system using user-item ratings, respectively.
Concluding Remarks
Nonnegative Matrix Factorization (NMF) offers interpretable, parts-based representations by enforcing nonnegativity, often making it more intuitive than methods like PCA or SVD. It effectively reduces dimensionality while preserving structure, making it valuable in fields like text mining, audio separation, and bioinformatics. Although the factorization itself is linear, the nonnegativity constraint encourages sparse, local patterns, which aids clustering. However, NMF has limitations: solutions are not unique, computations can be intensive, the optimization may get stuck in local minima, and results are sensitive to initialization. Choosing the right rank r also requires care, often calling for domain expertise or validation.
Find more articles from SAS Global Enablement and Learning here.