In the previous post, Nonnegative Matrix Factorization (Part 1): Understanding Its Importance and Applications, we explored several applications of NMF across different domains, including image processing, topic modeling, gene analysis, recommender systems, spectral data analysis, and more. Now, it's time to dive in and apply what we’ve discussed. In this post, we'll work with text data to perform topic modeling, uncovering hidden themes within documents.
In natural language understanding (NLU) tasks, meaning can be extracted at multiple levels, ranging from words and sentences to paragraphs and entire documents. At the document level, one of the most effective ways to interpret text is by analyzing its topics. The process of identifying, learning, and extracting these topics across a collection of documents is known as topic modeling. Conceptually, we decompose the term-document matrix X into two matrices: the first matrix, W, represents each topic and the terms associated with it; the second matrix, H, represents each document and the topics it contains.
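Before moving on, it helps to see the factorization itself. This is the standard NMF formulation, stated here for orientation: if X is the m × n term-by-document matrix (m terms, n documents) and k is the chosen number of topics, then

$$X \approx WH, \qquad X \in \mathbb{R}_{\ge 0}^{m \times n},\quad W \in \mathbb{R}_{\ge 0}^{m \times k},\quad H \in \mathbb{R}_{\ge 0}^{k \times n},$$

where column j of W holds the term weights that define topic j, and column i of H holds the topic weights of document i. NMF typically finds W and H by minimizing the reconstruction error $\lVert X - WH \rVert_F^2$ subject to the nonnegativity constraints.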
Topic modeling helps in multiple ways: by automating topic discovery, it saves time, improves decision-making, and enhances content analytics. Here we illustrate how you can use the NMF procedure to discover topics (main themes) from the contents of a collection of documents.
We start with a simple text data table. Each row represents a document and includes two key variables: one for the document ID and another for the actual text content.
Next, we parse the text using PROC TEXTMINE, which helps break down the language with built-in NLP features like tokenization, stemming, part-of-speech tagging, and more. This step transforms our raw text into a term-by-document matrix — essentially mapping which words appear in which documents.
Once the term-by-document matrix is obtained, we convert it from sparse coordinate (COO) format into dense format to make it ready for analysis.
Finally, we run PROC NMF on this dense matrix to uncover the main topics or themes across the documents. This helps us understand the underlying structure and key ideas within the collection.
Use Case Example
A synthetic text dataset was generated with ChatGPT, a generative AI large language model, focusing on concerns raised by environmentalists about environmental sustainability and global warming.
The following DATA step creates the data table mylib.environData in your CAS session. The table contains 50 observations and two variables: Text, which contains the input documents, and DocID, which contains the document IDs. Each row in the data table represents a one-line document for analysis.
cas;
libname mylib cas;
data mylib.environData;
   infile datalines delimiter='|' missover;
   length Text $200;
   input DocID Text $;
   datalines;
1 | Global temperatures are rising at an unprecedented rate due to greenhouse gas emissions.
2 | Climate change is causing more frequent and severe heatwaves.
3 | Melting glaciers and polar ice caps are accelerating global warming feedback loops.
...
49 | Pollution from agriculture is causing dead zones in aquatic systems.
50 | Resource scarcity is creating geopolitical tensions and inequality.
;
run;
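As a quick optional check (not part of the original workflow), you can print the first few rows to confirm that the table loaded as expected:

proc print data=mylib.environData(obs=5);  /* display the first five documents */
run;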
To perform topic discovery using PROC NMF, we first generate a term-by-document matrix from the mylib.environData data table using PROC TEXTMINE.
Many text mining applications benefit from the use of a stop list, which contains a set of commonly used terms in a language. These terms are typically removed because they are often uninformative, can introduce noise, and may increase memory usage. The following DATA step creates the stop list that PROC TEXTMINE uses to eliminate noisy, noninformative terms:
data mylib.en_stopList;
   length Term $16;
   input Term $ @@;   /* trailing @@ holds the line so that several terms are read per line */
   datalines;
about and are as between for from
in is of or than the this to with
;
run;
The following statements invoke PROC TEXTMINE to run on the mylib.environData data table and specify that all terms in the input document collection, except for those on the stop list, are to be kept for generating the term-by-document matrix. The summary information about the terms in the document collection is stored in a data table named mylib.terms. The term-by-document matrix is stored in a data table named mylib.termDoc.
proc textmine data=mylib.environData;
   doc_id DocID;
   variables Text;
   parse stop      = mylib.en_stopList
         outterms  = mylib.terms
         outparent = mylib.termDoc
         reducef   = 1;
run;
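Before building the dense matrix, it can be useful to glance at the parsed vocabulary. Here is a minimal sketch, assuming the mylib.terms table includes the Term, Freq, and Keep columns that PROC TEXTMINE documents for its OUTTERMS= output:

proc sql outobs=10;
   /* list the ten most frequent terms that survived the stop list */
   select Term, Freq
      from mylib.terms
      where upcase(Keep) = 'Y'
      order by Freq desc;
quit;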
The following statements convert the mylib.termDoc data table, which is stored in COO format, to a data table named mylib.termdocDense, which is stored in dense format. The mylib.termdocDense data table contains 52 columns: ID (the ID of each term), Term, and DOC1, DOC2, ..., DOC50 (which correspond to the 50 documents). Each row in the data table contains the counts of one term in every document.
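To make the two formats concrete, here is a tiny made-up illustration (not taken from the actual data). COO format stores only the nonzero cells as (row, column, value) triples, while the dense format materializes every cell:

$$\{(1,2,3),\ (2,1,4)\} \;\longrightarrow\; \begin{pmatrix} 0 & 3 \\ 4 & 0 \end{pmatrix}$$

In mylib.termDoc, those triples are the _termnum_, _document_, and _count_ columns, and the DATA steps below pivot them into one dense row per term.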
data terms;                 /* lookup of term key -> term name */
   set mylib.terms;
   keep Term Key;
   rename Key=_termnum_;
run;

proc sort data=terms;
   by _termnum_;
run;

data termdoc;               /* copy the COO triples to the WORK library */
   set mylib.termdoc;
run;

proc sort data=termdoc;
   by _termnum_ _document_;
run;

data termdocMerge;          /* attach term names and drop empty cells */
   merge terms termdoc;
   by _termnum_;
   if missing(_count_) then delete;
run;
proc sql noprint;
   select max(_document_) into :max_docid   /* highest document ID */
      from termdocMerge;
quit;

/* %LET strips the leading blanks that INTO : leaves in the value */
%let maxid = &max_docid;
data mylib.termdocDense;
   length id 8;
   array docs{&maxid} doc1 - doc&maxid;
   retain doc1 - doc&maxid;
   set termdocMerge;
   by _termnum_;
   id = _termnum_;
   /* start each term with a row of zeros */
   if First._termnum_ then do;
      do i = 1 to &maxid;
         docs{i} = 0;
      end;
   end;
   /* record the count of this term in this document */
   docs{_document_} = _count_;
   /* output one dense row per term */
   if Last._termnum_;
   drop i _termnum_ _document_ _count_;
run;
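If you want to verify the result before factorizing (an optional check), PROC CONTENTS confirms that the table has the expected id, Term, and doc1-doc50 columns:

proc contents data=mylib.termdocDense;  /* expect 52 columns in total */
run;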
We specify RANK=5 in PROC NMF to derive five topics from the mylib.termdocDense data table. The following statements invoke the NMF procedure to run on this data table and output the factor matrices W and H to the output data tables mylib.W and mylib.H, respectively:
proc nmf data=mylib.termdocDense seed=789 rank=5 outh=mylib.H;
   var doc1-doc&maxid;
   output out=mylib.W comp=Topic copyvar=Term;
run;
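With RANK=5 and 50 documents, the shapes of the factor matrices follow directly. Writing t for the number of kept terms:

$$\underbrace{X}_{t \times 50} \;\approx\; \underbrace{W}_{t \times 5}\,\underbrace{H}_{5 \times 50}$$

Each of the five columns of W (Topic1 through Topic5 in mylib.W) is a topic vector over the terms, and each column of H gives one document's membership weights across the five topics.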
The Model Information table displays basic information about the model, including the input data table, the number of VAR statement variables, the target rank of the factor matrices, the factorization method, and the stopping criterion used in the computation. It also displays details of the factorization method: the maximum number of iterations, the number of matrix updates at each iteration, the convergence tolerance, the random number seed, whether the input data are scaled, how missing values are handled, and the coefficient of the extrapolation weight.
The Iteration Results table displays the matrix factorization accuracy information, which includes the number of iterations, the relative error at which the iteration stops, and the stopping criterion. The table also displays the sparsity of the factor matrices W and H. The sparsity of a matrix is the proportion of the number of zero-valued elements to the total number of elements in the matrix.
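Written as a formula, for an m × n matrix A with entries $a_{ij}$:

$$\mathrm{sparsity}(A) \;=\; \frac{\lvert \{ (i,j) : a_{ij} = 0 \} \rvert}{mn}$$

A sparsity of 0.8 for W, for example, would mean that 80% of its cells are zero, so each topic is characterized by a relatively small set of terms.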
The Output CAS Tables table displays information about each CAS table that is created during a CAS action execution: the CAS table name, the caslib in which the table resides, and the numbers of columns and rows in the table. Because the mylib.termdocDense data table is treated as a term-by-document matrix, W is considered a features matrix, where each column of W is a feature (topic) vector, and H is considered a weights matrix, where each column of H contains the topic membership values of a document. Using the output data table mylib.W, you can obtain the most important terms (that is, the terms that have the largest cell values) in each topic.
The following PROC CAS statements use the output data table mylib.W and the table.fetch action to generate the results table "Topics," which contains the five discovered topics with the top 10 terms (sorted by descending cell values) to characterize each topic.
proc cas;
   topic=${Topic1-Topic5};                     /* topic columns in mylib.W */
   cols='_Index_' + topic;
   coltypes=${integer, varchar, varchar, varchar, varchar, varchar};
   Topics=newtable('Topics', cols, coltypes);  /* empty results table */
   t={};
   do i=1 to 5;
      /* fetch the 10 terms with the largest weights for topic i */
      table.fetch result=r /
         table='W'
         fetchVars={'Term'}
         sortby={{name=topic[i] order='descending'}}
         to=10;
      t[i]=r.Fetch[, 'Term'];
   end;
   row={};
   do j=1 to 10;
      row[1]=j;
      do i=1 to 5;
         row[i+1]=t[i][j];
      end;
      addrow(Topics, row);                     /* jth-ranked term of each topic */
   end;
   print Topics;
run;
quit;
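If you also want the document-side view, the same idea applies to mylib.H. Here is a minimal sketch, under the assumption that mylib.H stores one row per topic with the doc1-doc50 columns named in the VAR statement:

proc print data=mylib.H;  /* assumes one row per topic, columns doc1-doc50 */
   var doc1-doc5;         /* topic-membership weights of the first five documents */
run;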
Analyzing the 10 most significant terms in each topic reveals the following key themes:
Topic 1: The unprecedented rise in sea levels due to greenhouse gas emissions.
Topic 2: The ecological disruption and species imbalance caused by climate change and pollution.
Topic 3: The role of atmospheric carbon dioxide in trapping heat and warming the Earth’s atmosphere.
Topic 4: The global waste crisis fueled by heavy reliance on single-use materials.
Topic 5: The acceleration of polar ice cap melting due to fossil fuel consumption and the resulting global warming feedback loop.
These represent five of the most pressing issues in environmental sustainability and climate change today.
In the next post, Nonnegative Matrix Factorization (Part 3): Making Recommendations Using Matrix Completion, we'll switch gears to user-item ratings data and build a recommender system that makes personalized recommendations. So, stay tuned!
Find more articles from SAS Global Enablement and Learning here.