
How to create the Term-Document Frequency Matrix using SAS Text miner?

ama

The matrix in which terms are rows and documents are columns is known as the term-document frequency matrix. I can use the Text Miner node of Enterprise Miner to create the term frequency table. Could you please tell me how to create the term-document frequency matrix using SAS Text Miner? Thank you very much!

Accepted Solutions
Solution
Thursday
SAS Employee
Posts: 30

Re: How to create the Term-Document Frequency Matrix using SAS Text miner?

In the Text Miner node, you can get the document-by-term table (as opposed to the term-by-document table) by choosing the following settings:
1. Turn off the SVD computation.
2. Turn on the roll-up terms option.
3. Set the number of rolled-up terms to at least the number of terms you have.
4. Finally, set both the term weight and the frequency weight settings to None.

The output document table will then contain one variable for each kept term in the collection. Each entry is the frequency of that term in the corresponding document.

Note that large collections can easily contain hundreds of thousands of distinct terms, and expanding the table in this way is not recommended in those situations.
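
If you need terms as rows and documents as columns, one option is to flip the document-by-term table with PROC TRANSPOSE. The following is only a minimal sketch on toy data: the data set names doc_by_term and term_by_doc and the variables doc_id, apple, banana, and cherry are hypothetical stand-ins, not names produced by the node.

```sas
/* Toy document-by-term table: one row per document, one column per term. */
/* All names here are hypothetical and only illustrate the layout.        */
data doc_by_term;
   input doc_id $ apple banana cherry;
   datalines;
D1 2 0 1
D2 0 3 0
D3 1 1 4
;
run;

/* Flip rows and columns: each term becomes a row, each document a column. */
proc transpose data=doc_by_term
               out=term_by_doc(rename=(_name_=term));
   id doc_id;                 /* document IDs become the new column names  */
   var apple banana cherry;   /* term-frequency variables to turn into rows */
run;

proc print data=term_by_doc noobs;
run;
```

In this toy case the row for banana holds 0, 3, and 1 under D1, D2, and D3. The ID values must be unique and usable as SAS variable names, and the result has one column per document, so the same size caution applies to collections with many documents.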

Russ



All Replies
SAS Employee
Posts: 6

Re: How to create the Term-Document Frequency Matrix using SAS Text miner?

Because SAS Text Miner has changed considerably since version 5.1, the following response applies to the newer releases (version 12.1 and later).

You can use the SAS code below in a SAS Code node after a Text Filter node to create a term-by-document data set.

******************************************************************************************
*                                                                                        *
*   Program:  TextFilter_create_term_by_doc_matrix                                       *
*   Author:   Ann Kuo                                                                    *
*   Date:     08/15/2017                                                                 *
*   Purpose:  Combine the terms and documents table to make                              *
*             the true (not sparse) term by document matrix from a Text Filter node      *
*   Note:     The output data set TextFilter<n>_termbydocmatrix contains                 *
*             _TERMNUM_, _DOCUMENT_, _COUNT_, WEIGHT, TF_IDF where                       *
*             _COUNT_ is the number of times a term occurs in a document and             *
*             TF_IDF represents the TF-IDF (_COUNT_ * WEIGHT)                            *
*                                                                                        *
*             Place this code in a SAS Code node that follows your Text Filter node.     *
*             Open the Code Editor window and enter the code in the Training Code        *
*             section.  Save the code, exit the Code Editor window, and run the          *
*             SAS Code node.  If it runs successfully, the                               *
*             textfilter<n>_termbydocmatrix.sas7bdat data set can be found in the        *
*             corresponding Enterprise Miner project Workspaces folder.                  *
******************************************************************************************;


/*-----------------------------------------------------------------------------------------
SAS INSTITUTE INC. IS PROVIDING YOU WITH THE COMPUTER 
SOFTWARE CODE INCLUDED WITH THIS AGREEMENT ("CODE") ON AN "AS IS" BASIS, AND AUTHORIZES YOU 
TO USE THE CODE SUBJECT TO THE TERMS HEREOF.  BY USING THE CODE, YOU AGREE TO THESE TERMS.  
YOUR USE OF THE CODE IS AT YOUR OWN RISK.  SAS INSTITUTE INC. MAKES NO REPRESENTATION OR 
WARRANTY, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, 
FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT AND TITLE, WITH RESPECT TO THE CODE.

The Code is intended to be used solely as part of a product ("Software") you currently have 
licensed from SAS Institute Inc. or one of its subsidiaries or authorized agents ("SAS"). 
The Code is designed to either correct an error in the Software or to add functionality to 
the Software, but has not necessarily been tested.  Accordingly, SAS makes no representation
or warranty that the Code will operate error-free.  SAS is under no obligation to maintain
or support the Code.

Neither SAS nor its licensors shall be liable to you or any third party for any general, 
special, direct, indirect, consequential, incidental or other damages whatsoever arising 
out of or related to your use or inability to use the Code, even if SAS has been advised of
the possibility of such damages.

Except as otherwise provided above, the Code is governed by the same agreement that governs 
the Software.  If you do not have an existing agreement with SAS governing the Software, 
you may not use the Code.
------------------------------------------------------------------------------------------*/


/* The _tmout data set is a transposed version of the document by term matrix that is created 
   by the Text Filter node.  The variable "_termnum_" in the _tmout table is equivalent to 
   the variable "key" in the _terms table.  */

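data emterms2;
   set &em_lib..&em_metasource_nodeid._terms(where= (_ISPAR eq '+')); /* keep parent term records only, filtering out child term records */
   _termnum_=key;
   keep term key weight _termnum_;
run;

*sort the terms by id so that the merge by _termnum_ below is valid;
proc sort data=emterms2;
   by _termnum_;
run;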

*sort the term counts by id so that we can merge to get term values;
proc sort data=&em_lib..&em_metasource_nodeid._tmout out=whichdoc;
   by _termnum_;
run;

*attach term counts with term values;
data identifydoc;
   merge emterms2(in=takethese) whichdoc(in=indocs);
   by _termnum_;
   if takethese & indocs;
   keep _document_ _termnum_ weight term;
run;

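*now attach terms to document data set - must sort both by _document_ and _termnum_;
proc sort data=identifydoc out=subsetdocs /* nodupkey */;
   by _document_ _termnum_;
run;
proc sort data=&em_lib..&em_metasource_nodeid._tmout out=srtdocs;
   by _document_ _termnum_;
run;

/* merge the two data sets above, compute the TF-IDF, and save the result to textfilter<N>_termByDocMatrix in Workspaces.
   Both data sets have many rows per document, so _termnum_ is included in the BY statement; merging by _document_
   alone would pair rows by position.  Rows for terms not kept in emterms2 get a missing WEIGHT and TF_IDF. */

data &em_lib..&em_metasource_nodeid._termByDocMatrix;
   merge srtdocs subsetdocs;
   by _document_ _termnum_;
   TF_IDF = _count_ * weight;
run;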

Here are the steps:

  1. Create a new SAS Code node and connect it after a Text Filter node.
  2. Open the Code Editor window of the SAS Code node.
  3. Enter the SAS code above in the Training Code section.
  4. Save the code and then exit from the Code Editor window.
  5. Run the SAS Code node.
  6. Once the SAS Code node runs successfully, the textfilter<n>_termbydocmatrix data set can be found in the corresponding Enterprise Miner project Workspaces folder.
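
The resulting data set is in long form, with one row per term and document pair. If you want a layout with terms as rows and documents as columns, one possible sketch pivots it with PROC TRANSPOSE. The toy data set long_tbd below is a hypothetical stand-in that only mimics the _TERMNUM_, _DOCUMENT_, and _COUNT_ columns; the names long_tbd and wide_tbd and the doc_ prefix are assumptions for illustration.

```sas
/* Toy long-form table: one row per (term, document) pair that occurs. */
/* long_tbd is a hypothetical stand-in for the real output data set.   */
data long_tbd;
   input _termnum_ _document_ _count_;
   datalines;
1 1 2
1 3 1
2 2 3
2 3 1
3 1 1
3 3 4
;
run;

/* BY-group processing in PROC TRANSPOSE needs the data sorted by term. */
proc sort data=long_tbd;
   by _termnum_ _document_;
run;

/* One output row per term; one column per document (doc_1, doc_2, ...). */
proc transpose data=long_tbd out=wide_tbd(drop=_name_) prefix=doc_;
   by _termnum_;
   id _document_;
   var _count_;
run;

/* A term that never occurs in a document is missing after the pivot; */
/* replace those missing values with a frequency of 0.                */
data wide_tbd;
   set wide_tbd;
   array d{*} doc_:;
   do i = 1 to dim(d);
      if d{i} = . then d{i} = 0;
   end;
   drop i;
run;
```

The document columns appear in the order in which PROC TRANSPOSE first encounters each document ID, so they are not necessarily in numeric order. As with the roll-up approach, a wide table with one column per document can become unwieldy for large collections.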

Hope this helps!

 

Ann
