******************************************************************************************
*                                                                                        *
* Program: TextFilter_create_term_by_doc_matrix                                          *
* Author: Ann Kuo                                                                        *
* Date: 08/17/2017                                                                       *
* Purpose: Combine the terms and documents table to make                                 *
* the true (not sparse) term by document matrix from a Text Filter node                  *
* Note: The output data set TextFilter_termbydocmatrix contains                          *
* _TERMNUM_, _DOCUMENT_, _COUNT_, WEIGHT, TF_IDF where                                   *
* _COUNT_ represents the frequency with which a term occurs in a document and            *
* TF_IDF represents the TF-IDF (freq*weight)                                             *
*                                                                                        *
* You can use the following SAS code in a SAS Code node after your Text                  *
* Filter node to create a term-by-document data set                                      *
* Enter the following code in the Training Code section after you open the               *
* Code Editor window:                                                                    *
*                                                                                        *
* After you enter the code above, save the code, and exit from the                       *
* Code Editor window. Run the SAS Code node. If it runs successfully, the                *
* textfilter_termbydocmatrix.sas7bdat data set can be found in the                       *
* corresponding Enterprise Miner project Workspaces folder.                              *
******************************************************************************************;

/*-----------------------------------------------------------------------------------------
Please find the SAS code that creates

SAS INSTITUTE INC. IS PROVIDING YOU WITH THE COMPUTER SOFTWARE CODE INCLUDED WITH THIS
AGREEMENT ("CODE") ON AN "AS IS" BASIS, AND AUTHORIZES YOU TO USE THE CODE SUBJECT TO THE
TERMS HEREOF. BY USING THE CODE, YOU AGREE TO THESE TERMS. YOUR USE OF THE CODE IS AT YOUR
OWN RISK. SAS INSTITUTE INC. MAKES NO REPRESENTATION OR WARRANTY, EXPRESS OR IMPLIED,
INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NONINFRINGEMENT AND TITLE, WITH RESPECT TO THE CODE.

The Code is intended to be used solely as part of a product ("Software") you currently
have licensed from SAS Institute Inc. or one of its subsidiaries or authorized agents
("SAS").
The Code is designed to either correct an error in the Software or to add functionality
to the Software, but has not necessarily been tested. Accordingly, SAS makes no
representation or warranty that the Code will operate error-free. SAS is under no
obligation to maintain or support the Code.

Neither SAS nor its licensors shall be liable to you or any third party for any general,
special, direct, indirect, consequential, incidental or other damages whatsoever arising
out of or related to your use or inability to use the Code, even if SAS has been advised
of the possibility of such damages.

Except as otherwise provided above, the Code is governed by the same agreement that
governs the Software. If you do not have an existing agreement with SAS governing the
Software, you may not use the Code.
------------------------------------------------------------------------------------------*/

/* The _tmout data set is a transposed version of the document by term matrix that is
   created by the Text Filter node. The variable "_termnum_" in the _tmout table is
   equivalent to the variable "key" in the _terms table.
*/

data emterms2;
  set &em_lib..&em_metasource_nodeid._terms(where=(_ISPAR eq '+')); /* filter out child term records */
  _termnum_ = key;
  keep term key weight _termnum_;
run;

* make sure the terms are ordered by _termnum_ before the merge below;
proc sort data=emterms2;
  by _termnum_;
run;

* sort the term counts by id so that we can merge to get term values;
proc sort data=&em_lib..&em_metasource_nodeid._tmout out=whichdoc;
  by _termnum_;
run;

* attach term counts with term values;
data identifydoc;
  merge emterms2(in=takethese) whichdoc(in=indocs);
  by _termnum_;
  if takethese & indocs;
  keep _document_ _termnum_ weight term;
run;

* now attach terms to document data set - must sort by _document_ and _termnum_;
proc sort data=identifydoc out=subsetdocs /* nodupkey */;
  by _document_ _termnum_;
run;

proc sort data=&em_lib..&em_metasource_nodeid._tmout out=srtdocs;
  by _document_ _termnum_;
run;

/* merge the two data sets above, compute the TF-IDF, and save the result
   to textfilter_termByDocMatrix in Workspaces.
   The merge is keyed on both _document_ and _termnum_ so that each count is paired
   with the weight of its own term. Merging by _document_ alone would pair rows
   by position within each document. */
data &em_lib..&em_metasource_nodeid._termByDocMatrix;
  merge srtdocs(in=incounts) subsetdocs(in=inweights);
  by _document_ _termnum_;
  if incounts and inweights; /* keep only terms that passed the parent-term filter */
  TF_IDF = _count_ * weight;
run;