bachfan
Calcite | Level 5

I am using SAS EM 12.1. I would like to get the term frequency-inverse document frequency (TF-IDF) table from the Text Parser.

I can see that the exported transaction dataset is that table. It has three columns: a term index, a document index, and the weight for that term in that document. However, each term is represented as an index, not the actual word.

Is there a lookup table that maps each term's index to the actual word?

JamesCoxPhD
SAS Employee


You are exactly right: the transaction table is the TF-IDF table. If you want to see it as term|role combinations, you can do something like the following in a code node (this assumes you are on the first diagram and using the first Text Filter node on that diagram):

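/* Library and node name of the Text Filter node (first diagram, first filter node) */
%let filternode_name=emws1.textfilternode;
%let viewname=<whatever data set you want to create>;

proc sql noprint;
   /* Replace each term index with a readable term|role label */
   create view &viewname as
      select ktrim(term) || '|' || role as _item_, b.*
      from &filternode_name._term_strings as a,
           &filternode_name._out_parent as b
      where b._termnum_ = a.key;
quit;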
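
For readers who want to try the join outside Enterprise Miner, here is a minimal sketch on toy data. It mimics the lookup above and then pivots the sparse table into one row per document. The table names in WORK, the column names _document_ and _count_, and all the weights are assumptions made up for illustration; only key, term, role and _termnum_ come from the code above.

```sas
/* Toy stand-in for the <node>_term_strings lookup table (made-up terms) */
data work.term_strings;
   length term $32 role $16;
   input key term $ role $;
   datalines;
1 bank Noun
2 loan Noun
3 approve Verb
;
run;

/* Toy stand-in for the sparse transaction table.
   _document_ and _count_ are assumed column names; weights are invented. */
data work.out_parent;
   input _termnum_ _document_ _count_;
   datalines;
1 1 0.41
2 1 0.29
2 2 0.29
3 2 1.10
;
run;

proc sql noprint;
   /* Same join as above: swap the term index for a term|role label */
   create table work.labeled as
      select ktrim(a.term) || '|' || a.role as _item_,
             b._document_, b._count_
      from work.term_strings as a, work.out_parent as b
      where b._termnum_ = a.key
      order by b._document_, _item_;
quit;

/* Pivot to a document-by-term layout, one column per term|role label */
proc transpose data=work.labeled out=work.tfidf_wide(drop=_name_);
   by _document_;
   id _item_;
   var _count_;
run;
```

Term/document pairs that do not occur in the sparse table come out as missing values in the wide table rather than zeros, so they may need to be recoded before modeling.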

bachfan
Calcite | Level 5

Great. I used PROC CONTENTS and found many more datasets.

But I have a follow-up question. I got great results using an SVM model for classification, with the TF-IDF matrix as the input variables.

Now I need a scoring dataset that goes through the same parsing and filtering. But I do not see a way to get the TF-IDF matrix for the score dataset, which the SVM would then use, because only one transaction dataset comes out of the Text Filter node. Is this doable?

[Attached image: tfidf.jpg]

JamesCoxPhD
SAS Employee

The <nodename>_validout and <nodename>_testout tables contain the TF-IDF weightings for the validation and test sets, respectively.
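
One way to locate those tables is to query the SAS dictionary views, sketched below. It assumes the workspace library is EMWS1, as in the earlier code; the library name may differ on other diagrams or versions, and the query returns no rows if no matching tables exist.

```sas
proc sql;
   /* List tables in the assumed EMWS1 library whose names end in
      VALIDOUT or TESTOUT, with their row counts */
   select memname, nobs
   from dictionary.tables
   where libname = 'EMWS1'
     and (memname like '%VALIDOUT' or memname like '%TESTOUT');
quit;
```

Member names are stored in upper case in dictionary.tables, which is why the patterns are upper case.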

ironfrown
Calcite | Level 5

None of the data sets seem to be the same in SAS EM 13.1. Any hints on where the links between the nodes are? I can see several possibilities, but I have never worked with SAS at this level.

 

Jacob

ironfrown
Calcite | Level 5

Actually, the answer was in the tiny picture attached to one of the previous messages. The TF-IDF matrix, in its sparse representation, can be found in the TRANSACTION data set returned from the Text Filter node, provided the weights have been set to TF-IDF.

Jacob
