I am using SAS EM 12.1. I would like to get the term frequency-inverse document frequency (TF-IDF) table from the Text Parser.
I can see that the exported transaction data set is that table. It has three columns: a term index column, a document index column, and the weight for that term in that document. However, each term is represented as an index, not the actual word.
Is there a lookup table that maps each term's index to the actual word?
You are exactly right: the transaction table is the TF-IDF table. If you want to see it as term|role combinations, you can do something like the following in SAS code or in a SAS Code node (this assumes that you are on the first diagram and using the first Text Filter node on that diagram):
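/* Libref and node ID of the Text Filter node (first diagram, first Text Filter node) */
%let filternode_name=emws1.textfilternode;
%let viewname=<whatever data set you want to create>;
proc sql noprint;
   /* Attach the term|role string to each row of the node's output table by term number */
   create view &viewname as
   select ktrim(term) || '|' || role as _item_, b.*
   from &filternode_name._term_strings as a, &filternode_name._out_parent as b
   where b._termnum_=a.key;
quit;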
Great. I ran PROC CONTENTS and found many more data sets.
But I have a follow-up question. I got great results using an SVM model for classification, with the TF-IDF matrix as the input variables.
Now I need to score a new data set, which will go through the same parsing and filtering. But I do not see a way to get the TF-IDF matrix for the score data set, which the SVM would then use, because only one transaction data set comes out of the Text Filter node. Is this doable?
The <nodename>_validout and <nodename>_testout tables contain the TF-IDF weightings for the validation and test sets, respectively.
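As a rough sketch of how the earlier lookup could be applied to the validation table: the code below reuses the emws1.textfilternode name from the previous reply and assumes that the _validout table carries a _termnum_ column just like _out_parent does. That column assumption and the view name work.valid_terms are illustrative only, so check the table with PROC CONTENTS first.

```sas
/* Assumed: same libref and node ID as in the earlier reply */
%let filternode_name=emws1.textfilternode;

proc sql noprint;
   /* work.valid_terms is a hypothetical view name */
   create view work.valid_terms as
   select ktrim(a.term) || '|' || a.role as _item_, b.*
   from &filternode_name._term_strings as a,
        &filternode_name._validout as b
   /* Assumption: _validout uses _termnum_ as its term key */
   where b._termnum_ = a.key;
quit;
```

Replacing _validout with _testout should give the corresponding view for the test set, under the same assumption.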
None of the data sets seem to be the same in SAS EM 13.1. Any hints on where the links between the nodes are? I can see several possibilities, but I have never worked with SAS at this level.
Jacob
Actually, the answer was in the tiny picture attached to one of the previous messages. The TF-IDF matrix, in its sparse representation, can be found in the TRANSACTION data set returned from the Text Filter node, provided the weights have been set to TF-IDF. Jacob
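To illustrate what "sparse representation" means here, the following minimal sketch builds a toy transaction table and pivots it into a dense document-by-term matrix. The data are made up, and the column names _document_ and _count_ are assumptions (only _termnum_ appears earlier in this thread), so adjust them to whatever PROC CONTENTS shows for your transaction data set.

```sas
/* Toy sparse table: one row per (term, document) pair with a weight. */
/* Column names _document_ and _count_ are assumed, not confirmed.   */
data work.trans;
   input _termnum_ _document_ _count_;
   datalines;
1 1 0.5
2 1 1.2
1 2 0.7
3 2 0.9
;
run;

/* PROC TRANSPOSE with a BY statement needs the data sorted by document */
proc sort data=work.trans;
   by _document_ _termnum_;
run;

/* One row per document, one column per term (term_1, term_2, term_3) */
proc transpose data=work.trans out=work.dense(drop=_name_) prefix=term_;
   by _document_;
   id _termnum_;
   var _count_;
run;

/* Pairs absent from the sparse table come out missing; set them to 0 */
data work.dense;
   set work.dense;
   array t{*} term_:;
   do i = 1 to dim(t);
      if missing(t{i}) then t{i} = 0;
   end;
   drop i;
run;
```

With a realistic vocabulary the dense form can have thousands of columns, which is presumably why the node keeps the matrix in the three-column sparse layout.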