09-16-2014 03:28 PM
I am currently using SAS EM 12.1. I would like to get the term frequency-inverse document frequency (TF-IDF) table from the Text Parser.
I can see that the exported transaction dataset is that table. It has three columns: a term index, a document index, and the weight of that term in that document. However, each term is represented by an index, not by the actual word.
Is there a lookup table that maps each term's index to the actual word?
09-16-2014 03:46 PM
You are exactly right: the transaction table is the TF-IDF table. If you want to see it with term|role combinations instead of term indexes, you can do something like the following, either in your own code or in a SAS Code node (this assumes the first diagram and the first Text Filter node on that diagram):
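%let viewname=<whatever data set you want to create>;
/* Assumption: EMWS1 is the first diagram's library and TextFilter is the first Text Filter node's ID; the original post does not show this line. */
%let filternode_name=EMWS1.TextFilter;
proc sql noprint;
  create view &viewname as
  select ktrim(term) || '|' || role as _item_, b.*
  from &filternode_name._term_strings as a, &filternode_name._out_parent as b
  /* Assumption: the join condition is cut off in the original post; key and _termnum_ are assumed column names, so check them with proc contents. */
  where a.key = b._termnum_;
quit;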
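For readers who want to see the logic of that lookup outside SAS, here is a minimal Python sketch. It is only an illustration: the sample rows are invented, and the column names `key`, `_termnum_`, `_document_` and `_count_` are assumptions about how the term table and the transaction table are laid out, so check them against your own data sets.

```python
# Hypothetical stand-in for the term table: one row per term index.
terms = [
    {"key": 1, "term": "engine ", "role": "Noun"},
    {"key": 2, "term": "fail", "role": "Verb"},
]

# Hypothetical stand-in for the transaction table: one row per
# (term index, document index) pair with its weight.
transactions = [
    {"_termnum_": 1, "_document_": 10, "_count_": 0.5},
    {"_termnum_": 2, "_document_": 10, "_count_": 1.25},
    {"_termnum_": 1, "_document_": 11, "_count_": 0.75},
]


def label_transactions(terms, transactions):
    """Attach a 'term|role' label to every transaction row."""
    # Build the index -> label lookup once, trimming trailing blanks
    # from the term much like ktrim() does in the SQL.
    lookup = {t["key"]: t["term"].strip() + "|" + t["role"] for t in terms}
    # Keep only rows whose term index has a label (inner-join behaviour).
    return [
        dict(row, _item_=lookup[row["_termnum_"]])
        for row in transactions
        if row["_termnum_"] in lookup
    ]


labeled = label_transactions(terms, transactions)
for row in labeled:
    print(row["_item_"], row["_document_"], row["_count_"])
```

Each output row keeps the document index and the weight, and gains a readable `_item_` label in place of the bare term index.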
09-17-2014 04:58 PM
Great. I ran proc contents and found many more datasets.
But I have a follow-up question. I got great results using an SVM model for classification, with the TF-IDF matrix as the input variables.
Now I need to score a new dataset, which will go through the same parsing and filtering. But I do not see a way to get the TF-IDF matrix for the score dataset so that it can then be fed to the SVM, because only one transaction dataset comes out of the Text Filter node. Is this doable?
09-16-2015 01:21 AM
None of the data sets seem to be the same in SAS EM 13.1. Any hints on where the links between the nodes are? I can see several possibilities, but I have never worked with SAS at this level.
09-16-2015 10:56 PM
Actually, the answer was in the tiny picture attached to one of the previous messages. The TF-IDF matrix, in its sparse representation, can be found in the TRANSACTION data set returned by the Text Filter node, provided the term weights have been set to TF-IDF. Jacob
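To make "sparse representation" concrete, here is a small Python sketch, not SAS code, that expands (term index, document index, weight) triples into a dense document-by-term matrix. The triples are invented sample values; any pair that is missing from the sparse table is treated as a weight of 0.

```python
def to_dense(triples):
    """Expand sparse (term, document, weight) triples into dense rows.

    Returns the sorted term indexes (the column order) and a dict that
    maps each document index to its list of weights.
    """
    term_ids = sorted({term for term, _, _ in triples})
    doc_ids = sorted({doc for _, doc, _ in triples})
    column = {term: i for i, term in enumerate(term_ids)}
    # Start every document with all zeros, then fill in the stored weights.
    rows = {doc: [0.0] * len(term_ids) for doc in doc_ids}
    for term, doc, weight in triples:
        rows[doc][column[term]] = weight
    return term_ids, rows


# Hypothetical sparse table: (term index, document index, weight).
sparse = [(1, 10, 0.5), (2, 10, 1.25), (1, 11, 0.75)]
term_ids, dense = to_dense(sparse)
print(term_ids)
for doc, weights in dense.items():
    print(doc, weights)
```

Only the nonzero weights are stored in the sparse form, which is why the transaction table has one row per term and document pair rather than one column per term.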