Hello everyone,
Is it possible to build a TF-IDF matrix in SAS Viya Model Studio? The Text Mining node only creates topic columns.
Is there a way to build the "classic" TF-IDF?
Thanks a lot!
Melanie
Hi Melanie,
I was reading through old messages and saw that yours had not been answered. I have used the approach below with good results at a customer. Of course, it depends on your text data and on the assumptions of your model, but for the general task of classifying a document into a class it works fine. You can generate TF-IDF / SVD features and use them in a classification model.
/* TF-IDF / SVD parameters */
%let min_freq_doc = 150;
%let min_freq_overall = 150;
%let numLabels = 10;
%let max_k = 50;
%let LimitTermQty = 750;
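/* Partition */
/* Note: the later steps read partitioned_data_train and partitioned_data_test, which this step does not create; they are assumed to be split out of partitioned_data separately */
proc partition data=&caslibname..BASE_DEFERIMENTO_PROC_SS samppct=15 samppct2=15 seed=8;
  by DEFERIMENTO;
  output out=&caslibname..partitioned_data copyvars=(_ALL_);
run;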
/* SVD */
proc textmine data=&caslibname..partitioned_data_train LANGUAGE=Portuguese;
  doc_id Document_ID;
  var texto;
  parse
    nonoungroups
    termwgt   = ENTROPY   /* term weight */
    cellwgt   = LOG       /* cell (frequency) weight */
    entities  = NONE
    reducef   = &min_freq_doc.
    /* start = RDMF.STARTLIST - if you have one */
    outparent = &caslibname..outparent_train
    outterms  = &caslibname..outterms_train
    outpos    = &caslibname..outpos_train
    outchild  = &caslibname..outchild_train
    outconfig = &caslibname..outconfig_train;
  select "PPOS" "DET" "PN" "N" / ignore;
  svd
    max_k     = &max_k.
    numlabels = &numLabels.
    outdocpro = &caslibname..outdocpro_train
    svdu      = &caslibname..svdu_train
    outtopics = &caslibname..outtopics_train;
  savestate rstore=&caslibname..outsvdmodel_train;
run;
/* Save to a permanent library so the test data set can be scored later */
data RDMF.outterms_train;
set &caslibname..outterms_train;
run;
data RDMF.outconfig_train;
set &caslibname..outconfig_train;
run;
data RDMF.svdu_train;
set &caslibname..svdu_train;
run;
/* SORT */
PROC SORT data = &caslibname..outterms_train out=&caslibname..outterms_train_aux NODUPKEY;
by Parent_id Term;
RUN;
/* Limit terms by their overall frequency across documents */
PROC SORT data=&caslibname..outterms_train_aux out=&caslibname..outterms_train_aux NODUPKEY;
BY Parent_id Term;
WHERE ((Term not in (""," ")) and (LENGTH(Term) > 3) and (LENGTH(Term) <= 25) and (Freq > &min_freq_overall.));
RUN;
/* RENAME */
DATA RDMF.outterms_train_aux;
SET &caslibname..outterms_train_aux (RENAME=(Parent_id=_TERMNUM_));
Term = compress(Term);
RUN;
PROC SORT DATA=RDMF.outterms_train_aux OUT=RDMF.outterms_train_aux;
BY _TERMNUM_;
RUN;
/* LIMIT THE NUMBER OF TERMS - for algorithms that are costly to run with a large number of features */
/* Note: despite its name, FREQ_TERM is the standard deviation (STD) of _COUNT_ per term, not a frequency; terms are ranked by it */
PROC SQL;
  CREATE TABLE outparent_train_aux AS
  SELECT _TERMNUM_, STD(_COUNT_) AS FREQ_TERM
  FROM &caslibname..outparent_train
  GROUP BY _TERMNUM_
  ORDER BY FREQ_TERM DESC;
QUIT;
DATA &caslibname..outparent_train_aux;
SET outparent_train_aux(obs=&LimitTermQty);
RUN;
PROC SORT DATA=&caslibname..outparent_train_aux;
BY _TERMNUM_;
RUN;
DATA RDMF.outparent_train_aux;
SET &caslibname..outparent_train_aux;
RUN;
DATA &caslibname..outparent_train_red;
MERGE &caslibname..outparent_train (IN=PAR) &caslibname..outparent_train_aux (IN=AUX KEEP=_TERMNUM_);
BY _TERMNUM_;
IF AUX;
RUN;
proc contents data=&caslibname..outparent_train_red;run;
data &caslibname..outterms_train_aux ;
set RDMF.outterms_train_aux ;
run;
/* Merge */
DATA RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
MERGE &caslibname..outparent_train_red (IN=PAR) &caslibname..outterms_train_aux (IN=TERMS KEEP=Term _TERMNUM_);
BY _TERMNUM_;
IF PAR;
RUN;
proc sort data=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX out=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX NODUPKEY;
BY _DOCUMENT_ _TERMNUM_;
RUN;
/* Add a prefix for the TF-IDF variables */
data RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
set RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
FORMAT DATASET $20.;
TF_IDF_ = 'TF_IDF_';
DATASET = 'Train';
run;
/*TRANSPOSE to get the matrix*/
proc transpose data = RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX
out = RDMF.MATRIZ_TERMOS_DOCUMENTOS PREFIX=TF_IDF_;
by _DOCUMENT_;
id _TERMNUM_;
var _COUNT_;
run;
/* Finally, score the test data set */
/* Score TF-IDF */
proc tmscore
data = &caslibname..partitioned_data_test
terms = &caslibname..outterms_train
config = &caslibname..outconfig_train
outparent = &caslibname..outparent_test
svdu = &caslibname..svdu_train
svddocpro = &caslibname..outdocpro_test;
doc_id numero_processo;
var texto;
run;
/* Repeat the steps above for the validation and test data sets, then join everything to use as input to your model */
Hello there, @BigRider
Thank you for sharing your code.
As I reviewed the SAS code provided, I couldn't locate the specific step where TF and IDF values are multiplied to calculate the TF-IDF scores. If you could kindly point me in the right direction or provide some tips on this, I'd greatly appreciate it.
Thanks in advance.
Hello @alisio_meneses,
In PROC TEXTMINE, these are the Term-by-Document Matrix Creation Options in the PARSE statement (the termwgt= and cellwgt= settings in the code above).
Koen
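To make the weighting step concrete, here is a minimal sketch, not taken from the code in this thread, that computes a "classic" TF-IDF by hand from raw counts laid out like the OUTPARENT table (_DOCUMENT_, _TERMNUM_, _COUNT_). The table work.toy_counts, its values, and the names doc_freq, idf and tf_idf are hypothetical, and log(N / df) is only one common IDF variant, not necessarily the formula PROC TEXTMINE applies. It assumes _COUNT_ holds unweighted counts. With termwgt=ENTROPY and cellwgt=LOG as in the code above, the values in _COUNT_ should already have been weighted by the procedure, which would explain why no explicit multiplication appears there.

```sas
/* Hypothetical raw counts: one row per (document, term) pair */
data work.toy_counts;
  input _DOCUMENT_ _TERMNUM_ _COUNT_;
  datalines;
1 1 2
1 2 1
2 1 1
3 3 4
;
run;

proc sql noprint;
  /* N = number of distinct documents */
  select count(distinct _DOCUMENT_) into :n_docs trimmed
  from work.toy_counts;

  /* df = number of documents containing each term */
  /* TF-IDF = raw count * log(N / df)              */
  create table work.toy_tfidf as
  select c._DOCUMENT_,
         c._TERMNUM_,
         c._COUNT_,
         log(&n_docs. / d.doc_freq) as idf,
         c._COUNT_ * calculated idf as tf_idf
  from work.toy_counts as c
       inner join
       (select _TERMNUM_, count(distinct _DOCUMENT_) as doc_freq
        from work.toy_counts
        group by _TERMNUM_) as d
       on c._TERMNUM_ = d._TERMNUM_
  order by c._DOCUMENT_, c._TERMNUM_;
quit;
```

The result keeps the same long layout as the term-by-document table, so a PROC TRANSPOSE step like the one shown earlier could use tf_idf instead of _COUNT_ in its VAR statement to get one column per term.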