Hello everyone,
Is it possible to build a TF-IDF matrix in SAS Viya Model Studio? The Text Mining node only creates topic columns.
Is there a way to build the "classic" TF-IDF?
Thanks a lot!
Melanie
Hi Melanie,
I was reading old messages and saw that yours had not been answered. I have used the approach below with good results at a customer. Of course, it depends on your text data and the assumptions of your model, but for general classification of documents into classes it works fine. You can generate TF-IDF / SVD features and use them in a classification model.
/* TF-IDF / SVD Parameters */
%let min_freq_doc = 150;
%let min_freq_overall = 150;
%let numLabels = 10;
%let max_k = 50;
%let LimitTermQty = 750;
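/* Partition: PARTIND adds the _PartInd_ column to the output table */
proc partition data=&caslibname..BASE_DEFERIMENTO_PROC_SS samppct=15 samppct2=15 seed=8 partind;
by DEFERIMENTO;
output out=&caslibname..partitioned_data copyvars=(_ALL_);
run;
/* Split into the tables used below. Assumption: _PartInd_ = 1 and 2 mark the two 15% samples (used here as validation and test) and 0 marks the remaining rows (train) */
data &caslibname..partitioned_data_train &caslibname..partitioned_data_valid &caslibname..partitioned_data_test;
set &caslibname..partitioned_data;
if _PartInd_ = 0 then output &caslibname..partitioned_data_train;
else if _PartInd_ = 1 then output &caslibname..partitioned_data_valid;
else output &caslibname..partitioned_data_test;
run;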
/* SVD */
proc textmine data=&caslibname..partitioned_data_train LANGUAGE=Portuguese;
doc_id Document_ID ;
var texto;
parse
nonoungroups
termwgt = ENTROPY
cellwgt = LOG
entities = NONE
reducef= &min_freq_doc.
/* start = RDMF.STARTLIST - If you have one*/
outparent= &caslibname..outparent_train
outterms = &caslibname..outterms_train
outpos = &caslibname..outpos_train
outchild = &caslibname..outchild_train
outconfig= &caslibname..outconfig_train;
select "PPOS" "DET" "PN" "N" /ignore;
svd
max_k=&max_k.
numlabels=&numLabels.
outdocpro=&caslibname..outdocpro_train
svdu=&caslibname..svdu_train
outtopics=&caslibname..outtopics_train;
savestate rstore=&caslibname..outsvdmodel_train;
run;
/* Save to a physical library so the test data set can be scored later */
data RDMF.outterms_train;
set &caslibname..outterms_train;
run;
data RDMF.outconfig_train;
set &caslibname..outconfig_train;
run;
data RDMF.svdu_train;
set &caslibname..svdu_train;
run;
/* SORT */
PROC SORT data = &caslibname..outterms_train out=&caslibname..outterms_train_aux NODUPKEY;
by Parent_id Term;
RUN;
/* Limit terms by length and by overall frequency across documents */
PROC SORT data=&caslibname..outterms_train_aux out=&caslibname..outterms_train_aux NODUPKEY;
BY Parent_id Term;
WHERE ((Term not in (""," ")) and (LENGTH(Term) > 3) and (LENGTH(Term) <= 25) and (Freq > &min_freq_overall.));
RUN;
/* RENAME */
DATA RDMF.outterms_train_aux;
SET &caslibname..outterms_train_aux (RENAME=(Parent_id=_TERMNUM_));
Term = compress(Term);
RUN;
PROC SORT DATA=RDMF.outterms_train_aux OUT=RDMF.outterms_train_aux;
BY _TERMNUM_;
RUN;
/* LIMITING TERMS QUANTITY - for algorithms that are expensive to run with a large number of features */
PROC SQL;
CREATE TABLE outparent_train_aux AS
SELECT _TERMNUM_, STD(_COUNT_) AS FREQ_TERM
FROM &caslibname..outparent_train
GROUP BY _TERMNUM_
ORDER BY FREQ_TERM DESC;
QUIT;
DATA &caslibname..outparent_train_aux;
SET outparent_train_aux(obs=&LimitTermQty);
RUN;
PROC SORT DATA=&caslibname..outparent_train_aux;
BY _TERMNUM_;
RUN;
DATA RDMF.outparent_train_aux;
SET &caslibname..outparent_train_aux;
RUN;
DATA &caslibname..outparent_train_red;
MERGE &caslibname..outparent_train (IN=PAR) &caslibname..outparent_train_aux (IN=AUX KEEP=_TERMNUM_);
BY _TERMNUM_;
IF AUX;
RUN;
proc contents data=&caslibname..outparent_train_red;run;
data &caslibname..outterms_train_aux ;
set RDMF.outterms_train_aux ;
run;
/* Merge */
DATA RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
MERGE &caslibname..outparent_train_red (IN=PAR) &caslibname..outterms_train_aux (IN=TERMS KEEP=Term _TERMNUM_);
BY _TERMNUM_;
IF PAR;
RUN;
proc sort data=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX out=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX NODUPKEY;
BY _DOCUMENT_ _TERMNUM_;
RUN;
/* Add a prefix marker for the TF-IDF variables */
data RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
set RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
FORMAT DATASET $20.;
TF_IDF_ = 'TF_IDF_';
DATASET = 'Train';
run;
/*TRANSPOSE to get the matrix*/
proc transpose data = RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX
out = RDMF.MATRIZ_TERMOS_DOCUMENTOS PREFIX=TF_IDF_;
by _DOCUMENT_;
id _TERMNUM_;
var _COUNT_;
run;
/* Finally, score the test data set */
/* Score TF-IDF */
proc tmscore
data = &caslibname..partitioned_data_test
terms = &caslibname..outterms_train
config = &caslibname..outconfig_train
outparent = &caslibname..outparent_test
svdu = &caslibname..svdu_train
svddocpro = &caslibname..outdocpro_test;
doc_id numero_processo;
var texto;
run;
/* Repeat the same process for the validation and test data sets, then join them all to use as input to your model */
Hello there, @BigRider
Thank you for sharing your code.
As I reviewed the SAS code provided, I couldn't locate the specific step where TF and IDF values are multiplied to calculate the TF-IDF scores. If you could kindly point me in the right direction or provide some tips on this, I'd greatly appreciate it.
Thanks in advance.
Hello @alisio_meneses ,
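In PROC TEXTMINE, these are the Term-by-Document Matrix Creation Options in the PARSE statement (termwgt= and cellwgt= in the code above). The procedure applies the weights itself, so there is no explicit multiplication step in the code.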
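To make the weighting explicit: with cellwgt=LOG and termwgt=ENTROPY, each entry of the term-by-document matrix is, as I understand the documented definitions, a damped count multiplied by a per-term weight. The following is only a sketch. The notation is my own: f_ij is the count of term i in document j, g_i is the count of term i in the whole collection, n is the number of documents, and d_i is the number of documents containing term i. Summands with p_ij = 0 are taken as zero.

```latex
\begin{aligned}
a_{ij} &= L_{ij}\, G_i \\
L_{ij} &= \log_2\!\left(f_{ij} + 1\right) && \text{(cellwgt = LOG)} \\
G_i &= 1 + \sum_{j=1}^{n} \frac{p_{ij}\,\log_2 p_{ij}}{\log_2 n},
  \qquad p_{ij} = \frac{f_{ij}}{g_i} && \text{(termwgt = ENTROPY)} \\
G_i^{\mathrm{IDF}} &= \log_2\!\left(\frac{n}{d_i}\right) + 1 && \text{(one common IDF variant, for comparison)}
\end{aligned}
```

The entropy weight plays the same role as IDF. It equals 1 for a term that occurs in a single document and 0 for a term spread evenly over all n documents, so the weighted _COUNT_ values behave like a "classic" TF-IDF score even though IDF itself is not the weight used here.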
Koen