TF-IDF with SAS Viya

Melanie3 · Posted 11-20-2020 02:28 AM

Hello everyone,

Is it possible to build a TF-IDF matrix in SAS Viya Model Studio? The Text Mining node only creates topic columns.

Is there a way to build the "classic" TF-IDF?

Thanks a lot!

Melanie

BigRider · Posted 09-29-2022 10:31 AM

Hi Melanie,

I was reading old messages and saw that yours had not been answered. I have used the approach below with good results for a customer. It depends, of course, on your text data and on the assumptions of your model, but for the general task of classifying a document into a class it works fine. You can generate TF-IDF / SVD features and use them in a classification model.

	/* TF-IDF / SVD parameters */
	%let min_freq_doc =  150;
	%let min_freq_overall = 150;
	%let numLabels = 10;
	%let max_k = 50;
	%let LimitTermQty = 750;

	/*Partition*/
	proc partition data=&caslibname..BASE_DEFERIMENTO_PROC_SS  samppct=15 samppct2=15 seed=8;
		by DEFERIMENTO;
		output out=&caslibname..partitioned_data copyvars=(_ALL_);
	run;

	/* Parse the text and run the SVD */
	proc textmine data=&caslibname..partitioned_data_train LANGUAGE=Portuguese;
	doc_id Document_ID ;
	var texto;
	parse
	   nonoungroups
	   termwgt  = ENTROPY
	   cellwgt  = LOG
	   entities = NONE
	   reducef= &min_freq_doc.
	/*   start = RDMF.STARTLIST - If you have one*/
	   outparent= &caslibname..outparent_train
	   outterms = &caslibname..outterms_train
	   outpos = &caslibname..outpos_train
	   outchild = &caslibname..outchild_train
	   outconfig= &caslibname..outconfig_train;
	   
	   select "PPOS" "DET" "PN" "N" /ignore;
	svd 
	   max_k=&max_k. 
	   numlabels=&numLabels.
	   outdocpro=&caslibname..outdocpro_train
	   svdu=&caslibname..svdu_train
	   outtopics=&caslibname..outtopics_train;
	   savestate rstore=&caslibname..outsvdmodel_train;
	run;

	/* Save to a permanent library so the test data set can be scored later */
	data RDMF.outterms_train;
		set &caslibname..outterms_train;
	run;
	
	data RDMF.outconfig_train;
		set &caslibname..outconfig_train;
	run;
	data RDMF.svdu_train;
		set &caslibname..svdu_train;
	run;
	
	/* SORT */
	PROC SORT data = &caslibname..outterms_train out=&caslibname..outterms_train_aux NODUPKEY;
	by Parent_id Term;
	RUN;

	/* Limit terms by their overall frequency across documents */
	PROC SORT data=&caslibname..outterms_train_aux out=&caslibname..outterms_train_aux NODUPKEY;
		BY Parent_id Term;
		WHERE ((Term not in (""," ")) and (LENGTH(Term) > 3) and (LENGTH(Term) <= 25) and (Freq > &min_freq_overall.));
	RUN;

	/* RENAME */
	DATA RDMF.outterms_train_aux;
		SET &caslibname..outterms_train_aux (RENAME=(Parent_id=_TERMNUM_));
		Term = compress(Term);
	RUN;
	
	PROC SORT DATA=RDMF.outterms_train_aux OUT=RDMF.outterms_train_aux;
		BY _TERMNUM_;
	RUN;
	
	/* Limit the number of terms, for algorithms that become expensive with a large number of features */
	PROC SQL;
		CREATE TABLE outparent_train_aux AS
		/* Note: STD ranks terms by the spread of _COUNT_ across documents, despite the FREQ_TERM name */
		SELECT _TERMNUM_, STD(_COUNT_) AS FREQ_TERM
		FROM &caslibname..outparent_train
		GROUP BY _TERMNUM_
		ORDER BY FREQ_TERM DESC;
	QUIT;

	DATA &caslibname..outparent_train_aux;
		SET outparent_train_aux(obs=&LimitTermQty);
	RUN;

	PROC SORT DATA=&caslibname..outparent_train_aux;
		BY _TERMNUM_;
	RUN;

	DATA RDMF.outparent_train_aux;
		SET &caslibname..outparent_train_aux;
	RUN;
	
	DATA &caslibname..outparent_train_red;
	MERGE &caslibname..outparent_train (IN=PAR) &caslibname..outparent_train_aux (IN=AUX KEEP=_TERMNUM_);
	BY _TERMNUM_;
	IF AUX;
	RUN;

	proc contents data=&caslibname..outparent_train_red;run;
	
	data &caslibname..outterms_train_aux ;
	set RDMF.outterms_train_aux ;
	run;

	/* Merge */
	DATA RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
	MERGE &caslibname..outparent_train_red (IN=PAR) &caslibname..outterms_train_aux (IN=TERMS KEEP=Term _TERMNUM_);
	BY _TERMNUM_;
	IF PAR;
	RUN;

	proc sort data=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX out=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX NODUPKEY;
		BY _DOCUMENT_ _TERMNUM_;
	RUN;

	/* Add helper columns; the TF_IDF_ prefix on the matrix columns comes from PREFIX= in PROC TRANSPOSE below */
	data  RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
	set  RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
	FORMAT DATASET $20.;
	TF_IDF_ = 'TF_IDF_'; 
	DATASET = 'Train';
	run;


	/*TRANSPOSE to get the matrix*/
	proc transpose data = RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX 
	out = RDMF.MATRIZ_TERMOS_DOCUMENTOS PREFIX=TF_IDF_;
	by _DOCUMENT_;
	id _TERMNUM_;	
	var _COUNT_;
	run;
	
	/* Finally, score the test data set using the training terms, config, and SVD U matrix */
	proc tmscore
	  data      = &caslibname..partitioned_data_test 
	  terms     = &caslibname..outterms_train
	  config    = &caslibname..outconfig_train
	  outparent = &caslibname..outparent_test
	  svdu = &caslibname..svdu_train
	  svddocpro = &caslibname..outdocpro_test;
	  doc_id      numero_processo;
	  var  texto;
	run;
	
	/* Repeat the same steps for the validation and test data sets, then join everything to use as input to your model */

alisio_meneses · Posted 12-08-2023 11:32 AM

Hello there, @BigRider

Thank you for sharing your code.

As I reviewed the SAS code provided, I couldn't locate the specific step where TF and IDF values are multiplied to calculate the TF-IDF scores. If you could kindly point me in the right direction or provide some tips on this, I'd greatly appreciate it.

Thanks in advance.
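
For context on this question: the PARSE statement in the code above sets termwgt = ENTROPY and cellwgt = LOG, so the posted code does not contain a literal TF × IDF multiplication. As a reference point, and assuming the common textbook definitions rather than anything confirmed in this thread, the two weightings can be written as below. The symbols are notation introduced here, not names from the code: f_{ij} is the count of term i in document j, N is the number of documents, n_i is the number of documents containing term i, and p_{ik} is the share of term i's total count that falls in document k.

```latex
% Classic TF-IDF: raw term frequency times inverse document frequency
\[
  \text{tf-idf}_{ij} = f_{ij} \cdot \log\frac{N}{n_i}
\]

% Log-entropy weighting: log cell weight times entropy term weight
\[
  w_{ij} = \log_2\!\left(f_{ij} + 1\right)
           \cdot \left(1 + \sum_{k=1}^{N} \frac{p_{ik}\,\log_2 p_{ik}}{\log_2 N}\right),
  \qquad
  p_{ik} = \frac{f_{ik}}{\sum_{l=1}^{N} f_{il}}
\]
```

In both forms a per-cell weight is multiplied by a per-term weight, so the product is part of the weighting itself and not a separate step. Whether PROC TEXTMINE uses exactly these forms, and whether _COUNT_ in the OUTPARENT table holds the weighted values, should be checked against the procedure's documentation.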