Melanie3
Calcite | Level 5

Hello everyone,

Is it possible to build a TF-IDF matrix in SAS Viya Model Studio? The Text Mining node only creates topic columns.

Is there a way to build the "classic" TF-IDF?

Thanks a lot!

Melanie

3 REPLIES
BigRider
Obsidian | Level 7

Hi Melanie,

I was reading through old messages and saw that yours had not been answered. I have used the approach below with good results for a customer. How well it works depends on your text data and on your model's assumptions, but for the general task of classifying a document into a class it works fine. You can generate TF-IDF / SVD features and use them as inputs to a classification model.

	/* TF-IDF / SVD parameters */
	%let min_freq_doc = 150;
	%let min_freq_overall = 150;
	%let numLabels = 10;
	%let max_k = 50;
	%let LimitTermQty = 750;

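	/* Partition (the *_train and *_test tables used below are assumed to be split out of partitioned_data; that step is not shown here) */
	proc partition data=&caslibname..BASE_DEFERIMENTO_PROC_SS samppct=15 samppct2=15 seed=8;
		by DEFERIMENTO;
		output out=&caslibname..partitioned_data copyvars=(_ALL_);
	run;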

	/* Parse the training text, weight the term-by-document matrix and compute the SVD */
	proc textmine data=&caslibname..partitioned_data_train language=Portuguese;
		doc_id Document_ID;
		var texto;
		parse
			nonoungroups
			termwgt   = ENTROPY   /* term weight */
			cellwgt   = LOG       /* cell (frequency) weight */
			entities  = NONE
			reducef   = &min_freq_doc.
			/* start = RDMF.STARTLIST   (if you have a start list) */
			outparent = &caslibname..outparent_train
			outterms  = &caslibname..outterms_train
			outpos    = &caslibname..outpos_train
			outchild  = &caslibname..outchild_train
			outconfig = &caslibname..outconfig_train;
		select "PPOS" "DET" "PN" "N" / ignore;
		svd
			max_k     = &max_k.
			numlabels = &numLabels.
			outdocpro = &caslibname..outdocpro_train
			svdu      = &caslibname..svdu_train
			outtopics = &caslibname..outtopics_train;
		savestate rstore=&caslibname..outsvdmodel_train;
	run;

	/* Save to a physical library so the test data set can be scored later */
	data RDMF.outterms_train;
		set &caslibname..outterms_train;
	run;
	
	data RDMF.outconfig_train;
		set &caslibname..outconfig_train;
	run;
	data RDMF.svdu_train;
		set &caslibname..svdu_train;
	run;
	
	/* Sort and remove duplicate terms */
	PROC SORT data=&caslibname..outterms_train out=&caslibname..outterms_train_aux NODUPKEY;
		BY Parent_id Term;
	RUN;

	/* Keep only non-blank terms of 4 to 25 characters whose overall frequency exceeds min_freq_overall */
	PROC SORT data=&caslibname..outterms_train_aux out=&caslibname..outterms_train_aux NODUPKEY;
		BY Parent_id Term;
		WHERE ((Term not in (""," ")) and (LENGTH(Term) > 3) and (LENGTH(Term) <= 25) and (Freq > &min_freq_overall.));
	RUN;
	
	/* RENAME */
	DATA RDMF.outterms_train_aux;
		SET &caslibname..outterms_train_aux (RENAME=(Parent_id=_TERMNUM_));
		Term = compress(Term);
	RUN;
	
	PROC SORT DATA=RDMF.outterms_train_aux OUT=RDMF.outterms_train_aux;
		BY _TERMNUM_;
	RUN;
	
	/* Limit the number of terms, for algorithms that become expensive with a large number of features */
	/* Note: terms are ranked by the standard deviation of _COUNT_ across documents, despite the FREQ_TERM name */
	PROC SQL;
		CREATE TABLE outparent_train_aux AS
		SELECT _TERMNUM_, STD(_COUNT_) AS FREQ_TERM
		FROM &caslibname..outparent_train
		GROUP BY _TERMNUM_
		ORDER BY FREQ_TERM DESC;
	QUIT;

	DATA &caslibname..outparent_train_aux;
		SET outparent_train_aux(obs=&LimitTermQty);
	RUN;

	PROC SORT DATA=&caslibname..outparent_train_aux;
		BY _TERMNUM_;
	RUN;

	DATA RDMF.outparent_train_aux;
		SET &caslibname..outparent_train_aux;
	RUN;
	
	DATA &caslibname..outparent_train_red;
	MERGE &caslibname..outparent_train (IN=PAR) &caslibname..outparent_train_aux (IN=AUX KEEP=_TERMNUM_);
	BY _TERMNUM_;
	IF AUX;
	RUN;

	proc contents data=&caslibname..outparent_train_red;run;
	
	data &caslibname..outterms_train_aux ;
	set RDMF.outterms_train_aux ;
	run;

	/* Merge */
	DATA RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
	MERGE &caslibname..outparent_train_red (IN=PAR) &caslibname..outterms_train_aux (IN=TERMS KEEP=Term _TERMNUM_);
	BY _TERMNUM_;
	IF PAR;
	RUN;

	proc sort data=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX out=RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX NODUPKEY;
		BY _DOCUMENT_ _TERMNUM_;
	RUN;

	/* Label the rows with the data set name; the TF_IDF_ column prefix itself is applied by PREFIX= in PROC TRANSPOSE below */
	data RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
		set RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX;
		FORMAT DATASET $20.;
		TF_IDF_ = 'TF_IDF_';
		DATASET = 'Train';
	run;


	/*TRANSPOSE to get the matrix*/
	proc transpose data = RDMF.MATRIZ_TERMOS_DOCUMENTOS_AUX 
	out = RDMF.MATRIZ_TERMOS_DOCUMENTOS PREFIX=TF_IDF_;
	by _DOCUMENT_;
	id _TERMNUM_;	
	var _COUNT_;
	run;
	
	/* Finally, score the test data set with the terms, configuration and SVD from training */
	proc tmscore
	  data      = &caslibname..partitioned_data_test 
	  terms     = &caslibname..outterms_train
	  config    = &caslibname..outconfig_train
	  outparent = &caslibname..outparent_test
	  svdu = &caslibname..svdu_train
	  svddocpro = &caslibname..outdocpro_test;
	  doc_id      numero_processo;
	  var  texto;
	run;
	
	/* Repeat the same steps for the validation and test data sets, then join everything to use as input to your model */
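
If a literal TF × IDF product is wanted rather than the weights PROC TEXTMINE writes to _COUNT_, one option is to compute it from raw counts. What follows is only a minimal sketch, not part of the original post. It assumes an OUTPARENT-shaped table produced with CELLWGT=NONE and TERMWGT=NONE, so that _COUNT_ holds raw term frequencies. The table names work.counts, work.tfidf and work.tfidf_matrix and the toy rows are hypothetical.

```sas
/* Hypothetical toy counts shaped like an OUTPARENT table
   (assumed to hold raw term frequencies in _COUNT_). */
data work.counts;
   input _TERMNUM_ _DOCUMENT_ _COUNT_;
   datalines;
1 1 2
1 2 1
2 1 1
3 3 4
;
run;

proc sql noprint;
   /* N: number of distinct documents */
   select count(distinct _DOCUMENT_) into :n_docs trimmed
   from work.counts;

   /* df: number of documents containing each term.
      tf_idf = tf * log(N / df), using the natural log. */
   create table work.tfidf as
   select c._DOCUMENT_,
          c._TERMNUM_,
          c._COUNT_ as tf,
          log(&n_docs. / d.df) as idf,
          calculated tf * calculated idf as tf_idf
   from work.counts as c
        inner join
        (select _TERMNUM_, count(distinct _DOCUMENT_) as df
         from work.counts
         group by _TERMNUM_) as d
        on c._TERMNUM_ = d._TERMNUM_
   order by c._DOCUMENT_, c._TERMNUM_;
quit;

/* One row per document, one TF_IDF_<term number> column per term */
proc transpose data=work.tfidf
               out=work.tfidf_matrix(drop=_NAME_)
               prefix=TF_IDF_;
   by _DOCUMENT_;
   id _TERMNUM_;
   var tf_idf;
run;
```

Terms that do not occur in a document come out as missing in work.tfidf_matrix and can be set to 0 before modeling. PROC SQL runs on the SAS client rather than in CAS, so for large tables the same arithmetic may be better expressed in a CAS-enabled step.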
	
alisio_meneses
Quartz | Level 8

Hello there, @BigRider

Thank you for sharing your code.

When I reviewed the SAS code, I couldn't locate the step where the TF and IDF values are multiplied to calculate the TF-IDF scores. If you could point me in the right direction or share some tips, I'd greatly appreciate it.

Thanks in advance.

sbxkoenk
SAS Super FREQ

Hello @alisio_meneses,

In PROC TEXTMINE, these are the term-by-document matrix creation options in the PARSE statement:

  • CELLWGT=  Specifies how cells are weighted
  • REDUCEF=  Specifies the frequency for term filtering
  • TERMWGT=  Specifies how terms are weighted
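
For comparison, the code earlier in the thread sets CELLWGT=LOG and TERMWGT=ENTROPY, so the values it transposes are log-entropy weights rather than a literal TF × IDF product. The formulas below are a sketch of the two schemes. The log-entropy form is written as the SAS documentation appears to define it, which is an assumption worth checking against the PROC TEXTMINE reference. Here $f_{ij}$ is the count of term $i$ in document $j$, $g_i$ the total count of term $i$ in the collection, $d_i$ the number of documents containing term $i$, and $n$ the number of documents.

```latex
% Classic TF-IDF weight
w_{ij}^{\text{tf-idf}} = f_{ij} \cdot \log\frac{n}{d_i}

% Log-entropy weight (cell weight LOG times term weight ENTROPY)
w_{ij}^{\text{log-ent}} = L_{ij} \, G_i, \qquad
L_{ij} = \log_2\left(f_{ij} + 1\right), \qquad
G_i = 1 + \sum_{j \,:\, f_{ij} > 0} \frac{p_{ij} \log_2 p_{ij}}{\log_2 n}, \qquad
p_{ij} = \frac{f_{ij}}{g_i}
```

In both schemes a term that is spread evenly over all documents receives a term weight near 0, and a term concentrated in few documents receives a high one.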

 

Koen
