antony1
Fluorite | Level 6

Hello,

I am trying to build a macro program for simple text analysis, following the methodology of Paper 2557-2018, "A simple approach to text analysis using SAS functions" (available at <https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2557-2018.pdf>). However, I have been unable to replicate the authors' results, either with my own data or with the example data set they use, sashelp.adsmsg. I believe this is due to how SAS reads the selected variable in the INDEX and INDEXW functions inside the macro, but I am still learning SAS and my debugging skills leave a bit to be desired. Any help or ideas would be greatly appreciated!

I have included the full code I used to build the program below. The first module works well:

%LET ds_name = sashelp.adsmsg;
%LET Var_Data_Source = TEXT; 

DATA Bag_of_words;
	SET &ds_name;
	Var_Data_Source=COMPBL(TRANSLATE(&Var_Data_Source, " " , "."","";"":""?""!""-""/""\""
														   ""%""1""2""3""4""5""6""7""8""9""0""$""@""#"")""("));
	N_words=COUNTW(&Var_Data_Source);
	ARRAY word {1:1000} $50 _TEMPORARY_;
		DO i=1 TO N_words;
		RETAIN word_i;
		word(i)=SCAN( Var_Data_Source, i, ' ' );
		word_i=UPCASE(word(i));
		IF (i>1) THEN word_2i= UPCASE(catx(" ",word(i-1), word_i));
		IF (i>2) THEN word_3i= UPCASE(catx(" ",word(i-2), word(i-1), word_i));
		IF (i>3) THEN word_4i= UPCASE(catx(" ",word(i-3), word(i-2), word(i-1), word_i));
		IF (i>4) THEN word_5i= UPCASE(catx(" ",word(i-4), word(i-3), word(i-2), word(i-1), word_i));
		OUTPUT;
	END;
RUN;

PROC PRINT DATA = Bag_of_words; RUN;
PROC FREQ DATA = Bag_of_words; TABLE word_i word_2i word_3i word_4i/NOCUM; RUN;


Data Term_matrix;
	INFILE datalines;
	INPUT in_words $1-50;
		in_words=TRANSLATE(UPCASE(in_words), "" , '.'','';'':''?''!''-''/''\');
		out_words=SUBSTR(TRANSLATE(STRIP(in_words),"_"," "),1,32);
	DATALINES;
		cultion
		calcultion
		calculation
		;
	RUN;
	PROC PRINT DATA= Term_matrix; RUN;

The second module uses the macro program (listed later):

%INCLUDE "/folders/myfolders/EPG194/Macro_KW_search.sas";

DATA _NULL_;
 SET Term_matrix END=eof;
in_words_1=STRIP("'"||in_words||"'");
CALL SYMPUT("Keyword_in",UPCASE(in_words_1));
CALL SYMPUT("Keyword_out",out_words);
IF (_N_=1) THEN DO;
 CALL EXECUTE('%CERTAINTY_FACTOR (Data_IN=&ds_name, Data_OUT=Term_Doc_Matrix)');
END;
CALL EXECUTE('%KW_SEARCH (KW=&Keyword_in, Var_KW_out=&Keyword_out, Var_Target_doc=TEXT,
 Data_IN=Term_Doc_Matrix, Data_OUT=Term_Doc_Matrix)');
RUN;

PROC SQL NOPRINT;
	SELECT out_words INTO: v_list_comma_sep separated BY ',' FROM Term_matrix;
	SELECT in_words INTO: v_list_blank_sep separated BY ' ' FROM Term_matrix;
QUIT;

PROC TABULATE DATA=Term_Doc_Matrix;
	CLASS certainty_factor TEXT;
	Var &v_list_blank_sep;
	Table (&v_list_blank_sep), (certainty_factor='certainty factor' ALL='No of Terms found')*SUM='' N='No of Documents';
	Table TEXT='Target Documents'*(&v_list_blank_sep), (certainty_factor='certainty factor' ALL='Total')*SUM='';
RUN;

This is the macro program that the authors use, and it is where I believe the problem lies, although I am not sure. If the INDEXW and INDEX functions return a positive value, the match should be recorded in the temporary array and then added to a count variable, but I am unable to achieve this. I tried debugging by calling INDEX and INDEXW directly with the literal value I was looking to match (I am not sure how conventional this is). A value was returned when I tested against Target_truncated but not when I used the KW variable, which leads me to think that the issue is with the KW variable.

 

%MACRO CERTAINTY_FACTOR ( Data_OUT=, Data_IN=);
DATA &Data_OUT;
SET &Data_IN;
	DO Certainty_factor=1 to 3;
		OUTPUT;
	END;
%MEND;

%MACRO KW_SEARCH(KW=, Var_KW_in=, Var_KW_out=, Data_IN=, Data_OUT=, Var_Target_Doc=);
 DATA &Data_OUT Replace;
 ATTRIB Flag_success length=3;
 KW= UPCASE(&KW);
 SET &Data_IN;
 Target_string=UPCASE(COMPBL(TRANSLATE(&Var_Target_Doc, " " , ".%,;:?!-/\")));
 KW_words = COUNTW(KW);
 N_words = COUNTW(Target_string); 

 ARRAY word {1:1000} $50 _TEMPORARY_ ; 
 ARRAY IdW {1:100} _TEMPORARY_ ; 
 ARRAY Idx {1:100} _TEMPORARY_ ; 
 ARRAY Sdx {1:100} _TEMPORARY_ ;
 Soundex_Count=0; Index_count=0; IndexW_count=0; 
 
DO i=1 TO (N_words);
		word(i)=SCAN( Target_string, i, ' ' ); 
		IF i GE(KW_words) THEN DO;
 		length Target_truncated $50;
 		Target_truncated='';
 			DO j=1 TO KW_words;
			 Target_truncated= UPCASE(STRIP(CATX(" ", word(i-j+1) ,Target_truncated)));
 			END;
		END;

IF (INDEXW(Target_truncated, STRIP(KW))>0) THEN IdW(i)=1; ELSE IdW(i)=0;
IF (INDEX(COMPRESS(Target_truncated), COMPRESS(KW))>0) THEN Idx(i) =1; ELSE Idx(i) =0; 
IF (INDEX(SOUNDEX(Target_truncated), SOUNDEX(KW))>0) THEN Sdx(i) =1; ELSE Sdx(i) =0;
 
IndexW_count=IndexW_count + IdW(i);
Index_count=Index_count + Idx(i);
Soundex_Count=Soundex_Count + Sdx(i);

END;


IF (Certainty_factor=1) AND (IndexW_count>=0) THEN DO;
&Var_KW_out=IndexW_count; END;
IF (Certainty_factor=2) AND ((Index_count-IndexW_count)>=0) THEN DO;
&Var_KW_out=(Index_count-IndexW_count); END;
IF (Certainty_factor=3) AND ((Soundex_Count-IndexW_count)>=0) THEN DO;
&Var_KW_out=(Soundex_Count-IndexW_count); END;
IF &Var_KW_out>=0 THEN Flag_success=1; 

DROP i j KW KW_words N_words Soundex_Count Index_count IndexW_count
Target_truncated Target_string;
%MEND;
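
One way to narrow this down outside the macro is to run the three match tests on literal values and write the results to the log. The step below is a minimal sketch, not part of the paper's code: the literal values are assumptions chosen for illustration, and the variable names simply mirror those used in KW_SEARCH.

```sas
/* Stand-alone check (illustrative values): apply the three tests from   */
/* KW_SEARCH to one target string and one keyword, then log the flags.   */
DATA _NULL_;
   LENGTH Target_truncated $50 KW $50;
   Target_truncated = "CALCULATION";   /* assumed sample word    */
   KW               = "CALCULATION";   /* assumed sample keyword */

   /* Each comparison evaluates to 1 (true) or 0 (false) */
   IdW = (INDEXW(Target_truncated, STRIP(KW)) > 0);               /* whole-word match  */
   Idx = (INDEX(COMPRESS(Target_truncated), COMPRESS(KW)) > 0);   /* substring match   */
   Sdx = (INDEX(SOUNDEX(Target_truncated), SOUNDEX(KW)) > 0);     /* sound-alike match */

   PUT KW= IdW= Idx= Sdx=;
RUN;
```

Changing the KW literal to each keyword in Term_matrix shows which of the three tests fires for it, without any macro code involved.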


If you have any ideas about how I can track down the problem, I would really appreciate the help. Thanks in advance!

ACCEPTED SOLUTION
antony1
Fluorite | Level 6

An update for anyone who comes across this thread: Modules 1 and 2 (in the main program) work, and the issue I was having related to the macro parameters.

I had structured the input parameters in the macro program incorrectly (they were in the wrong spot). The corrected code is below; compared with the macro in my original post, the visible change is that the final IF statements test >0 instead of >=0. I only realised this thanks to the first author of the program. I encourage anyone with an interest in this thread to read Paper 2557-2018, "A simple approach to text analysis using SAS functions", from the 2018 SAS Global Forum. It is an incredibly insightful and relevant contribution, and the code in this thread was developed by its authors and is explained in it.

%MACRO KW_SEARCH(KW=, Var_KW_in=, Var_KW_out=, Data_IN=, Data_OUT=, Var_Target_Doc=);
 DATA &Data_OUT Replace;
 ATTRIB Flag_success length=3;
 KW= UPCASE(&KW);
 SET &Data_IN;
 Target_string=UPCASE(COMPBL(TRANSLATE(&Var_Target_Doc, " " , ".%,;:?!-/\")));
 KW_words = COUNTW(KW);
 N_words = COUNTW(Target_string); 

 ARRAY word {1:1000} $50 _TEMPORARY_ ; 
 ARRAY IdW {1:100} _TEMPORARY_ ; 
 ARRAY Idx {1:100} _TEMPORARY_ ; 
 ARRAY Sdx {1:100} _TEMPORARY_ ;
 Soundex_Count=0; Index_count=0; IndexW_count=0; 
 
DO i=1 TO (N_words);
		word(i)=SCAN( Target_string, i, ' ' ); 
		IF i GE(KW_words) THEN DO;
 		length Target_truncated $50;
 		Target_truncated='';
 			DO j=1 TO KW_words;
			 Target_truncated= UPCASE(STRIP(CATX(" ", word(i-j+1) ,Target_truncated)));
 			END;
		END;

IF (INDEXW(Target_truncated, STRIP(KW))>0) THEN IdW(i)=1; ELSE IdW(i)=0;
IF (INDEX(COMPRESS(Target_truncated), COMPRESS(KW))>0) THEN Idx(i) =1; ELSE Idx(i) =0; 
IF (INDEX(SOUNDEX(Target_truncated), SOUNDEX(KW))>0) THEN Sdx(i) =1; ELSE Sdx(i) =0;
 
IndexW_count=IndexW_count + IdW(i);
Index_count=Index_count + Idx(i);
Soundex_Count=Soundex_Count + Sdx(i);

END;


IF (Certainty_factor=1) AND (IndexW_count>0) THEN DO;
&Var_KW_out=IndexW_count; END;
IF (Certainty_factor=2) AND ((Index_count-IndexW_count)>0) THEN DO;
&Var_KW_out=(Index_count-IndexW_count); END;
IF (Certainty_factor=3) AND ((Soundex_Count-IndexW_count)>0) THEN DO;
&Var_KW_out=(Soundex_Count-IndexW_count); END;
IF &Var_KW_out>0 THEN Flag_success=1; 
DROP i j KW KW_words N_words Soundex_Count Index_count IndexW_count Target_truncated Target_string;
%MEND;

%MACRO CERTAINTY_FACTOR ( Data_OUT=, Data_IN=);
DATA &Data_OUT;
SET &Data_IN;
	DO Certainty_factor=1 to 3;
		OUTPUT;
	END;
%MEND;
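
Since the issue came down to macro parameters, a quick way to confirm what a macro actually receives is to echo its parameters to the log before any DATA step code runs. The following is a minimal sketch using a hypothetical helper macro, SHOW_KW, which is not part of the paper's code; the parameter names mirror KW_SEARCH and the values are illustrative.

```sas
/* Hypothetical helper (not from the paper): echo keyword parameters to the log */
%MACRO SHOW_KW(KW=, Var_KW_out=);
   %PUT NOTE: KW=&KW Var_KW_out=&Var_KW_out;
%MEND SHOW_KW;

/* KW is passed with quotes because the macro assigns it to a character   */
/* variable; Var_KW_out is passed unquoted because it becomes a variable  */
/* name in the generated DATA step.                                        */
%SHOW_KW(KW='CALCULATION', Var_KW_out=CALCULATION);
```

Setting OPTIONS MPRINT; before calling the real macro also writes the generated DATA step statements to the log, which makes a misplaced parameter easier to spot.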

3 REPLIES
ballardw
Super User

"I tried debugging the program directly by using the INDEX and INDEXW functions directly with the variable value that I was looking to match (I am not sure if this is very conventional) and had values returned for Target_truncated but not for KW variable - leading me to think that the issue is with the KW variable."

Very likely that is the approach to use. It would help if you showed the data and the exact code you used for this test, along with the expected/desired output.

The SASHELP.ADSMSG data set is not part of every SAS installation, mine included, so I have no data to test the code with.


antony1
Fluorite | Level 6

Thanks so much for the reply, I really appreciate it (and the contributions I have read of yours in other threads).

Sorry for not including the data; I didn't realise it was outside the default install. I have included some of the data below (9 obs with 2 variables).

data have;
infile datalines dsd;
input MSGID TEXT ~ $200.;
datalines;
3850, You need to select at least one group.
0410, You must specify a location for the computed level.
0453, You must select a calculation type before you can
1658, The name "%$" is not a valid SAS name.
6110, Synchronization in process...
6372, Start of default install routine
0453, specify the columns to use in the calculation .
1554, Select OK to continue, Cancel to quit.
0297, Column is used in calculation of %1$.
;
run;

 

 

The desired output is the first table produced by the PROC TABULATE step, with the expected result:

               certainty factor
               1    2    3    No of Terms found    No of Documents
CULTION        0    3    0    3                    3
CALCULTION     0    0    3    3                    3
CALCULATION    3    0    0    3                    3

 

For the test, I declared an extra array in the DATA step of the macro:

ARRAY a {1:100};

and, inside the DO loop, used an IF statement to see whether it would be assigned a value:

IF (INDEX(Target_truncated, "CULTION")>0) THEN a(i) =1;

Here is the full code of the macro test:

%MACRO KW_SEARCH(KW=, Var_KW_in=, Var_KW_out=, Data_IN=, Data_OUT=, Var_Target_Doc=);
 DATA &Data_OUT Replace;
 ATTRIB Flag_success length=3;
 KW= UPCASE(&KW);
 SET &Data_IN;
 Target_string=UPCASE(COMPBL(TRANSLATE(&Var_Target_Doc, " " , ".%,;:?!-/\")));
 KW_words = COUNTW(KW);
 N_words = COUNTW(Target_string); 

 ARRAY word {1:1000} $50 _TEMPORARY_ ; 
 ARRAY IdW {1:100} _TEMPORARY_ ; 
 ARRAY Idx {1:100} _TEMPORARY_ ; 
 ARRAY Sdx {1:100} _TEMPORARY_ ;
 ARRAY a {1:100}; 
 Soundex_Count=0; Index_count=0; IndexW_count=0; 
 
DO i=1 TO (N_words);
		word(i)=SCAN( Target_string, i, ' ' ); 
		IF i GE(KW_words) THEN DO;
 		length Target_truncated $50;
 		Target_truncated='';
 			DO j=1 TO KW_words;
			 Target_truncated= UPCASE(STRIP(CATX(" ", word(i-j+1) ,Target_truncated)));
 			END;
		END;
IF (INDEX(Target_truncated, "CULTION")>0) THEN a(i) =1;
IF (INDEXW(Target_truncated, STRIP(KW))>0) THEN IdW(i)=1; ELSE IdW(i)=0;
IF (INDEX(COMPRESS(Target_truncated), COMPRESS(KW))>0) THEN Idx(i) =1; ELSE Idx(i) =0; 
IF (INDEX(SOUNDEX(Target_truncated), SOUNDEX(KW))>0) THEN Sdx(i) =1; ELSE Sdx(i) =0;
 
IndexW_count=IndexW_count + IdW(i);
Index_count=Index_count + Idx(i);
Soundex_Count=Soundex_Count + Sdx(i);

END;


IF (Certainty_factor=1) AND (IndexW_count>=0) THEN DO;
&Var_KW_out=IndexW_count; END;
IF (Certainty_factor=2) AND ((Index_count-IndexW_count)>=0) THEN DO;
&Var_KW_out=(Index_count-IndexW_count); END;
IF (Certainty_factor=3) AND ((Soundex_Count-IndexW_count)>=0) THEN DO;
&Var_KW_out=(Soundex_Count-IndexW_count); END;
IF &Var_KW_out>0 THEN Flag_success=1; 
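
An alternative to adding a helper array is to trace the loop in the log with PUTLOG. The step below is a self-contained sketch under assumptions: the target sentence and keyword are illustrative literals, and the loop is a simplified copy of the windowing logic above rather than the authors' full macro.

```sas
/* Trace the sliding window of words and the whole-word test for each i. */
/* Target_string and KW are assumed sample values, not thread data.      */
DATA _NULL_;
   LENGTH Target_string $200 Target_truncated $50 KW $50;
   ARRAY word {1:1000} $50 _TEMPORARY_;

   Target_string = "COLUMN IS USED IN CALCULATION OF";
   KW            = "CALCULATION";
   KW_words      = COUNTW(KW);
   N_words       = COUNTW(Target_string);

   DO i = 1 TO N_words;
      word(i) = SCAN(Target_string, i, ' ');
      /* Rebuild the window of the last KW_words words ending at word i */
      IF i GE KW_words THEN DO;
         Target_truncated = '';
         DO j = 1 TO KW_words;
            Target_truncated = STRIP(CATX(" ", word(i-j+1), Target_truncated));
         END;
      END;
      hit = (INDEXW(Target_truncated, STRIP(KW)) > 0);
      PUTLOG i= Target_truncated= hit=;   /* one log line per word position */
   END;
RUN;
```

Each log line shows the window that was compared and whether INDEXW matched, so a blank or unexpected KW value shows up immediately.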


As I said, my understanding of SAS is still elementary, so the problem may well lie somewhere else that I'm not aware of.
Thanks again.

 

 
