Hey SAS Community,
Asked a question yesterday, but wanted to create a new post which contains more of the specifics I am working with.
I have two input tables:
The first table (denoted Have1) contains the columns CodeName and CodeKeywords in the following format:
CodeName CodeKeywords
AIS TEST ais-test | pressure-drop
FIELD INSTRUMENT ERROR psd
GVL TEST GVL drop
LIFT ISSUE lift | slackline
SOLID LOADING pbu
PRODUCTION BOTTLENECK resd
These keywords are entered by engineers in an application and are retrieved by a PROC HTTP call, as such I cannot guarantee the uniqueness of the keywords. As such, I am happy for the first or last match to be the assigned Code in the event of multiple matches. Moreover, the keywords are set up for use with prxmatch such that a keyword like "ais-test | pressure-drop" would match comments containing either "ais-test" OR "pressure-drop".
The second table (denoted Have2) contains the columns Date and Comment - where the comment is text written by an engineer and can be missing:
Date Daily Comment
27MAR2021             Encountered issues with PSD
27MAR2021              
27MAR2021             Had issues with the lift system, no actions taken
27MAR2021             Experienced pressure-drop
27MAR2021             RESD during morning routine
27MAR2021             No issues encountered
So from the above tables, my desired output would be as follows:
Date Daily Comment CodeSugg
27MAR2021              Encountered issues with PSD                                        FIELD INSTRUMENT ERROR  
27MAR2021              
27MAR2021              Had issues with the lift system, no actions taken           LIFT ISSUE  
27MAR2021              Experienced pressure-drop                                            AIS  TEST 
27MAR2021              RESD during morning routine                                        PRODUCTION BOTTLENECK
27MAR2021              No issues encountered
As illustrated by the above tables, I use the keywords to search the comments for a match - for instance, when finding the word RESD in the "RESD during morning routine" comment the comment should be labelled by the Production Bottleneck code. Until now, I've used prxmatch for this purpose. Additionally, I'd like the resulting Want table to have the same number of rows as the Have2 table (i.e., all comments present in resulting Want table). Note also that if there is no match, no Label should be added in the CodeSugg column for the corresponding comment value.
Previously I hardcoded this and it was working satisfactorily:
data want;
	set have;
	%if %symexist(ASV_FAILURE) %then %do;
		if prxmacth("/&ASV_FAILURE./i", loss_comment) then CODE_SUGG = "AVS Failure";
	%end;
run;
I am now trying to create a more "dynamic" code to facilitate for new unknown codes which may be added in the future. For this purpose I've created the following script:
proc sql noprint;
	select count(*)
   into :rowCountHave1
   from Have1;
quit;
%macro suggestCode();
	
	data want;
		%let iter = 1;
		
		set Have2;
		%let comment=DAILY_COMMENT;
	
		%do %while (&iter. <= &rowCountHave1.);
			set Have1(firstobs=&iter. obs=&iter.);
				%let keyword=strip(CodeKeywords);
				%let code=strip(CodeName);
			
				%if prxmatch("/%str(&keyword.)/i", &comment.) %then CODESUGG=&lcode.;;
				%let iter=%eval(&iter.+1);
		%end;
		
	run;
%mend;
%suggestCode();
The rationale behind the above code was that for each comment I would iterate through the table containing the codes and keywords and check for a match. Disclaimer - I have very little experience with looping in SAS and SAS programming in general and this code is currently not working as expected and I encounter the following error without really understanding why:
- ERROR: Required operator not found in expression: prxmatch("/&keyword./i.", &comment.) - ERROR: The macro SUGGESTCODE will stop executing
I figured the easiest solution for my task was to use a do-while loop for each row in the table (Have1) containing codes and keywords and check for a match for each comment in the Have2 table. Would greatly appreciate any suggestions and comments on how far I am off to get this approach to work.
Something like this should work (untested, please adapt):
data WANT;
  if 0 then set HAVE1 nobs=NOBS;
  set HAVE2;
  do I=1 to NOBS;
    set HAVE1 point=I;
    if prxmatch(catt('/',CODEKEYWORDS,'/i'), trim(COMMENT) ) then CODESUGG=CODENAME;
  end;
run;Another method could use a hash table, but the syntax above is simpler.
or if the HAVE1 table is large, load it beforehand using SASFILE.
Hi Reeza, sorry that was an "end of the day" error by me - it should always be matching the code containing the keyword. In this case it should match the code which has the keyword "psd".
I am not set on using macros - the reason it is set up in a (likely) less-then-effective manner can probably be attributed to my lack of experience with the SAS programming language. This approach was the one that I intuitively thought of and I have not yet worked with arrays in SAS.
How consistent are these for containing the "key words"? I ask because your first example value of " Encountered issues with PSD" contains none of the "keywords" of "ais -test" or "pressure drop". if PSD stands for "pressure drop" then it appears that you should have "PSD" in the keywords for Ais Test. But you have PSD in the keywords for "Field Instrument Error" and that is not the result for the data value of " Encountered issues with PSD".
It seems like your rules may not be firm or rigorous enough to be coding yet.
Would "slack line" be considered the same as "slackline"?
HI Ballardw,
Like mentioned in the post above, the first instance was an error by me. The example has been updated and is now labelled appropriately.
I am actually a bit unsure if "slack line" would match "slackline", I'd have to perform a test when getting to work. However, for the purpose of my code you can assume that it should be exact text matching.
Here is an example of one way to use a temporary array to search for specific "keywords" in text.
data textlines;
   infile datalines truncover;
   input text $ 1-30 ;
datalines;
Some string value
Another text line
Junk Junk Junk
;
data arrayexample;
   set textlines;
   array key (5) $ 15 _temporary_ ('TEXT' 'LINE' 'STRING' 'JUNK' 'ABC');
   array r   (5) ;
   do i=1 to dim(key);
      r[i]= index(upcase(text),strip(key[i]))>0;
   end;
   drop i;
run;
In this case I am placing a result of "found" into a 1/0 coded result variable r where 1 will indicate 'word found'. Note that R1=1 means the first word by order in the Key list is found, R2=2 means the second and so one.
I start with this because we have no idea just how complex your comments might be and this will find the cases where multiple keywords are found.
For those meanings with multiple key words a separate pass through the R array variables could be used to set another variable using the Max function with two or more R variables such as:
Text_or_line = max(R[1], R[2]); to create 1/0 flag numeric variable or if you absolutely NEVER have multiple code values you could use a series of IF/Then such as this to set a single value:
If max(R[1],R[2]) = 1 then codevalue='Text or line found';
Else if R[3]=1 then codevalue= 'String found';
else if R[4]=1 then codevalue= 'Junk found';
else if R[5]=1 then codevalue= 'ABC found';
Which could also be written as a Select
   select ;
      When (R[1],R[2]) codevalue='Text or line found';
      When (R[3])      codevalue= 'String found';
      When (R[4])      codevalue= 'Junk found';
      When (R[5])      codevalue= 'ABC found';
      otherwise;
   end;
With either of these you likely want to set a length for the "Codevalue" variable before assigning a value to control the length of the text.
The _temporary_ in the Key array means that the values are not added to the data set. When creating such an array you need to make sure that specify a number of elements (5) in the example, the length of the longest text value, $ 15 in the example, and the values in a desired order. Note that each individual value is separate, no "list" unless you want to parse each value as well. The comparison I used, INDEX will find "string" if it appears in "stringcombo" so you may want the INDEXW or FINDW function instead. Strip is used on the Key values because some of the comparisons will "pad" the value to the length of the defined value not find "string" because it is looking for "string ". Strip removes the trailing padded values. I'm sort of old school and when looking for exact matches pretty much always force case to match. The function FINDW has a setting for ignore case for comparisons.
Thank you for your thorough response @ballardw, I have not worked with arrays in SAS before so this was a very interesting approach.
I will try and implement a working example from the code you have created based on my data.
Something like this should work (untested, please adapt):
data WANT;
  if 0 then set HAVE1 nobs=NOBS;
  set HAVE2;
  do I=1 to NOBS;
    set HAVE1 point=I;
    if prxmatch(catt('/',CODEKEYWORDS,'/i'), trim(COMMENT) ) then CODESUGG=CODENAME;
  end;
run;Another method could use a hash table, but the syntax above is simpler.
or if the HAVE1 table is large, load it beforehand using SASFILE.
Thanks @ChrisNZ, this worked like a charm! I like the simplicity of the syntax as it will ease the possible debugging other team members will have to perform in the future.
Out of curiosity, and with no expactation for you to spend more time on my issue, I tried a different apporach for solving my issue for my data using macro functions:
%macro loopOverComments();
	data WANT;
		set HAVE2;
		
		call symputx('comment', COMMENT, 'g');
		call execute('%findCode');
		if symexist('code') then do;
			sugg_loss = symget('code');
			call symdel('code');
		end;
	run;
%mend;
%macro findCode();
	data temp;
		set HAVE1;
		call symputx('keywords', CodeKeywords, 'g');
		if prxmatch("/&keywords/i", "&comment") then do;
			call symputx('code', CodeName, 'g');
			stop;
		end;
	run;
%mend;
%loopOverComments();	Would you happen to know why this approach is not working? I have tested the macro variables and, for instance, the "comment" macro is available in the %findcode macro function.
Call execute stacks up commands to execute after the data step finishes.
It looks like you are expecting the result of one data step, the "if symexist('code') to be available in the other and the timing is just not there.
There is another issue about the timing of using a macro variable created in a data step later in the same data step.
Run the code with: options mprint; set and you may see more details about how the macro calls are behaving.
Would you happen to know why this approach is not working?
As mentioned call execute stacks the code to run after the data step, unless it's pure macro code.
Consider:
%let a=0;
data _null_;
  RC= dosubl(" data _null_; call symput('a','1'); run; ") ;
  A=symget('a');
  put A=;
run;
data _null_;
  call execute(" %let a=2; ") ;
  A=symget('a');
  put A=;
run;
A=1
A=2
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
