BookmarkSubscribeRSS Feed
jhs2171
Obsidian | Level 7

Hello 

 

I am trying to get SAS to read through a list of keywords in an excel spreadsheet and then search for those words in variables from a SAS dataset. The list file only has one column called 'keyword' and has one keyword per row. I want SAS to look for the keyword(s) in two variables called 'title' and 'description' 

 

What I want to do is setting those keywords into a macro and then using the scan and findw functions to search for the words. At the end of the process, I'd like to have a list of records from the SAS datasets that contain any of those keywords including ID, title, description, and the matched keyword(s). 

 

I've used the code below to create a macro. Any changes I should make? From here, I am not sure how to proceed. Any advice/suggestion would be much appreciated. Thanks! 

proc sql;
select keyword into:list 
from wordlist
where keyword ne "" ;
quit;
%put &list;

 

 

15 REPLIES 15
novinosrin
Tourmaline | Level 20

Hi @jhs2171  Are you missing separated by ',' 

proc sql;
select keyword into:list  separated by ','
from wordlist
where keyword ne "" ;
quit;
%put &list;

 to start with? 

Patrick
Opal | Level 21

And you probably want your keywords also in quotes.

proc sql;
select cats("'",keyword,"'") into:list  separated by ','
from wordlist
where keyword ne "" ;
quit;
%put &list;
jhs2171
Obsidian | Level 7
Thanks for your suggestion! Can you please explain why I would want keywords in quotes? Also, why would some keywords end up looking like this: 'Phenotype '. Thanks!
jhs2171
Obsidian | Level 7
I meant extra space between the text and the quote at the end
ballardw
Super User

What are " ID, title, description"? Are these variables? or values that are supposed to be added to your key word list for searching?

Can't really make much in the way of recommendations without knowing what the start data looks like and what the result should actually be. It would be best to provide a small example data set, the key words, and the desired result when using the example data and list.

Some details are also missing. Is the match supposed to be case sensitive or not? Do any of your keywords contain blanks? or possibly "special" characters like *$@#! and the like? or quotes? @Patrick's approach may have issues in use if you have words with single quotes as part of the values, and other approaches would have issues with double quotes.

 

jhs2171
Obsidian | Level 7

Sorry, I should have explained it better. ID, title, and description are variables from the SAS dataset.

Matching doesn't have to be case sensitive. No special characters or quotes. I don't know how to upload a sample dataset here but I hope this helps. 

 

The list in the excel spreadsheet has one column named 'keyword'. Examples: spinal tap, nerve, CNS, Guillain-Barre syndrome, Stroke, etc. 

The SAS dataset has multiple variables including ID, title, description etc.: Apart from ID which was auto-populated, everything else is free text 

I would like the final dataset to have a list of records from the SAS dataset that contains any of the keywords from the list along with its ID, title, and description. 

 

I hope this clarifies things a bit. 

 

 

 

 

 

Patrick
Opal | Level 21

If we find a match are we just writing your source record to a result table or do you need a row for every single match where we also add the information which word matched in which variable at which position (or alternatively that it was the n'th word in the string).

jhs2171
Obsidian | Level 7

I would imagine the final dataset would something like this: 

 

ID      Keyword             Title                                                                       Description                                                                                    

191      Smell                Questions about symptoms                                  I am having trouble smelling...                                                 

200      brain                 inpatient reimbursement brain surgery                Where can I find the reimbursement form?......                             

140     Paralysis           FW: lab sample                                                    Can you provide me guidance on paralysis resources 

 

In addition, it would be nice to know how I can get SAS to grab additional variables (e.g. email or phone number) from the SAS dataset and include them in the final dataset if needed. Incidentally, would the length of the description matter? The description was exported from an email and I think some could go over 32k. 

Patrick
Opal | Level 21

Why does term Smell match with smelling? That's going to make things much harder. What's the rule here? If the word starts with the search term then it's already a match. 

Also: What should happen if more than one search term matches? Or the same search term matches multiple times - eventually in string(s) from the same or different variables.

 

It's normally best to specify things in detail before starting to code anything.

jhs2171
Obsidian | Level 7

So SAS would treat 'smelling' and 'smell' differently because they aren't identical? Hmm.. if that's the case I should probably add a whole bunch of keywords based on proximity/lemmitization-- but that's for another day. Thanks for pointing that out. In the past, I had a slew of keywords ending with a particular string 'XYZ' and coded SAS to pick up all the words/phrases that end with 'XYZ' so thought it would work in a similar way. 

 

If the same search term matches multiple times within a record I would count it as 1. However, if multiple search terms were found in one record, I'd like SAS to be able to identify them and list them as such; in turn, we can end up having multiple rows per ID in the final dataset. I am trying to identify how many of those records were asking about a certain topic using the list of keywords that are associated with the topic. These are really great questions and I appreciate your feedback.

Thank you!  

ballardw
Super User

@jhs2171 wrote:

Sorry, I should have explained it better. ID, title, and description are variables from the SAS dataset.

Matching doesn't have to be case sensitive. No special characters or quotes. I don't know how to upload a sample dataset here but I hope this helps. 

 

The list in the excel spreadsheet has one column named 'keyword'. Examples: spinal tap, nerve, CNS, Guillain-Barre syndrome, Stroke, etc. 

 


Spinal tap contains a space

Guillain-Barre syndrome contains both a dash and a space

 

The characters are important because of how you would have to define "word" in either of FINDW or SCAN function.

In Findw and Scan funtions the default delimiters between words are (ASCII , EBCDIC is similar): blank ! $ % & ( ) * + , - . / ; < ^ |

So your dash and space would cause issues with searching. From your example I really would not be surprised to find () in your list either. Since you are looking at medical terms then is A+  as blood type going to appear in list? or any of the measures that might have % such as medicine solution percentages or oxygen saturation readings?

jhs2171
Obsidian | Level 7
Oi. I see. I just took another look and I don't have any special characters including (), +, $%^*( etc. But you are right--the list contains space and dash which would cause issues with searching. Do you have any other suggestions? Thank you!
s_lassen
Meteorite | Level 14

To me it sounds like you will want to put the keywords into an array. There is no reason to use a macro variable for that, just read the WORDLIST data set in the beginning of your data step:

data _null_; /* get number for words for array dimension */
  call symputx('nwords',nobs); /* now corrected to 'symputx' */
  if 0 then set wordlist nobs=nobs;
  stop;
run;
  
Data want;
  array keywords (&nwords) $20 _temporary_;
  if _N_= 1 then do _I_=1 to &nwords;
    set wordlist;
    keywords(_I_)=keyword;
    end;
  set <sas dataset>;
  length found_in_title found_in_description $200;
  do _I_=1 to &nwords;
    if findw(title,keywords(_I_))>0 then
      call catx(',',found_in_title,keywords(_I_));
    if findw(description,keywords(_I_))>0 then
      call catx(',',found_in_description,keywords(_I_));    
    end;
  if lengthn(found_in_description)>0 or lengthn(found_in_title)>0;
drop keyword; run;

Or you can drop the array and just read the whole keywords table every time 

Data want;
  set <sas dataset>;
  length found_in_title found_in_description $200;
  do _N_=1 to nobs;
    set wordlist point=_N_ nobs=nobs;
    if findw(title,keyword)>0 then
      call catx(',',found_in_title,keyword);
    if findw(description,keyword)>0 then
      call catx(',',found_in_description,keyword);    
    end;
  if lengthn(found_in_description)>0 or lengthn(found_in_title)>0;
drop keyword; run;

To do this you must be able to read the WORDLIST table with POINT=, this may mean that you will have to copy it from Excel to SAS WORK. Which is probably a good idea anyway for this approach.

 

Note: answer has been edited to correct a couple of mistakes.

jhs2171
Obsidian | Level 7
Thanks! I tried the array option and got the following error message: The subroutine SYMPUTS is unknown, or cannot be accessed-- Any suggestion?

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 5848 views
  • 3 likes
  • 5 in conversation