Solved: Searching for keywords in a column

achapman · Posted 08-09-2023 01:25 PM

I have a list of keywords (175) in column A that I would like to search through column B (paragraph form sentences.) I've tried several different ways and keep getting errors.

data want;
set have;

column_B_lc = lowcase(column_B);

do i = 1 to countw(column_A);
word = scan(column_A, i);

if indexc(column_B_lc, ' ' || word || ' ') > 0 then do;
output;
leave;
end;
end;

drop i word column_B_lc; /* Drop temporary variables used in the loop */
run;

any tips/suggestions are welcome!

Reeza · Posted 08-09-2023 06:46 PM

Here's a fictional example of how I do this.

Create words in a data set (terms)
Determine number of words in list (required to declare temporary array) (num_search_terms)
In data step, load words from terms into a temporary array which can then be looped over in the data step in each row. Temporary arrays are loaded to memory so this is fairly efficient.
Search through and flag for each row. This example just flags a 1 or 0 but you could flag which term was found or if multiple terms are found.

https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf

View solution in original post

data_null__ · Posted 08-09-2023 01:36 PM

Make a simpler example with DATA. You are also using the wrong function to search for words use INDEXW or FINDW.

ballardw · Posted 08-09-2023 03:50 PM

Do you really have 175 different words in a single variable in one observation? That is what your code would be expecting.

The data step does not "search through a column" as I think you are hoping. It operates on mostly one observation at a time.

Any time you have an error you should include the LOG with all of the messages and the code. Copy the text from the log and on the forum open a text box using the </> icon above the message window. Then paste the text.

Tom · Posted 08-09-2023 04:01 PM

If you had something like this

data have;
  infile cards truncover ;
  length a $20 b $80;
  input a b $80.;
cards;
x a b c 
y x y z
;

Then to test if the WORD in A is in the string in B you could do something like:

data want;
  set have;
  found = 0 < findw(b,a,,'sit') ;
run;

Results

Obs    a      b      found

 1     x    a b c      0
 2     y    x y z      1

achapman · Posted 08-10-2023 01:21 PM

This is what I originally tried, but I keep getting 0 for found throughout the entire dataset, which I know is incorrect. Thanks for your help!

Tom · Posted 08-10-2023 01:42 PM

Make sure you are using the right delimiters for the FINDW() function. Make sure to use the i modifier to ignore case and the t modifier to trim the trailing spaces.

Pick just a couple of observations from each to test with so you can be sure about what is happening.

Make sure to look at those values to be sure that there are not any strange invisible characters, TAB, CR, LF, FF, Non-breaking space, null byte, etc.

Try using upcase(),strip() and perhaps compress() or translate() to eliminate bogus characters. Add spaces before and after both the search term and the longer string so you use INDEX() to look for space delimited terms and still match at the start and the end of the string.

index
(cat(' ',upcase(strip(translate(sentence,' ','090A0C0DA000FF'x))),' ')
,cat(' ',upcase(strip(translate(term,' ','090A0C0DA000FF'x))),' ')
)

Reeza · Posted 08-10-2023 01:50 PM

If none of the posted solutions work, please post a small example of your terms and your data. Feel free to make fake data and show the expected output.

mkeintz · Posted 08-09-2023 06:14 PM

Are you saying you have 175 distinct values for column_a (i.e. 175 rows, each with a single word)?

Or does each observation have a very long character variable (column_a) that has 175 words in it?

And will you be using the same 175 words throughout all values of column_b?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

achapman · Posted 08-09-2023 06:37 PM

Sorry I didn't explain this very well. Yes, I have 175 distinct values (each a single word) that needs to be throughout all values of column_b.

Reeza · Posted 08-09-2023 06:46 PM

Here's a fictional example of how I do this.

Create words in a data set (terms)
Determine number of words in list (required to declare temporary array) (num_search_terms)
In data step, load words from terms into a temporary array which can then be looped over in the data step in each row. Temporary arrays are loaded to memory so this is fairly efficient.
Search through and flag for each row. This example just flags a 1 or 0 but you could flag which term was found or if multiple terms are found.

https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf

achapman · Posted 08-10-2023 02:29 PM

This worked PERFECTLY. Thank you so much, I had tried an array before but I must have had an error somewhere because it was not giving me the right output. Thanks again!

Patrick · Posted 08-09-2023 08:00 PM

What is the desired result? Just identify if your source data matches to at least one word in your list or some table that gives your per row of your source data all the matching words?

Tom · Posted 08-09-2023 11:49 PM

@achapman wrote:

Sorry I didn't explain this very well. Yes, I have 175 distinct values (each a single word) that needs to be throughout all values of column_b.

Sounds like you have two separate datasets. The first one has 175 observations, each of which as a single word.

data terms;
  input term $20.;
cards;
The
Black
Cat
;

And a second dataset that has some other number of observations.

data sentences;
   input sentence $80;
cards;
Our car is black.
She wears a hat.
The cat is black.
;

Perhaps by some strange coincidence that dataset also has 175 observations and that is what caused you to think they formed a single dataset?

if you want to match every term with every sentence then perhaps SQL is the way to go? Perhaps you only want to keep the terms that were found?

proc sql;
create table want as
select a.term
     , findw(b.sentence,a.term,' .!,','site') as wordnum
     , b.sentence
from terms a 
   , sentences b
where calculated wordnum
order by 3,2,1
;
quit;

Result

Obs    term     wordnum        sentence

 1     Black       4       Our car is black.
 2     The         1       The cat is black.
 3     Cat         2       The cat is black.
 4     Black       4       The cat is black.

Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Re: Searching for keywords in a column

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!