I have a list of keywords (175) in column A that I would like to search through column B (paragraph form sentences.) I've tried several different ways and keep getting errors.
data want;
set have;
column_B_lc = lowcase(column_B);
do i = 1 to countw(column_A);
word = scan(column_A, i);
if indexc(column_B_lc, ' ' || word || ' ') > 0 then do;
output;
leave;
end;
end;
drop i word column_B_lc; /* Drop temporary variables used in the loop */
run;
any tips/suggestions are welcome!
Here's a fictional example of how I do this.
https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf
Make a simpler example with DATA. You are also using the wrong function to search for words use INDEXW or FINDW.
Do you really have 175 different words in a single variable in one observation? That is what your code would be expecting.
The data step does not "search through a column" as I think you are hoping. It operates on mostly one observation at a time.
Any time you have an error you should include the LOG with all of the messages and the code. Copy the text from the log and on the forum open a text box using the </> icon above the message window. Then paste the text.
If you had something like this
data have;
infile cards truncover ;
length a $20 b $80;
input a b $80.;
cards;
x a b c
y x y z
;
Then to test if the WORD in A is in the string in B you could do something like:
data want;
set have;
found = 0 < findw(b,a,,'sit') ;
run;
Results
Obs a b found 1 x a b c 0 2 y x y z 1
This is what I originally tried, but I keep getting 0 for found throughout the entire dataset, which I know is incorrect. Thanks for your help!
Make sure you are using the right delimiters for the FINDW() function. Make sure to use the i modifier to ignore case and the t modifier to trim the trailing spaces.
Pick just a couple of observations from each to test with so you can be sure about what is happening.
Make sure to look at those values to be sure that there are not any strange invisible characters, TAB, CR, LF, FF, Non-breaking space, null byte, etc.
Try using upcase(),strip() and perhaps compress() or translate() to eliminate bogus characters. Add spaces before and after both the search term and the longer string so you use INDEX() to look for space delimited terms and still match at the start and the end of the string.
index
(cat(' ',upcase(strip(translate(sentence,' ','090A0C0DA000FF'x))),' ')
,cat(' ',upcase(strip(translate(term,' ','090A0C0DA000FF'x))),' ')
)
If none of the posted solutions work, please post a small example of your terms and your data. Feel free to make fake data and show the expected output.
Are you saying you have 175 distinct values for column_a (i.e. 175 rows, each with a single word)?
Or does each observation have a very long character variable (column_a) that has 175 words in it?
And will you be using the same 175 words throughout all values of column_b?
Sorry I didn't explain this very well. Yes, I have 175 distinct values (each a single word) that needs to be throughout all values of column_b.
Here's a fictional example of how I do this.
https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf
This worked PERFECTLY. Thank you so much, I had tried an array before but I must have had an error somewhere because it was not giving me the right output. Thanks again!
What is the desired result? Just identify if your source data matches to at least one word in your list or some table that gives your per row of your source data all the matching words?
@achapman wrote:
Sorry I didn't explain this very well. Yes, I have 175 distinct values (each a single word) that needs to be throughout all values of column_b.
Sounds like you have two separate datasets. The first one has 175 observations, each of which as a single word.
data terms;
input term $20.;
cards;
The
Black
Cat
;
And a second dataset that has some other number of observations.
data sentences;
input sentence $80;
cards;
Our car is black.
She wears a hat.
The cat is black.
;
Perhaps by some strange coincidence that dataset also has 175 observations and that is what caused you to think they formed a single dataset?
if you want to match every term with every sentence then perhaps SQL is the way to go? Perhaps you only want to keep the terms that were found?
proc sql;
create table want as
select a.term
, findw(b.sentence,a.term,' .!,','site') as wordnum
, b.sentence
from terms a
, sentences b
where calculated wordnum
order by 3,2,1
;
quit;
Result
Obs term wordnum sentence 1 Black 4 Our car is black. 2 The 1 The cat is black. 3 Cat 2 The cat is black. 4 Black 4 The cat is black.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.