BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
achapman
Fluorite | Level 6

I have a list of keywords (175) in column A that I would like to search through column B (paragraph form sentences.) I've tried several different ways and keep getting errors. 

 

data want;
set have;


column_B_lc = lowcase(column_B);


do i = 1 to countw(column_A);
word = scan(column_A, i);


if indexc(column_B_lc, ' ' || word || ' ') > 0 then do;
output;
leave;
end;
end;

drop i word column_B_lc; /* Drop temporary variables used in the loop */
run;

 

 

any tips/suggestions are welcome! 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Here's a fictional example of how I do this. 

 

  • Create words in a data set  (terms)
  • Determine number of words in list (required to declare temporary array) (num_search_terms)
  • In data step, load words from terms into a temporary array which can then be looped over in the data step in each row. Temporary arrays are loaded to memory so this is fairly efficient.
  • Search through and flag for each row. This example just flags a 1 or 0 but you could flag which term was found or if multiple terms are found. 

 

https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf

 

 

View solution in original post

12 REPLIES 12
data_null__
Jade | Level 19

Make a simpler example with DATA.  You are also using the wrong function to search for words use INDEXW or FINDW.

ballardw
Super User

Do you really have 175 different words in a single variable in one observation?  That is what your code would be expecting.

The data step does not "search through a column" as I think you are hoping. It operates on mostly one observation at a time.

 

Any time you have an error you should include the LOG with all of the messages and the code. Copy the text from the log and on the forum open a text box using the </> icon above the message window. Then paste the text.

 

 

Tom
Super User Tom
Super User

If you had something like this

data have;
  infile cards truncover ;
  length a $20 b $80;
  input a b $80.;
cards;
x a b c 
y x y z
;

Then to test if the WORD in A is in the string in B you could do something like:

data want;
  set have;
  found = 0 < findw(b,a,,'sit') ;
run;

Results

Obs    a      b      found

 1     x    a b c      0
 2     y    x y z      1

achapman
Fluorite | Level 6

This is what I originally tried, but I keep getting 0 for found throughout the entire dataset, which I know is incorrect. Thanks for your help!

Tom
Super User Tom
Super User

Make sure you are using the right delimiters for the FINDW() function.  Make sure to use the i modifier to ignore case and the t modifier to trim the trailing spaces.

Pick just a couple of observations from each to test with so you can be sure about what is happening.

Make sure to look at those values to be sure that there are not any strange invisible characters,  TAB, CR, LF, FF, Non-breaking space, null byte, etc.

 

Try using upcase(),strip() and perhaps compress() or translate() to eliminate bogus characters.  Add spaces before and after both the search term and the longer string so you use INDEX() to look for space delimited terms and still match at the start and the end of the string.

index
(cat(' ',upcase(strip(translate(sentence,' ','090A0C0DA000FF'x))),' ')
,cat(' ',upcase(strip(translate(term,' ','090A0C0DA000FF'x))),' ')
)
Reeza
Super User

 If none of the posted solutions work, please post a small example of your terms and your data. Feel free to make fake data and show the expected output.

 

 

mkeintz
PROC Star

Are you saying you have 175 distinct values for column_a (i.e. 175 rows, each with a single word)?

 

Or does each observation have a very long character variable (column_a) that has 175 words in it?

 

And will you be using the same 175 words throughout all values of column_b?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
achapman
Fluorite | Level 6

 Sorry I didn't explain this very well. Yes, I have 175 distinct values (each a single word) that needs to be throughout all values of column_b.

Reeza
Super User

Here's a fictional example of how I do this. 

 

  • Create words in a data set  (terms)
  • Determine number of words in list (required to declare temporary array) (num_search_terms)
  • In data step, load words from terms into a temporary array which can then be looped over in the data step in each row. Temporary arrays are loaded to memory so this is fairly efficient.
  • Search through and flag for each row. This example just flags a 1 or 0 but you could flag which term was found or if multiple terms are found. 

 

https://gist.github.com/statgeek/2f733d27820f43fa37d6ba92c30f22cf

 

 

achapman
Fluorite | Level 6

This worked PERFECTLY.  Thank you so much, I had tried an array before but I must have had an error somewhere because it was not giving me the right output.  Thanks again!

Patrick
Opal | Level 21

What is the desired result? Just identify if your source data matches to at least one word in your list or some table that gives your per row of your source data all the matching words?

 

Tom
Super User Tom
Super User

@achapman wrote:

 Sorry I didn't explain this very well. Yes, I have 175 distinct values (each a single word) that needs to be throughout all values of column_b.


Sounds like you have two separate datasets.  The first one has 175 observations, each of which as a single word.

data terms;
  input term $20.;
cards;
The
Black
Cat
;

And a second dataset that has some other number of observations. 

data sentences;
   input sentence $80;
cards;
Our car is black.
She wears a hat.
The cat is black.
;

Perhaps by some strange coincidence that dataset also has 175 observations and that is what caused you to think they formed a single dataset?

 

if you want to match every term with every sentence then perhaps SQL is the way to go? Perhaps you only want to keep the terms that were found?

proc sql;
create table want as
select a.term
     , findw(b.sentence,a.term,' .!,','site') as wordnum
     , b.sentence
from terms a 
   , sentences b
where calculated wordnum
order by 3,2,1
;
quit;

Result

Obs    term     wordnum        sentence

 1     Black       4       Our car is black.
 2     The         1       The cat is black.
 3     Cat         2       The cat is black.
 4     Black       4       The cat is black.

 

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1467 views
  • 2 likes
  • 7 in conversation