Solved: Re: Counting Dictionary Matches in Text Data

alisio_meneses · Posted 11-20-2023 10:24 AM

Hello there!

I am struggling with a simple task: I have two tables. The first one is a word dictionary named ''DICTIONARY", containing a single field named "word." The second table is called TEXT_DATASET, which includes a single field named 'text,' where each row consists of text. Using SAS, how to add a field to the TEXT_DATASET

table, named 'dict_word_count,' which will store the count of words from the DICTIONARY table that exist in each 'text' field of the TEXT_DATASET table?

The following tables are samples:

Table DICTIONARY:

word

apple

banana

orange

grape

Table TEXT_DATASET:

text

This is an apple and a banana

I love eating oranges.

Grapes are delicious, especially red grapes.

I have an apple and an orange for lunch.

Resulting table TEXT_DATASET_WITH_COUNT:

text	dict_word_count
This is an apple and a banana	2
I love eating oranges.	1
Grapes are delicious, especially red grapes.	1
I have an apple and an orange for lunch.	2

Any ideas?

Thank you!

alisio_meneses · Posted 12-15-2023 06:36 AM

I've managed to accomplish the task using PROC TEXTMINE. Apart from completing the specified task, it enables the use of other features such as stemming, entity detection, and so on.

The following code does the trick and, aside from the sample tables code, it runs in CAS.

Thank you all for helping!

Sample tables:

/* dictionary words*/
DATA mycas.dictionary;
   length word varchar(*);
   input word;
   cards;
   apple
   banana
   orange
   grape
   ;
run;

/* text and associated IDs */
data mycas.TEXT_DATASET;
   infile datalines delimiter='|' missover;
   length text varchar(*);
   input text$ id;
   datalines;
   This is an apple and a banana|1
   I love eating an oranges for lunch|2
   Grapes are delicious, especially red grapes|3
   I have an apple and an orange for lunch|4
   ;
run;

And the word count script.

/*************************************************************
* Count dictionary matches in text
* Author: Antonio Alisio de Meneses Cordeiro
* Date: 10/12/2023
* Description: This SAS program performs text mining on a dataset
*              containing textual information. It includes the
*              creation of a dictionary, parsing of text data,
*              and counting the frequency of dictionary words
*              in each document.
*************************************************************/

/* Connect to CAS session */
cas mysession;

/* Assign a library reference to the CAS session */
libname mycas cas sessref=mysession;

/* Perform text mining on the 'mycas.TEXT_DATASET' table */
proc textmine data=mycas.TEXT_DATASET;
   doc_id id;
   var text;
   parse
      notagging nonoungroups
      termwgt        = none
      cellwgt        = none
      reducef        = 1
      entities       = none
      outpos          = mycas.outpos
      ;
run;

/* Create a new table store parsed text information */
data mycas.outpos/sessref=mysession;
   set mycas.outpos;
   format Term;
   format Parent;
   format DOCUMENT;
run;

/* Filter out only the words in the dictionary from the parsed text */
data mycas.outpos_filtered;
   merge mycas.dictionary(IN=A)
         mycas.outpos(IN=B rename=(Parent=word));
   by word;
   if B AND A;
run;

/* Count word frequency by document in the filtered output */
data mycas.WORD_BY_DOC_COUNT/sessref=mysession;
   set mycas.outpos_filtered;
   by DOCUMENT;
   if first.DOCUMENT then dict_word_count = 0;
   dict_word_count + 1;
   if last.DOCUMENT then output;
   keep document dict_word_count;
   rename document = id;
run;

/* Merge word frequency information back to the original text dataset */
data mycas.TEXT_DATASET_WITH_COUNT/sessref=mysession;
   merge mycas.TEXT_DATASET(IN=A)
         mycas.WORD_BY_DOC_COUNT(IN=B);	
   by id;
   if A;
   if dict_word_count = . then dict_word_count = 0;
   keep text dict_word_count;
run;

View solution in original post

s_lassen · Posted 11-20-2023 12:12 PM

You have a lot better chance at getting an answer if you show your sample data as datastep code, like this:

DATA dictionary;
input word $;
cards;
apple
banana
orange
grape
;run;

data TEXT_DATASET;
  infile cards truncover;
  input text $char200.;
cards;
This is an apple and a banana
I love eating oranges.
Grapes are delicious, especially red grapes.
I have an apple and an orange for lunch.
;run;

The simplest way to get the answer you want is this:

data want;
  set text_dataset;
  dict_word_count=0;
  do _N_=1 to words;
    set dictionary point=_N_ nobs=words;
    dict_word_count+find(text,trim(word),'i',1)>0;
    end;
  drop word;
run;

- simply reading the "dictionary" in every iteration of the text dataset, and look for the text string.

There is a small problem with your example, however: you apparently want a positive outcome for both "grape" and "grapes" when the search word is "grape" - which is why I used the FIND function and not the FINDW function (which searches for words and not just text strings) - but I suppose that you do not want a positive outcome for words like "grapevine", "grapefruit", "grapejuice" or "gangrape" (interestingly, but totally irrelevantly, my spelling control accepts "grapevine", "grapefruit" and "gangrape", but not "grapejuice" - who wrote that?)

If all your words are nouns, that may or may not have an "s" at the end, one possible solution is to use PRX - a PRX expression like '/\bgrapes?\b/i' will look for the whole words "grape" and "grapes" but no other words containing "grape".

A solution for this can be written as

data want;
  set text_dataset;
  if _N_=1 then do _N_=1 to words;
    array prxids (1:1000) 8 _temporary_;
    set dictionary point=_N_ nobs=words;
    prxids(_N_)=prxparse(cats('/\b',word,'s?\b/i'));
    end;
  dict_word_count=0;
  do _N_=1 to words;
    dict_word_count+prxmatch(prxids(_N_),text)>0;
    end;
  drop word;
run;

Another possibility is to redo your dictionary, with all relevant forms of the words you are looking for (if only nouns, then it is "apple" and "apples", "die" and "dice", "child" and "children" etc.) and then use the first solution suggested, but with the FINDW function instead of FIND.

alisio_meneses · Posted 11-20-2023 03:38 PM

Hi @s_lassen , thank you very much.

Your sample code performs as expected and functions well with a temporary ('work') library. However, I encounter an error (log attached) when attempting to change your code for use with CAS tables using "Public" Caslib, specifically the 'dictionary' table.

ERROR: The POINT= data set option is not valid for the data set PUBLIC.DICTIONARY, the data set must be randomly not sequentially
accessible for POINT= processing.

Here the code slightly altered using CAS Tables:

DATA public.dictionary;
input word $;
cards;
apple
banana
orange
grape
;run;

data public.TEXT_DATASET;
  infile cards truncover;
  input text $char200.;
cards;
This is an apple and a banana
I love eating oranges.
Grapes are delicious, especially red grapes.
I have an apple and an orange for lunch.
;run;

data public.want;
  set public.text_dataset;
  if _N_=1 then do _N_=1 to words;
    array prxids (1:1000) 8 _temporary_;
    set public.dictionary point=_N_ nobs=words;
    prxids(_N_)=prxparse(cats('/\b',word,'s?\b/i'));
    end;
  dict_word_count=0;
  do _N_=1 to words;
    dict_word_count+prxmatch(prxids(_N_),text)>0;
    end;
  drop word;
run;

Is there any additional step required for use in with CAS?

PS. : thanks for the tip regarding sample data as datastep code.

Patrick · Posted 11-20-2023 04:28 PM

@alisio_meneses CAS processes data in parallel and ideally spread over multiple worker nodes. Sequential data access doesn't work anymore (point=, first., last.) and we need to come-up with new tricks.

I haven't had an idea yet how to load this temporary array not using point=.

You could consider using a hash/hiter object. You then would need to loop over the full hiter because your words aren't always an exact match.

Have a look into this article.

...and I also came to realize that I don't really understand how hash tables work with multiple worker nodes. Will the hash table get loaded in full per worker or does using a hash table mean processing gets pushed to the controller node ...or something else?

Patrick · Posted 11-20-2023 08:41 PM

Here some tested datastep code that works with tables in CAS

cas mysess cassessopts=(caslib="casuser");

libname casuser cas;

data casuser.dictionary; 
 input word $; 
 cards; 
apple 
banana 
orange 
grape 
; 
run; 

data casuser.TEXT_DATASET; 
 infile cards truncover; 
 input text $char200.; 
 cards; 
This is an apple and a banana 
I love eating oranges. 
Grapes are delicious, especially red grapes. 
I have an apple and an orange for lunch. 
; 
run; 

data _null_;
  set casuser.dictionary end=last;
  length def_regex $32767;
  retain def_regex;
  def_regex=cats(def_regex,"prxids[",_n_,"]=","prxparse(","'/\b",word,"s?\b/i'",");" );
  if last then 
    do;
      call symputx('def_regex',def_regex);
      call symputx('n_words',_n_);
    end;
run;
%put def_regex: %nrbquote(&def_regex);

data casuser.want;
  set casuser.text_dataset;
  array prxids {&n_words} 8 _temporary_ ;
  &def_regex;
  do _n_=1 to &n_words;
    dict_word_count=sum(dict_word_count,prxmatch(prxids(_n_),text)>0);
  end;
run;

proc print data=casuser.want;
run;

The data _null_ step still pulls the data from CAS to Compute. I haven't figured out yet how to keep all of the processing in CAS.

I'll post such an approach if I can figure it out - thinking about Proc CAS using array datatypes. I've never done that so thanks for sharing a problem that provides me with a use case to learn something new.

Patrick · Posted 11-25-2023 03:14 AM

@alisio_meneses

Below a coding option that doesn't move data to compute.

I wouldn't necessarily implement this way but to make it work was a very valuable learning lesson for me.

options msglevel=i;
cas mysess cassessopts=(caslib="casuser");

libname casuser cas;

data casuser.lookup; 
 input word $; 
 cards; 
apple 
banana 
orange 
grape 
; 
run; 

data casuser.text_dataset; 
 infile cards truncover; 
 input text $char200.; 
 cards; 
This is an apple and a banana 
I love eating oranges. 
Grapes are delicious, especially red grapes. 
I have an apple and an orange for lunch. 
Pineapples are delicious
; 
run; 

proc cas;
  session mysess;
  action table.fetch result=tbl / 
    table={name="lookup" vars={{name="word"}} } 
    ;

  wordarr=getcolumn(findtable(tbl),"word");
  
  prx_def=' ';
  n_terms=dim(wordarr);
  do i=1 to n_terms;
    prx_def=cats(prx_def,cats("prxids[",i,"]=","prxparse(","'/\b",wordarr[i],"s?\b/i'",");" ));
  end;

  action datastep.runCode /
    code=
      'data casuser.want;' ||
      '  set casuser.text_dataset;' ||
      '  array prxids {'|| n_terms ||'} 8 _temporary_ ;' ||
      prx_def ||
      '  do _n_=1 to '|| n_terms ||';' ||
      '    dict_word_count=sum(dict_word_count,prxmatch(prxids(_n_),text)>0);' ||
      '  end;' ||
      'run;' 
      ;

quit;

proc print data=casuser.want;
run;

cas mysess terminate;

s_lassen · Posted 11-20-2023 11:47 PM

I put the POINT= in the second (PRX) example out of "habit" (copying some code from the first example), it is not necessary here, so you could probably get away with something like

data public.want;
  set public.text_dataset;
  if _N_=1 then do _N_=1 to words;
    array prxids (1:1000) 8 _temporary_;
    set public.dictionary  nobs=words;
    prxids(_N_)=prxparse(cats('/\b',word,'s?\b/i'));
    end;
  dict_word_count=0;
  do _N_=1 to words;
    dict_word_count+prxmatch(prxids(_N_),text)>0;
    end;
  drop word;
run;

Patrick · Posted 11-21-2023 12:51 AM

@s_lassen To my surprise the code you've posted using the nobs dataset option works with CAS tables as source and target BUT it leads to processing under compute meaning: data transfer CAS to compute, single threaded execution under compute, data transfer compute to CAS. If you're working with bigger data volumes then I'd expect a hefty impact on performance for such a process.

Running your code here what the SAS log tells me

NOTE: The NOBS option on the SET statement is not supported with DATA step in Cloud Analytic Services.
NOTE: Could not execute DATA step code in Cloud Analytic Services. Running DATA step in the SAS client.

Here what SAS tells me with the code I've posted earlier where the data want step executes fully in CAS.

118  data casuser.want;
119    set casuser.text_dataset;
120    array prxids {&n_words} 8 _temporary_ ;
121    &def_regex;
122    do _n_=1 to &n_words;
123      dict_word_count=sum(dict_word_count,prxmatch(prxids(_n_),text)>0);
124    end;
125  run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

Ksharp · Posted 11-21-2023 04:47 AM

I think you should use findw() instead of find().Otherwise , you would get wrong number in some scenario.

andreas_lds · Posted 11-21-2023 02:09 AM

What do you expect as count from a text like "Pineapples are delicious" ?

alisio_meneses · Posted 11-26-2023 07:23 AM

Assuming 'pineapple' is in the table 'dictionary', I would expect a count of 1

mkeintz · Posted 11-26-2023 12:21 AM

All the solutions present so far loop through the dictionary looking for matches of the dictionary term (or its plural) in text. This is a good way to avoid double counting.

But what if the dictionary is large, say 20,000 terms, while each text line has maybe an average of ten words? Loop through 20,000 items, whether through FINDW, or via regular-expressions, can be very expensive, given that only an extremely small share of those dictionary items will be found in the text.

A more efficient alternative in this case would be to loop through all the words in the text, checking (via a hash object) whether they are in the dictionary. But unlike looping through the dictionary, a little extra effort is needed to avoid double counting. It involves:

expanding the dictionary (two columns instead of one in my example below, and twice as many rows)
and
if a word or word-variation is found, modify the text so that subsequent entries of the same word (or variations thereof) do not precipitate an addition to dict_word_count.

The below is a working example, and would be much faster with large dictionaries.

DATA dictionary;
length word $10;
 input word   @@;
cards;
apple   banana   orange   grape
run;

data TEXT_DATASET;
  infile cards truncover;
  input text $char200.;
cards;
This is an apple and a banana
I love eating oranges.
Grapes are delicious, especially red grapes.
I have an apple and an orange for lunch.
run;

/*Add variations to the dictionary words*/
data expanded_dictionary (keep=_word1 _word2) / view=expanded_dictionary;
  length _word1 _word2 $11;
  set dictionary;
  _word2=word;
  _word1=cats(word,'s');
  output;
  call sortc(_word1,_word2);  *Make _word1 the singular and _word2 have the trailing "s"**;
  output;
run;


data want (drop=_:);
  set text_dataset  expanded_dictionary (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'expanded_dictionary');
      h.definekey('_word1');
      h.definedata(all:'Y');
      h.definedone();
  end;
  _text=lowcase(text);
  dict_word_count=0;

  do _w=1 to countw(_text);         **For every word in _TEXT ... **;
    _word1=scan(_text,_w);          **Extract it **;
    if h.find()=0 then do;          **If it's in the dictionary then ... **;
      dict_word_count=dict_word_count+1;
      do _word=_word1,_word2;           **For every variation of the base word ...  **;
        do _f= findw(_text,trim(_word)) by 0 while(_f>0);
          substr(_text,_f,1)='!';       **Replace leading character with a '!' **;
          _f= findw(_text,trim(_word));
        end;
      end;
    end;
  end;
run;

In expanded_dictionary, the length of _word1 and _word2 is one longer than the length of word, to support appending a trailing "s". One could make much fancier dictionaries to allow for more than the simple trailing "s" variation used here.

The "substr()" function can be put on the left side of an assignment statement (as in substr(_text,_f,1)='!' above). This changes the leading character of every instance of each dictionary word (or variation) in _text to a '!' - yielding words like "!anana" - which will not be found in the dictionary.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

alisio_meneses · Posted 12-15-2023 06:36 AM

I've managed to accomplish the task using PROC TEXTMINE. Apart from completing the specified task, it enables the use of other features such as stemming, entity detection, and so on.

The following code does the trick and, aside from the sample tables code, it runs in CAS.

Thank you all for helping!

Sample tables:

/* dictionary words*/
DATA mycas.dictionary;
   length word varchar(*);
   input word;
   cards;
   apple
   banana
   orange
   grape
   ;
run;

/* text and associated IDs */
data mycas.TEXT_DATASET;
   infile datalines delimiter='|' missover;
   length text varchar(*);
   input text$ id;
   datalines;
   This is an apple and a banana|1
   I love eating an oranges for lunch|2
   Grapes are delicious, especially red grapes|3
   I have an apple and an orange for lunch|4
   ;
run;

And the word count script.

/*************************************************************
* Count dictionary matches in text
* Author: Antonio Alisio de Meneses Cordeiro
* Date: 10/12/2023
* Description: This SAS program performs text mining on a dataset
*              containing textual information. It includes the
*              creation of a dictionary, parsing of text data,
*              and counting the frequency of dictionary words
*              in each document.
*************************************************************/

/* Connect to CAS session */
cas mysession;

/* Assign a library reference to the CAS session */
libname mycas cas sessref=mysession;

/* Perform text mining on the 'mycas.TEXT_DATASET' table */
proc textmine data=mycas.TEXT_DATASET;
   doc_id id;
   var text;
   parse
      notagging nonoungroups
      termwgt        = none
      cellwgt        = none
      reducef        = 1
      entities       = none
      outpos          = mycas.outpos
      ;
run;

/* Create a new table store parsed text information */
data mycas.outpos/sessref=mysession;
   set mycas.outpos;
   format Term;
   format Parent;
   format DOCUMENT;
run;

/* Filter out only the words in the dictionary from the parsed text */
data mycas.outpos_filtered;
   merge mycas.dictionary(IN=A)
         mycas.outpos(IN=B rename=(Parent=word));
   by word;
   if B AND A;
run;

/* Count word frequency by document in the filtered output */
data mycas.WORD_BY_DOC_COUNT/sessref=mysession;
   set mycas.outpos_filtered;
   by DOCUMENT;
   if first.DOCUMENT then dict_word_count = 0;
   dict_word_count + 1;
   if last.DOCUMENT then output;
   keep document dict_word_count;
   rename document = id;
run;

/* Merge word frequency information back to the original text dataset */
data mycas.TEXT_DATASET_WITH_COUNT/sessref=mysession;
   merge mycas.TEXT_DATASET(IN=A)
         mycas.WORD_BY_DOC_COUNT(IN=B);	
   by id;
   if A;
   if dict_word_count = . then dict_word_count = 0;
   keep text dict_word_count;
run;

Classroom Training Available!