Hello there!
I am struggling with a simple task: I have two tables. The first one is a word dictionary named ''DICTIONARY", containing a single field named "word." The second table is called TEXT_DATASET, which includes a single field named 'text,' where each row consists of text. Using SAS, how to add a field to the TEXT_DATASET
table, named 'dict_word_count,' which will store the count of words from the DICTIONARY table that exist in each 'text' field of the TEXT_DATASET table?
The following tables are samples:
Table DICTIONARY:
word |
apple |
banana |
orange |
grape |
Table TEXT_DATASET:
text |
This is an apple and a banana |
I love eating oranges. |
Grapes are delicious, especially red grapes. |
I have an apple and an orange for lunch. |
Resulting table TEXT_DATASET_WITH_COUNT:
text | dict_word_count |
This is an apple and a banana |
2 |
I love eating oranges. |
1 |
Grapes are delicious, especially red grapes. |
1 |
I have an apple and an orange for lunch. | 2 |
Any ideas?
Thank you!
You have a lot better chance at getting an answer if you show your sample data as datastep code, like this:
DATA dictionary;
input word $;
cards;
apple
banana
orange
grape
;run;
data TEXT_DATASET;
infile cards truncover;
input text $char200.;
cards;
This is an apple and a banana
I love eating oranges.
Grapes are delicious, especially red grapes.
I have an apple and an orange for lunch.
;run;
The simplest way to get the answer you want is this:
data want;
set text_dataset;
dict_word_count=0;
do _N_=1 to words;
set dictionary point=_N_ nobs=words;
dict_word_count+find(text,trim(word),'i',1)>0;
end;
drop word;
run;
- simply reading the "dictionary" in every iteration of the text dataset, and look for the text string.
There is a small problem with your example, however: you apparently want a positive outcome for both "grape" and "grapes" when the search word is "grape" - which is why I used the FIND function and not the FINDW function (which searches for words and not just text strings) - but I suppose that you do not want a positive outcome for words like "grapevine", "grapefruit", "grapejuice" or "gangrape" (interestingly, but totally irrelevantly, my spelling control accepts "grapevine", "grapefruit" and "gangrape", but not "grapejuice" - who wrote that?)
If all your words are nouns, that may or may not have an "s" at the end, one possible solution is to use PRX - a PRX expression like '/\bgrapes?\b/i' will look for the whole words "grape" and "grapes" but no other words containing "grape".
A solution for this can be written as
data want;
set text_dataset;
if _N_=1 then do _N_=1 to words;
array prxids (1:1000) 8 _temporary_;
set dictionary point=_N_ nobs=words;
prxids(_N_)=prxparse(cats('/\b',word,'s?\b/i'));
end;
dict_word_count=0;
do _N_=1 to words;
dict_word_count+prxmatch(prxids(_N_),text)>0;
end;
drop word;
run;
Another possibility is to redo your dictionary, with all relevant forms of the words you are looking for (if only nouns, then it is "apple" and "apples", "die" and "dice", "child" and "children" etc.) and then use the first solution suggested, but with the FINDW function instead of FIND.
Hi @s_lassen , thank you very much.
Your sample code performs as expected and functions well with a temporary ('work') library. However, I encounter an error (log attached) when attempting to change your code for use with CAS tables using "Public" Caslib, specifically the 'dictionary' table.
ERROR: The POINT= data set option is not valid for the data set PUBLIC.DICTIONARY, the data set must be randomly not sequentially accessible for POINT= processing.
Here the code slightly altered using CAS Tables:
DATA public.dictionary; input word $; cards; apple banana orange grape ;run; data public.TEXT_DATASET; infile cards truncover; input text $char200.; cards; This is an apple and a banana I love eating oranges. Grapes are delicious, especially red grapes. I have an apple and an orange for lunch. ;run; data public.want; set public.text_dataset; if _N_=1 then do _N_=1 to words; array prxids (1:1000) 8 _temporary_; set public.dictionary point=_N_ nobs=words; prxids(_N_)=prxparse(cats('/\b',word,'s?\b/i')); end; dict_word_count=0; do _N_=1 to words; dict_word_count+prxmatch(prxids(_N_),text)>0; end; drop word; run;
Is there any additional step required for use in with CAS?
PS. : thanks for the tip regarding sample data as datastep code.
@alisio_meneses CAS processes data in parallel and ideally spread over multiple worker nodes. Sequential data access doesn't work anymore (point=, first., last.) and we need to come-up with new tricks.
I haven't had an idea yet how to load this temporary array not using point=.
You could consider using a hash/hiter object. You then would need to loop over the full hiter because your words aren't always an exact match.
Have a look into this article.
...and I also came to realize that I don't really understand how hash tables work with multiple worker nodes. Will the hash table get loaded in full per worker or does using a hash table mean processing gets pushed to the controller node ...or something else?
Here some tested datastep code that works with tables in CAS
cas mysess cassessopts=(caslib="casuser");
libname casuser cas;
data casuser.dictionary;
input word $;
cards;
apple
banana
orange
grape
;
run;
data casuser.TEXT_DATASET;
infile cards truncover;
input text $char200.;
cards;
This is an apple and a banana
I love eating oranges.
Grapes are delicious, especially red grapes.
I have an apple and an orange for lunch.
;
run;
data _null_;
set casuser.dictionary end=last;
length def_regex $32767;
retain def_regex;
def_regex=cats(def_regex,"prxids[",_n_,"]=","prxparse(","'/\b",word,"s?\b/i'",");" );
if last then
do;
call symputx('def_regex',def_regex);
call symputx('n_words',_n_);
end;
run;
%put def_regex: %nrbquote(&def_regex);
data casuser.want;
set casuser.text_dataset;
array prxids {&n_words} 8 _temporary_ ;
&def_regex;
do _n_=1 to &n_words;
dict_word_count=sum(dict_word_count,prxmatch(prxids(_n_),text)>0);
end;
run;
proc print data=casuser.want;
run;
The data _null_ step still pulls the data from CAS to Compute. I haven't figured out yet how to keep all of the processing in CAS.
I'll post such an approach if I can figure it out - thinking about Proc CAS using array datatypes. I've never done that so thanks for sharing a problem that provides me with a use case to learn something new.
Below a coding option that doesn't move data to compute.
I wouldn't necessarily implement this way but to make it work was a very valuable learning lesson for me.
options msglevel=i;
cas mysess cassessopts=(caslib="casuser");
libname casuser cas;
data casuser.lookup;
input word $;
cards;
apple
banana
orange
grape
;
run;
data casuser.text_dataset;
infile cards truncover;
input text $char200.;
cards;
This is an apple and a banana
I love eating oranges.
Grapes are delicious, especially red grapes.
I have an apple and an orange for lunch.
Pineapples are delicious
;
run;
proc cas;
session mysess;
action table.fetch result=tbl /
table={name="lookup" vars={{name="word"}} }
;
wordarr=getcolumn(findtable(tbl),"word");
prx_def=' ';
n_terms=dim(wordarr);
do i=1 to n_terms;
prx_def=cats(prx_def,cats("prxids[",i,"]=","prxparse(","'/\b",wordarr[i],"s?\b/i'",");" ));
end;
action datastep.runCode /
code=
'data casuser.want;' ||
' set casuser.text_dataset;' ||
' array prxids {'|| n_terms ||'} 8 _temporary_ ;' ||
prx_def ||
' do _n_=1 to '|| n_terms ||';' ||
' dict_word_count=sum(dict_word_count,prxmatch(prxids(_n_),text)>0);' ||
' end;' ||
'run;'
;
quit;
proc print data=casuser.want;
run;
cas mysess terminate;
I put the POINT= in the second (PRX) example out of "habit" (copying some code from the first example), it is not necessary here, so you could probably get away with something like
data public.want;
set public.text_dataset;
if _N_=1 then do _N_=1 to words;
array prxids (1:1000) 8 _temporary_;
set public.dictionary nobs=words;
prxids(_N_)=prxparse(cats('/\b',word,'s?\b/i'));
end;
dict_word_count=0;
do _N_=1 to words;
dict_word_count+prxmatch(prxids(_N_),text)>0;
end;
drop word;
run;
@s_lassen To my surprise the code you've posted using the nobs dataset option works with CAS tables as source and target BUT it leads to processing under compute meaning: data transfer CAS to compute, single threaded execution under compute, data transfer compute to CAS. If you're working with bigger data volumes then I'd expect a hefty impact on performance for such a process.
Running your code here what the SAS log tells me
NOTE: The NOBS option on the SET statement is not supported with DATA step in Cloud Analytic Services. NOTE: Could not execute DATA step code in Cloud Analytic Services. Running DATA step in the SAS client.
Here what SAS tells me with the code I've posted earlier where the data want step executes fully in CAS.
118 data casuser.want; 119 set casuser.text_dataset; 120 array prxids {&n_words} 8 _temporary_ ; 121 &def_regex; 122 do _n_=1 to &n_words; 123 dict_word_count=sum(dict_word_count,prxmatch(prxids(_n_),text)>0); 124 end; 125 run; NOTE: Running DATA step in Cloud Analytic Services. NOTE: The DATA step will run in multiple threads.
What do you expect as count from a text like "Pineapples are delicious" ?
Assuming 'pineapple' is in the table 'dictionary', I would expect a count of 1
All the solutions present so far loop through the dictionary looking for matches of the dictionary term (or its plural) in text. This is a good way to avoid double counting.
But what if the dictionary is large, say 20,000 terms, while each text line has maybe an average of ten words? Loop through 20,000 items, whether through FINDW, or via regular-expressions, can be very expensive, given that only an extremely small share of those dictionary items will be found in the text.
A more efficient alternative in this case would be to loop through all the words in the text, checking (via a hash object) whether they are in the dictionary. But unlike looping through the dictionary, a little extra effort is needed to avoid double counting. It involves:
The below is a working example, and would be much faster with large dictionaries.
DATA dictionary;
length word $10;
input word @@;
cards;
apple banana orange grape
run;
data TEXT_DATASET;
infile cards truncover;
input text $char200.;
cards;
This is an apple and a banana
I love eating oranges.
Grapes are delicious, especially red grapes.
I have an apple and an orange for lunch.
run;
/*Add variations to the dictionary words*/
data expanded_dictionary (keep=_word1 _word2) / view=expanded_dictionary;
length _word1 _word2 $11;
set dictionary;
_word2=word;
_word1=cats(word,'s');
output;
call sortc(_word1,_word2); *Make _word1 the singular and _word2 have the trailing "s"**;
output;
run;
data want (drop=_:);
set text_dataset expanded_dictionary (obs=0);
if _n_=1 then do;
declare hash h (dataset:'expanded_dictionary');
h.definekey('_word1');
h.definedata(all:'Y');
h.definedone();
end;
_text=lowcase(text);
dict_word_count=0;
do _w=1 to countw(_text); **For every word in _TEXT ... **;
_word1=scan(_text,_w); **Extract it **;
if h.find()=0 then do; **If it's in the dictionary then ... **;
dict_word_count=dict_word_count+1;
do _word=_word1,_word2; **For every variation of the base word ... **;
do _f= findw(_text,trim(_word)) by 0 while(_f>0);
substr(_text,_f,1)='!'; **Replace leading character with a '!' **;
_f= findw(_text,trim(_word));
end;
end;
end;
end;
run;
In expanded_dictionary, the length of _word1 and _word2 is one longer than the length of word, to support appending a trailing "s". One could make much fancier dictionaries to allow for more than the simple trailing "s" variation used here.
The "substr()" function can be put on the left side of an assignment statement (as in substr(_text,_f,1)='!' above). This changes the leading character of every instance of each dictionary word (or variation) in _text to a '!' - yielding words like "!anana" - which will not be found in the dictionary.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.