Hi,
Let's say I have the following table :
dog is red |
I love that red dog |
hot dog |
red cat |
that cat hates a red dog |
I'd like to have a count of occurences of every two word groups, something that would look like this :
two words group | count |
red+dog | 3 |
that+dog | 2 |
red+cat | 2 |
I have managed to cout occurences of each word, thanks to another thread, but I have no idea how to do this.
One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?
The approach below counts double, when the word appears twice. You can program around that, so we can come back to that if necessary.
data pairs;
set have;
if countw(string) > 1;
obsno = _n_;
do j=1 to countw(string) - 1;
do k=j+1 to countw(string);
word1 = scan(string, j);
word2 = scan(string, k);
if word1 > word2 then do;
temp = word2;
word2 = word1;
word1 = temp;
end;
output;
end;
end;
keep word1 word2 obsno;
run;
proc sort data=pairs nodupkey;
by obsno word1 word2;
run;
proc freq data=pairs;
tables word1 * word2 / list;
run;
*** EDITED to remove duplicate pairs within the same observation.
Do you want to set those two word groups yourself or a list of all possible two word groups from the sentences above?
One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?
The approach below counts double, when the word appears twice. You can program around that, so we can come back to that if necessary.
data pairs;
set have;
if countw(string) > 1;
obsno = _n_;
do j=1 to countw(string) - 1;
do k=j+1 to countw(string);
word1 = scan(string, j);
word2 = scan(string, k);
if word1 > word2 then do;
temp = word2;
word2 = word1;
word1 = temp;
end;
output;
end;
end;
keep word1 word2 obsno;
run;
proc sort data=pairs nodupkey;
by obsno word1 word2;
run;
proc freq data=pairs;
tables word1 * word2 / list;
run;
*** EDITED to remove duplicate pairs within the same observation.
Astounding wrote:
One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?
Theoretically, I would not want both instances of the same word to be used. But as a matter of fact, there is no occurrences where the string repeats a word in my table, except for small linking words (like "of" for example). I just have to ignore all these small words with a quick data step. So your code works great for me ! Thank you.
For other users though, it might be useful to find a solution to this issue.
if word1 > word2 then do;
temp = word2;
word2 = word1;
word1 = temp;
It works great, but could you please explain the meaning of this part of the code in the loop ?
To understand the purpose of that DO group, consider this sentence:
it makes it difficult
Without that code, both of these pairs would be output:
it makes (the first and second words)
makes it (the second and third words)
Sorting with NODUPKEY would still fail to remove either of them. So that DO group switches the value of WORD1 and WORD2, and makes WORD1 the word that alphabetizes first (and WORD2 the word that alphabetizes second). So the result comes out as:
it makes
it makes
Then NODUPKEY finds the duplicates and removes one.
Are there specific words you're looking for or all two word combinations, not including duplicates? It seems the order doesn't matter, which is slightly different than an n-gram? Are you including words such as I, the is (articles and conjunctions)?
I'll leave the articles and conjuctions part up to you.
Here's one approach. I split the text into individual words, and then use SQL to find all two word combinations.
Note that this method makes it easy to find any n-gram or join with other lookup tables for sentiment analysis type work.
*Create sample data;
data random_sentences;
infile cards truncover;
informat sentence $256.;
input sentence $256.;
cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;
;
;
;
*Partition into words;
data f1;
set random_sentences;
id=_n_;
nwords=countw(sentence);
nchar=length(compress(sentence));
do word_order=1 to nwords;
word=scan(sentence, word_order);
output;
end;
run;
proc sql;
create table words2 as
select t1.sentence, lowcase(t1.word) as word1, lowcase(t2.word) as word2
from f1 as t1
cross join f1 as t2
where t1.sentence=t2.sentence
and t1.word_order > t2.word_order
order by t1.sentence, t1.word_order;
quit;
proc freq data=words2 noprint order=freq;
table word1*word2 /list out=want;
run;
data have;
input string:$50.;
infile datalines dlm=',';
datalines;
dog is red
I love that red dog
hot dog
red cat
that cat hates a red dog
;
data help(keep=word);
set have;
nWords=countw(string);
do i=1 to nWords;
word=scan(string, i);
output;
end;
run;
proc sort data=help nodupkey;
by word;
run;
proc transpose data=help out=help_wide(drop=_NAME_) prefix=word;
var word;
run;
data help2(keep=word1 word2);
set help_wide;
array words{*} $ word1-word10;
ncomb=comb(dim(words), 2);
do i=1 to ncomb;
twoWords=allcomb(i, 2, of words[*]);
output;
end;
run;
proc sql;
create table want as
select a.*
,sum(case when (findw(strip(string), strip(word1)) &
findw(strip(string), strip(word2))) then 1
else 0 end) as Count
from help2 as a, have as b
group by word1, word2
having calculated count > 0
order by calculated Count desc;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.