- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Let's say I have the following table :
dog is red |
I love that red dog |
hot dog |
red cat |
that cat hates a red dog |
I'd like to have a count of occurences of every two word groups, something that would look like this :
two words group | count |
red+dog | 3 |
that+dog | 2 |
red+cat | 2 |
I have managed to cout occurences of each word, thanks to another thread, but I have no idea how to do this.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?
The approach below counts double, when the word appears twice. You can program around that, so we can come back to that if necessary.
data pairs;
set have;
if countw(string) > 1;
obsno = _n_;
do j=1 to countw(string) - 1;
do k=j+1 to countw(string);
word1 = scan(string, j);
word2 = scan(string, k);
if word1 > word2 then do;
temp = word2;
word2 = word1;
word1 = temp;
end;
output;
end;
end;
keep word1 word2 obsno;
run;
proc sort data=pairs nodupkey;
by obsno word1 word2;
run;
proc freq data=pairs;
tables word1 * word2 / list;
run;
*** EDITED to remove duplicate pairs within the same observation.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do you want to set those two word groups yourself or a list of all possible two word groups from the sentences above?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?
The approach below counts double, when the word appears twice. You can program around that, so we can come back to that if necessary.
data pairs;
set have;
if countw(string) > 1;
obsno = _n_;
do j=1 to countw(string) - 1;
do k=j+1 to countw(string);
word1 = scan(string, j);
word2 = scan(string, k);
if word1 > word2 then do;
temp = word2;
word2 = word1;
word1 = temp;
end;
output;
end;
end;
keep word1 word2 obsno;
run;
proc sort data=pairs nodupkey;
by obsno word1 word2;
run;
proc freq data=pairs;
tables word1 * word2 / list;
run;
*** EDITED to remove duplicate pairs within the same observation.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Astounding wrote:
One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?
Theoretically, I would not want both instances of the same word to be used. But as a matter of fact, there is no occurrences where the string repeats a word in my table, except for small linking words (like "of" for example). I just have to ignore all these small words with a quick data step. So your code works great for me ! Thank you.
For other users though, it might be useful to find a solution to this issue.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
if word1 > word2 then do;
temp = word2;
word2 = word1;
word1 = temp;
It works great, but could you please explain the meaning of this part of the code in the loop ?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
To understand the purpose of that DO group, consider this sentence:
it makes it difficult
Without that code, both of these pairs would be output:
it makes (the first and second words)
makes it (the second and third words)
Sorting with NODUPKEY would still fail to remove either of them. So that DO group switches the value of WORD1 and WORD2, and makes WORD1 the word that alphabetizes first (and WORD2 the word that alphabetizes second). So the result comes out as:
it makes
it makes
Then NODUPKEY finds the duplicates and removes one.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Are there specific words you're looking for or all two word combinations, not including duplicates? It seems the order doesn't matter, which is slightly different than an n-gram? Are you including words such as I, the is (articles and conjunctions)?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'll leave the articles and conjuctions part up to you.
Here's one approach. I split the text into individual words, and then use SQL to find all two word combinations.
Note that this method makes it easy to find any n-gram or join with other lookup tables for sentiment analysis type work.
*Create sample data;
data random_sentences;
infile cards truncover;
informat sentence $256.;
input sentence $256.;
cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;
;
;
;
*Partition into words;
data f1;
set random_sentences;
id=_n_;
nwords=countw(sentence);
nchar=length(compress(sentence));
do word_order=1 to nwords;
word=scan(sentence, word_order);
output;
end;
run;
proc sql;
create table words2 as
select t1.sentence, lowcase(t1.word) as word1, lowcase(t2.word) as word2
from f1 as t1
cross join f1 as t2
where t1.sentence=t2.sentence
and t1.word_order > t2.word_order
order by t1.sentence, t1.word_order;
quit;
proc freq data=words2 noprint order=freq;
table word1*word2 /list out=want;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
input string:$50.;
infile datalines dlm=',';
datalines;
dog is red
I love that red dog
hot dog
red cat
that cat hates a red dog
;
data help(keep=word);
set have;
nWords=countw(string);
do i=1 to nWords;
word=scan(string, i);
output;
end;
run;
proc sort data=help nodupkey;
by word;
run;
proc transpose data=help out=help_wide(drop=_NAME_) prefix=word;
var word;
run;
data help2(keep=word1 word2);
set help_wide;
array words{*} $ word1-word10;
ncomb=comb(dim(words), 2);
do i=1 to ncomb;
twoWords=allcomb(i, 2, of words[*]);
output;
end;
run;
proc sql;
create table want as
select a.*
,sum(case when (findw(strip(string), strip(word1)) &
findw(strip(string), strip(word2))) then 1
else 0 end) as Count
from help2 as a, have as b
group by word1, word2
having calculated count > 0
order by calculated Count desc;
run;