Solved: Re: count occurences of multiple words

alex_philby · Posted 12-20-2017 09:43 AM

Hi,

Let's say I have the following table :

dog is red

I love that red dog

hot dog

red cat

that cat hates a red dog

I'd like to have a count of occurences of every two word groups, something that would look like this :

two words group	count
red+dog	3
that+dog	2
red+cat	2

I have managed to cout occurences of each word, thanks to another thread, but I have no idea how to do this.

Astounding · Posted 12-20-2017 10:04 AM

One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?

The approach below counts double, when the word appears twice. You can program around that, so we can come back to that if necessary.

data pairs;

set have;

if countw(string) > 1;

obsno = _n_;

do j=1 to countw(string) - 1;

do k=j+1 to countw(string);

word1 = scan(string, j);

word2 = scan(string, k);

if word1 > word2 then do;

temp = word2;

word2 = word1;

word1 = temp;

end;

output;

end;

keep word1 word2 obsno;

run;

proc sort data=pairs nodupkey;

by obsno word1 word2;

run;

proc freq data=pairs;

tables word1 * word2 / list;

run;

*** EDITED to remove duplicate pairs within the same observation.

View solution in original post

PeterClemmensen · Posted 12-20-2017 09:59 AM

Do you want to set those two word groups yourself or a list of all possible two word groups from the sentences above?

The DATA to DATA Step Macro
Blog: SASnrd

alex_philby · Posted 12-20-2017 10:45 AM

I want a list of every possible two words groups.

Astounding · Posted 12-20-2017 10:04 AM

One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?

The approach below counts double, when the word appears twice. You can program around that, so we can come back to that if necessary.

data pairs;

set have;

if countw(string) > 1;

obsno = _n_;

do j=1 to countw(string) - 1;

do k=j+1 to countw(string);

word1 = scan(string, j);

word2 = scan(string, k);

if word1 > word2 then do;

temp = word2;

word2 = word1;

word1 = temp;

end;

output;

end;

keep word1 word2 obsno;

run;

proc sort data=pairs nodupkey;

by obsno word1 word2;

run;

proc freq data=pairs;

tables word1 * word2 / list;

run;

*** EDITED to remove duplicate pairs within the same observation.

alex_philby · Posted 12-20-2017 10:52 AM

Astounding wrote:
One issue to consider: what if the incoming string repeats a word? Should both instances of the same word be used, doubling the number of pairs?

Theoretically, I would not want both instances of the same word to be used. But as a matter of fact, there is no occurrences where the string repeats a word in my table, except for small linking words (like "of" for example). I just have to ignore all these small words with a quick data step. So your code works great for me ! Thank you.

For other users though, it might be useful to find a solution to this issue.

alex_philby · Posted 12-21-2017 05:14 AM

      if word1 > word2 then do;

         temp = word2;

         word2 = word1;

         word1 = temp;

It works great, but could you please explain the meaning of this part of the code in the loop ?

Astounding · Posted 12-21-2017 09:19 AM

To understand the purpose of that DO group, consider this sentence:

it makes it difficult

Without that code, both of these pairs would be output:

it makes (the first and second words)

makes it (the second and third words)

Sorting with NODUPKEY would still fail to remove either of them. So that DO group switches the value of WORD1 and WORD2, and makes WORD1 the word that alphabetizes first (and WORD2 the word that alphabetizes second). So the result comes out as:

it makes

Then NODUPKEY finds the duplicates and removes one.

Reeza · Posted 12-20-2017 10:45 AM

Are there specific words you're looking for or all two word combinations, not including duplicates? It seems the order doesn't matter, which is slightly different than an n-gram? Are you including words such as I, the is (articles and conjunctions)?

alex_philby · Posted 12-20-2017 10:55 AM

Yes, all two words combinations. Not including duplicates and no articles and conjunctions would be better, as stated above.

Reeza · Posted 12-20-2017 11:13 AM

I'll leave the articles and conjuctions part up to you.

Here's one approach. I split the text into individual words, and then use SQL to find all two word combinations.
Note that this method makes it easy to find any n-gram or join with other lookup tables for sentiment analysis type work.

*Create sample data;
data random_sentences;
	infile cards truncover;
	informat sentence $256.;
	input sentence $256.;
	cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;

;
;
;
*Partition into words;
data f1;
	set random_sentences;
	id=_n_;
	nwords=countw(sentence);
	nchar=length(compress(sentence));

	do word_order=1 to nwords;
		word=scan(sentence, word_order);
		output;
	end;
run;

proc sql;
	create table words2 as
		select t1.sentence, lowcase(t1.word) as word1, lowcase(t2.word) as word2
			from f1 as t1
				cross join f1 as t2
			where t1.sentence=t2.sentence 
				and t1.word_order > t2.word_order
			order by t1.sentence, t1.word_order;
quit;

proc freq data=words2 noprint order=freq;
	table word1*word2 /list out=want;
run;

PeterClemmensen · Posted 12-21-2017 05:44 AM

data have;
input string:$50.;
infile datalines dlm=',';
datalines;
dog is red
I love that red dog
hot dog
red cat
that cat hates a red dog
;

data help(keep=word);
	set have;
	nWords=countw(string);
	do i=1 to nWords;
		word=scan(string, i);
		output;
	end;
run;

proc sort data=help nodupkey;
	by word;
run;

proc transpose data=help out=help_wide(drop=_NAME_) prefix=word;
	var word;
run;

data help2(keep=word1 word2);
	set help_wide;
	array words{*} $ word1-word10;
	ncomb=comb(dim(words), 2);
	do i=1 to ncomb;
		twoWords=allcomb(i, 2, of words[*]);
		output;
	end;
run;

proc sql;
	create table want as
	select a.*
		  ,sum(case when (findw(strip(string), strip(word1)) & 
					      findw(strip(string), strip(word2))) then 1
		   else 0 end) as Count
	from help2 as a, have as b
	group by word1, word2
	having calculated count > 0
	order by calculated Count desc;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away