DATA Step, Macro, Functions and more

count occurences of multiple words

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 12
Accepted Solution

count occurences of multiple words

Hi,

 

Let's say I have the following table :

 

dog is red
I love that red dog
hot dog
red cat
that cat hates a red dog

 

I'd like to have a count of occurences of every two word groups, something that would look like this :

 

two words groupcount
red+dog3
that+dog2
red+cat2

 

I have managed to cout occurences of each word, thanks to another thread, but I have no idea how to do this.


Accepted Solutions
Solution
‎12-21-2017 04:38 AM
Super User
Posts: 6,629

Re: count occurences of multiple words

[ Edited ]
Posted in reply to alex_philby

One issue to consider:  what if the incoming string repeats a word?  Should both instances of the same word be used, doubling the number of pairs? 

 

The approach below counts double, when the word appears twice.  You can program around that, so we can come back to that if necessary.

 

data pairs;

set have;

if countw(string) > 1;

obsno = _n_;

do j=1 to countw(string) - 1;

   do k=j+1 to countw(string);

      word1 = scan(string, j);

      word2 = scan(string, k);

      if word1 > word2 then do;

         temp = word2;

         word2 = word1;

         word1 = temp;

      end;

      output;

   end;

end;

keep word1 word2 obsno;

run;

 

proc sort data=pairs nodupkey;

by obsno word1 word2;

run;

 

proc freq data=pairs;

tables word1 * word2 / list;

run;

 

*** EDITED to remove duplicate pairs within the same observation.

View solution in original post


All Replies
PROC Star
Posts: 1,209

Re: count occurences of multiple words

Posted in reply to alex_philby

Do you want to set those two word groups yourself or a list of all possible two word groups from the sentences above?

Occasional Contributor
Posts: 12

Re: count occurences of multiple words

I want a list of every possible two words groups.
Solution
‎12-21-2017 04:38 AM
Super User
Posts: 6,629

Re: count occurences of multiple words

[ Edited ]
Posted in reply to alex_philby

One issue to consider:  what if the incoming string repeats a word?  Should both instances of the same word be used, doubling the number of pairs? 

 

The approach below counts double, when the word appears twice.  You can program around that, so we can come back to that if necessary.

 

data pairs;

set have;

if countw(string) > 1;

obsno = _n_;

do j=1 to countw(string) - 1;

   do k=j+1 to countw(string);

      word1 = scan(string, j);

      word2 = scan(string, k);

      if word1 > word2 then do;

         temp = word2;

         word2 = word1;

         word1 = temp;

      end;

      output;

   end;

end;

keep word1 word2 obsno;

run;

 

proc sort data=pairs nodupkey;

by obsno word1 word2;

run;

 

proc freq data=pairs;

tables word1 * word2 / list;

run;

 

*** EDITED to remove duplicate pairs within the same observation.

Occasional Contributor
Posts: 12

Re: count occurences of multiple words

[ Edited ]
Posted in reply to Astounding

Astounding wrote:

One issue to consider:  what if the incoming string repeats a word?  Should both instances of the same word be used, doubling the number of pairs?

 

Theoretically, I would not want both instances of the same word to be used. But as a matter of fact, there is no occurrences where the string repeats a word in my table, except for small linking words (like "of" for example). I just have to ignore all these small words with a quick data step. So your code works great for me ! Thank you.

For other users though, it might be useful to find a solution to this issue.

Occasional Contributor
Posts: 12

Re: count occurences of multiple words

Posted in reply to Astounding


      if word1 > word2 then do;

         temp = word2;

         word2 = word1;

         word1 = temp;

 

It works great, but could you please explain the meaning of this part of the code in the loop ?

Super User
Posts: 6,629

Re: count occurences of multiple words

Posted in reply to alex_philby

To understand the purpose of that DO group, consider this sentence:

 

it makes it difficult

 

Without that code, both of these pairs would be output:

 

it makes (the first and second words)

makes it (the second and third words)

 

Sorting with NODUPKEY would still fail to remove either of them.  So that DO group switches the value of WORD1 and WORD2, and makes WORD1 the word that alphabetizes first (and WORD2 the word that alphabetizes second).  So the result comes out as:

 

it makes

it makes

 

Then NODUPKEY finds the duplicates and removes one.

Super User
Posts: 23,262

Re: count occurences of multiple words

Posted in reply to alex_philby

Are there specific words you're looking for or all two word combinations, not including duplicates? It seems the order doesn't matter, which is slightly different than an n-gram? Are you including words such as I, the is (articles and conjunctions)?

 

Occasional Contributor
Posts: 12

Re: count occurences of multiple words

Yes, all two words combinations. Not including duplicates and no articles and conjunctions would be better, as stated above.
Super User
Posts: 23,262

Re: count occurences of multiple words

Posted in reply to alex_philby

I'll leave the articles and conjuctions part up to you.

 

Here's one approach. I split the text into individual words, and then use SQL to find all two word combinations. 
Note that this method makes it easy to find any n-gram or join with other lookup tables for sentiment analysis type work. 

 

*Create sample data;
data random_sentences;
	infile cards truncover;
	informat sentence $256.;
	input sentence $256.;
	cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;

;
;
;
*Partition into words;
data f1;
	set random_sentences;
	id=_n_;
	nwords=countw(sentence);
	nchar=length(compress(sentence));

	do word_order=1 to nwords;
		word=scan(sentence, word_order);
		output;
	end;
run;

proc sql;
	create table words2 as
		select t1.sentence, lowcase(t1.word) as word1, lowcase(t2.word) as word2
			from f1 as t1
				cross join f1 as t2
			where t1.sentence=t2.sentence 
				and t1.word_order > t2.word_order
			order by t1.sentence, t1.word_order;
quit;

proc freq data=words2 noprint order=freq;
	table word1*word2 /list out=want;
run;
PROC Star
Posts: 1,209

Re: count occurences of multiple words

[ Edited ]
Posted in reply to alex_philby
data have;
input string:$50.;
infile datalines dlm=',';
datalines;
dog is red
I love that red dog
hot dog
red cat
that cat hates a red dog
;

data help(keep=word);
	set have;
	nWords=countw(string);
	do i=1 to nWords;
		word=scan(string, i);
		output;
	end;
run;

proc sort data=help nodupkey;
	by word;
run;

proc transpose data=help out=help_wide(drop=_NAME_) prefix=word;
	var word;
run;

data help2(keep=word1 word2);
	set help_wide;
	array words{*} $ word1-word10;
	ncomb=comb(dim(words), 2);
	do i=1 to ncomb;
		twoWords=allcomb(i, 2, of words[*]);
		output;
	end;
run;

proc sql;
	create table want as
	select a.*
		  ,sum(case when (findw(strip(string), strip(word1)) & 
					      findw(strip(string), strip(word2))) then 1
		   else 0 end) as Count
	from help2 as a, have as b
	group by word1, word2
	having calculated count > 0
	order by calculated Count desc;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 175 views
  • 2 likes
  • 4 in conversation