BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
alex_philby
Obsidian | Level 7

Hi,

 

Let's say I have the following table :

 

dog is red
I love that red dog
hot dog
red cat
that cat hates a red dog

 

I'd like to have a count of occurences of every two word groups, something that would look like this :

 

two words groupcount
red+dog3
that+dog2
red+cat2

 

I have managed to cout occurences of each word, thanks to another thread, but I have no idea how to do this.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

One issue to consider:  what if the incoming string repeats a word?  Should both instances of the same word be used, doubling the number of pairs? 

 

The approach below counts double, when the word appears twice.  You can program around that, so we can come back to that if necessary.

 

data pairs;

set have;

if countw(string) > 1;

obsno = _n_;

do j=1 to countw(string) - 1;

   do k=j+1 to countw(string);

      word1 = scan(string, j);

      word2 = scan(string, k);

      if word1 > word2 then do;

         temp = word2;

         word2 = word1;

         word1 = temp;

      end;

      output;

   end;

end;

keep word1 word2 obsno;

run;

 

proc sort data=pairs nodupkey;

by obsno word1 word2;

run;

 

proc freq data=pairs;

tables word1 * word2 / list;

run;

 

*** EDITED to remove duplicate pairs within the same observation.

View solution in original post

10 REPLIES 10
PeterClemmensen
Tourmaline | Level 20

Do you want to set those two word groups yourself or a list of all possible two word groups from the sentences above?

alex_philby
Obsidian | Level 7
I want a list of every possible two words groups.
Astounding
PROC Star

One issue to consider:  what if the incoming string repeats a word?  Should both instances of the same word be used, doubling the number of pairs? 

 

The approach below counts double, when the word appears twice.  You can program around that, so we can come back to that if necessary.

 

data pairs;

set have;

if countw(string) > 1;

obsno = _n_;

do j=1 to countw(string) - 1;

   do k=j+1 to countw(string);

      word1 = scan(string, j);

      word2 = scan(string, k);

      if word1 > word2 then do;

         temp = word2;

         word2 = word1;

         word1 = temp;

      end;

      output;

   end;

end;

keep word1 word2 obsno;

run;

 

proc sort data=pairs nodupkey;

by obsno word1 word2;

run;

 

proc freq data=pairs;

tables word1 * word2 / list;

run;

 

*** EDITED to remove duplicate pairs within the same observation.

alex_philby
Obsidian | Level 7

Astounding wrote:

One issue to consider:  what if the incoming string repeats a word?  Should both instances of the same word be used, doubling the number of pairs?

 

Theoretically, I would not want both instances of the same word to be used. But as a matter of fact, there is no occurrences where the string repeats a word in my table, except for small linking words (like "of" for example). I just have to ignore all these small words with a quick data step. So your code works great for me ! Thank you.

For other users though, it might be useful to find a solution to this issue.

alex_philby
Obsidian | Level 7


      if word1 > word2 then do;

         temp = word2;

         word2 = word1;

         word1 = temp;

 

It works great, but could you please explain the meaning of this part of the code in the loop ?

Astounding
PROC Star

To understand the purpose of that DO group, consider this sentence:

 

it makes it difficult

 

Without that code, both of these pairs would be output:

 

it makes (the first and second words)

makes it (the second and third words)

 

Sorting with NODUPKEY would still fail to remove either of them.  So that DO group switches the value of WORD1 and WORD2, and makes WORD1 the word that alphabetizes first (and WORD2 the word that alphabetizes second).  So the result comes out as:

 

it makes

it makes

 

Then NODUPKEY finds the duplicates and removes one.

Reeza
Super User

Are there specific words you're looking for or all two word combinations, not including duplicates? It seems the order doesn't matter, which is slightly different than an n-gram? Are you including words such as I, the is (articles and conjunctions)?

 

alex_philby
Obsidian | Level 7
Yes, all two words combinations. Not including duplicates and no articles and conjunctions would be better, as stated above.
Reeza
Super User

I'll leave the articles and conjuctions part up to you.

 

Here's one approach. I split the text into individual words, and then use SQL to find all two word combinations. 
Note that this method makes it easy to find any n-gram or join with other lookup tables for sentiment analysis type work. 

 

*Create sample data;
data random_sentences;
	infile cards truncover;
	informat sentence $256.;
	input sentence $256.;
	cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;

;
;
;
*Partition into words;
data f1;
	set random_sentences;
	id=_n_;
	nwords=countw(sentence);
	nchar=length(compress(sentence));

	do word_order=1 to nwords;
		word=scan(sentence, word_order);
		output;
	end;
run;

proc sql;
	create table words2 as
		select t1.sentence, lowcase(t1.word) as word1, lowcase(t2.word) as word2
			from f1 as t1
				cross join f1 as t2
			where t1.sentence=t2.sentence 
				and t1.word_order > t2.word_order
			order by t1.sentence, t1.word_order;
quit;

proc freq data=words2 noprint order=freq;
	table word1*word2 /list out=want;
run;
PeterClemmensen
Tourmaline | Level 20
data have;
input string:$50.;
infile datalines dlm=',';
datalines;
dog is red
I love that red dog
hot dog
red cat
that cat hates a red dog
;

data help(keep=word);
	set have;
	nWords=countw(string);
	do i=1 to nWords;
		word=scan(string, i);
		output;
	end;
run;

proc sort data=help nodupkey;
	by word;
run;

proc transpose data=help out=help_wide(drop=_NAME_) prefix=word;
	var word;
run;

data help2(keep=word1 word2);
	set help_wide;
	array words{*} $ word1-word10;
	ncomb=comb(dim(words), 2);
	do i=1 to ncomb;
		twoWords=allcomb(i, 2, of words[*]);
		output;
	end;
run;

proc sql;
	create table want as
	select a.*
		  ,sum(case when (findw(strip(string), strip(word1)) & 
					      findw(strip(string), strip(word2))) then 1
		   else 0 end) as Count
	from help2 as a, have as b
	group by word1, word2
	having calculated count > 0
	order by calculated Count desc;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 3252 views
  • 2 likes
  • 4 in conversation