Count occurence of each word in a table

Solved
Regular Contributor
Posts: 186

Count occurence of each word in a table

Hi,

Lets say I have the following table :

My cat is blue
I like dogs
The cat is blue

I would like to write a query that would count each word in each line. The logic would be that each character string preceded or followed by a space is a word.

The result would be something like :

My 1

cat 2

is 2

blue 2

dogs 1

I 1

The 1

I hope you can help me write this query.

Thank you for your help and time.

Accepted Solutions
Solution
‎01-08-2015 06:27 PM
Posts: 5,541

Re: Count occurence of each word in a table

Not a query, but efficient:

data have;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data temp / view=temp;

length word \$12;

set have;

do i = 1 by 1 until(missing(word));

if not missing(word) then output;

end;

keep word;

run;

proc freq data=temp;

table word / noprint out=wordfreq(drop=percent);

run;

or if you want a query, replace last step by:

proc sql;

create table wordcount2 as

select word, count(*) as COUNT

from temp

group by word;

quit;

PG

PG

All Replies
Solution
‎01-08-2015 06:27 PM
Posts: 5,541

Re: Count occurence of each word in a table

Not a query, but efficient:

data have;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data temp / view=temp;

length word \$12;

set have;

do i = 1 by 1 until(missing(word));

if not missing(word) then output;

end;

keep word;

run;

proc freq data=temp;

table word / noprint out=wordfreq(drop=percent);

run;

or if you want a query, replace last step by:

proc sql;

create table wordcount2 as

select word, count(*) as COUNT

from temp

group by word;

quit;

PG

PG
Posts: 3,167

Re: Count occurence of each word in a table

Here is another possibility using Hash, also efficient. (Part of code was stolen from PG's post):

data have;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data _null_;

if _n_=1 then

do;

declare hash h();

h.definekey('word');

h.definedata('word', 'count');

h.definedone();

end;

length word \$12;

count=.;

set have end=done;

do i = 1 by 1 until(missing(word));

if not missing(word) then

do;

rc=h.find();

if rc=0 then

do;

count+1;

rc=h.replace();

end;

else

do;

count=1;

rc=h.replace();

end;

end;

end;

if done then

rc=h.output(dataset:'want');

run;

Super Contributor
Posts: 326

Re: Count occurence of each word in a table

Yet another  simplified version of HASH Solution:

data have;

length word \$8;

input;

nwords = countw(_infile_, ' ');

do i = 1 to nwords;

word = scan(_infile_, i, ' ');

output;

end;

datalines;

My cat is blue

I like dogs

The cat is blue

;

run;

data _null_;

if _n_ = 1 then do;

length word \$8;

declare hash h();

h.definekey('word');

h.definedata('word', 'count');

h.definedone();

end;

do until(last);

set have end = last;

if h.find() ^= 0 then count = 0;

count + 1;

h.replace();

end;

if last then h.output(dataset:'W_Counts');

run;

proc print data = w_counts;

run;

Posts: 1,318

Re: Count occurence of each word in a table

You could also use hash with SUMINC to keep track of your count.

data have;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data want;

length word \$ 12;

cnt=1;

declare hash h(suminc:'cnt');

h.definekey('word');

h.definedone();

do until(done);

set have end=done;

do _n_=1 by 1 until(missing(word));

h.ref();

end;

end;

rc=h.output(dataset:'foo');

declare hiter hi('h');

rc=hi.first();

do while(rc=0);

rc=h.sum(sum:count);

output;

rc=hi.next();

end;

keep word count;

run;

Regular Contributor
Posts: 186

Re: Count occurence of each word in a table

Thank you all for your replies.

PGStats solution works like a charm.

PG one thing I am not sure I fully understand in your code is the 1 by 1 until step...

Is 1 by 1 a required statement for an until loop in the SAS language?

What does 1 by 1 really do in the loop?

Posts: 5,541

Re: Count occurence of each word in a table

do i = 1 by 1 until(some condition); is the same as do until(some condition); but provides an iteration counter, i, that can be used inside the loop. There is no upper limit on the number of iterations. You can add an upper limit if you want, with the syntax do i = 1 to 10000 by 1 until(some condition); - PG

PG
Regular Contributor
Posts: 186

Re: Count occurence of each word in a table

Thank you both for your replies.

Everything is much clearer now!

Best regards.

Frequent Contributor
Posts: 115

Re: Count occurence of each word in a table

PG is basically using that as a index or in other words count variable to retireve each word in the string one by one as the scan function does scan(string,count) where count can be number, variable or a derived expression. so in your case he is using index variable as count to extract starting from my, then cat and so on.

You could also use something like:

do until(word=' ');

count+1;

word = scan(string, count);

if not missing(word) then output;

end;

Just different ways to write the code.

Regards,

Naveen

Posts: 5,541