Solved: Re: Count occurence of each word in a table

nicnad · Posted 01-08-2015 05:53 PM

Hi,

Lets say I have the following table :

COMMENTS
My cat is blue
I like dogs
The cat is blue

I would like to write a query that would count each word in each line. The logic would be that each character string preceded or followed by a space is a word.

The result would be something like :

My 1

cat 2

is 2

blue 2

dogs 1

I 1

The 1

I hope you can help me write this query.

Thank you for your help and time.

PGStats · Posted 01-08-2015 06:27 PM

Not a query, but efficient:

data have;

input COMMENTS $80.;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data temp / view=temp;

length word $12;

set have;

do i = 1 by 1 until(missing(word));

word = upcase(scan(COMMENTS, i));

if not missing(word) then output;

end;

keep word;

run;

proc freq data=temp;

table word / noprint out=wordfreq(drop=percent);

run;

or if you want a query, replace last step by:

proc sql;

create table wordcount2 as

select word, count(*) as COUNT

from temp

group by word;

quit;

PG

View solution in original post

PGStats · Posted 01-08-2015 06:27 PM

Not a query, but efficient:

data have;

input COMMENTS $80.;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data temp / view=temp;

length word $12;

set have;

do i = 1 by 1 until(missing(word));

word = upcase(scan(COMMENTS, i));

if not missing(word) then output;

end;

keep word;

run;

proc freq data=temp;

table word / noprint out=wordfreq(drop=percent);

run;

or if you want a query, replace last step by:

proc sql;

create table wordcount2 as

select word, count(*) as COUNT

from temp

group by word;

quit;

PG

Haikuo · Posted 01-08-2015 10:34 PM

Here is another possibility using Hash, also efficient. (Part of code was stolen from PG's post):

data have;

input COMMENTS $80.;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data _null_;

if _n_=1 then

do;

declare hash h();

h.definekey('word');

h.definedata('word', 'count');

h.definedone();

end;

length word $12;

count=.;

set have end=done;

do i = 1 by 1 until(missing(word));

word = upcase(scan(COMMENTS, i));

if not missing(word) then

do;

rc=h.find();

if rc=0 then

do;

count+1;

rc=h.replace();

end;

else

do;

count=1;

rc=h.replace();

end;

if done then

rc=h.output(dataset:'want');

run;

KachiM · Posted 01-09-2015 12:21 AM

Yet another simplified version of HASH Solution:

data have;

length word $8;

input;

nwords = countw(_infile_, ' ');

do i = 1 to nwords;

word = scan(_infile_, i, ' ');

output;

end;

datalines;

My cat is blue

I like dogs

The cat is blue

;

run;

data _null_;

if _n_ = 1 then do;

length word $8;

declare hash h();

h.definekey('word');

h.definedata('word', 'count');

h.definedone();

end;

do until(last);

set have end = last;

if h.find() ^= 0 then count = 0;

count + 1;

h.replace();

end;

if last then h.output(dataset:'W_Counts');

run;

proc print data = w_counts;

run;

FriedEgg · Posted 01-09-2015 01:34 AM

You could also use hash with SUMINC to keep track of your count.

data have;

input COMMENTS $80.;

datalines;

My cat is blue

I like dogs

The cat is blue

;

data want;

length word $ 12;

cnt=1;

declare hash h(suminc:'cnt');

h.definekey('word');

h.definedone();

do until(done);

set have end=done;

do _n_=1 by 1 until(missing(word));

word=scan(comments, _n_);

h.ref();

end;

rc=h.output(dataset:'foo');

declare hiter hi('h');

rc=hi.first();

do while(rc=0);

rc=h.sum(sum:count);

output;

rc=hi.next();

end;

keep word count;

run;

nicnad · Posted 01-09-2015 02:54 PM

Thank you all for your replies.

PGStats solution works like a charm.

PG one thing I am not sure I fully understand in your code is the 1 by 1 until step...

Is 1 by 1 a required statement for an until loop in the SAS language?

What does 1 by 1 really do in the loop?

Thank you for your help

PGStats · Posted 01-09-2015 03:13 PM

do i = 1 by 1 until(some condition); is the same as do until(some condition); but provides an iteration counter, i, that can be used inside the loop. There is no upper limit on the number of iterations. You can add an upper limit if you want, with the syntax do i = 1 to 10000 by 1 until(some condition); - PG

PG

nicnad · Posted 01-09-2015 04:51 PM

Thank you both for your replies.

Everything is much clearer now!

Best regards.

naveen_srini · Posted 01-09-2015 03:19 PM

PG is basically using that as a index or in other words count variable to retireve each word in the string one by one as the scan function does scan(string,count) where count can be number, variable or a derived expression. so in your case he is using index variable as count to extract starting from my, then cat and so on.

You could also use something like:

do until(word=' ');

count+1;

word = scan(string, count);

if not missing(word) then output;

end;

Just different ways to write the code.

Regards,

Naveen

PGStats · Posted 01-09-2015 06:01 PM

You would also need to initialize count=0 before the loop.

PG

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!