- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Lets say I have the following table :
COMMENTS |
---|
My cat is blue |
I like dogs |
The cat is blue |
I would like to write a query that would count each word in each line. The logic would be that each character string preceded or followed by a space is a word.
The result would be something like :
My 1
cat 2
is 2
blue 2
dogs 1
I 1
The 1
I hope you can help me write this query.
Thank you for your help and time.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Not a query, but efficient:
data have;
input COMMENTS $80.;
datalines;
My cat is blue
I like dogs
The cat is blue
;
data temp / view=temp;
length word $12;
set have;
do i = 1 by 1 until(missing(word));
word = upcase(scan(COMMENTS, i));
if not missing(word) then output;
end;
keep word;
run;
proc freq data=temp;
table word / noprint out=wordfreq(drop=percent);
run;
or if you want a query, replace last step by:
proc sql;
create table wordcount2 as
select word, count(*) as COUNT
from temp
group by word;
quit;
PG
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Not a query, but efficient:
data have;
input COMMENTS $80.;
datalines;
My cat is blue
I like dogs
The cat is blue
;
data temp / view=temp;
length word $12;
set have;
do i = 1 by 1 until(missing(word));
word = upcase(scan(COMMENTS, i));
if not missing(word) then output;
end;
keep word;
run;
proc freq data=temp;
table word / noprint out=wordfreq(drop=percent);
run;
or if you want a query, replace last step by:
proc sql;
create table wordcount2 as
select word, count(*) as COUNT
from temp
group by word;
quit;
PG
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is another possibility using Hash, also efficient. (Part of code was stolen from PG's post):
data have;
input COMMENTS $80.;
datalines;
My cat is blue
I like dogs
The cat is blue
;
data _null_;
if _n_=1 then
do;
declare hash h();
h.definekey('word');
h.definedata('word', 'count');
h.definedone();
end;
length word $12;
count=.;
set have end=done;
do i = 1 by 1 until(missing(word));
word = upcase(scan(COMMENTS, i));
if not missing(word) then
do;
rc=h.find();
if rc=0 then
do;
count+1;
rc=h.replace();
end;
else
do;
count=1;
rc=h.replace();
end;
end;
end;
if done then
rc=h.output(dataset:'want');
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yet another simplified version of HASH Solution:
data have;
length word $8;
input;
nwords = countw(_infile_, ' ');
do i = 1 to nwords;
word = scan(_infile_, i, ' ');
output;
end;
datalines;
My cat is blue
I like dogs
The cat is blue
;
run;
data _null_;
if _n_ = 1 then do;
length word $8;
declare hash h();
h.definekey('word');
h.definedata('word', 'count');
h.definedone();
end;
do until(last);
set have end = last;
if h.find() ^= 0 then count = 0;
count + 1;
h.replace();
end;
if last then h.output(dataset:'W_Counts');
run;
proc print data = w_counts;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could also use hash with SUMINC to keep track of your count.
data have;
input COMMENTS $80.;
datalines;
My cat is blue
I like dogs
The cat is blue
;
data want;
length word $ 12;
cnt=1;
declare hash h(suminc:'cnt');
h.definekey('word');
h.definedone();
do until(done);
set have end=done;
do _n_=1 by 1 until(missing(word));
word=scan(comments, _n_);
h.ref();
end;
end;
rc=h.output(dataset:'foo');
declare hiter hi('h');
rc=hi.first();
do while(rc=0);
rc=h.sum(sum:count);
output;
rc=hi.next();
end;
keep word count;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you all for your replies.
PGStats solution works like a charm.
PG one thing I am not sure I fully understand in your code is the 1 by 1 until step...
Is 1 by 1 a required statement for an until loop in the SAS language?
What does 1 by 1 really do in the loop?
Thank you for your help
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
do i = 1 by 1 until(some condition); is the same as do until(some condition); but provides an iteration counter, i, that can be used inside the loop. There is no upper limit on the number of iterations. You can add an upper limit if you want, with the syntax do i = 1 to 10000 by 1 until(some condition); - PG
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you both for your replies.
Everything is much clearer now!
Best regards.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
PG is basically using that as a index or in other words count variable to retireve each word in the string one by one as the scan function does scan(string,count) where count can be number, variable or a derived expression. so in your case he is using index variable as count to extract starting from my, then cat and so on.
You could also use something like:
do until(word=' ');
count+1;
word = scan(string, count);
if not missing(word) then output;
end;
Just different ways to write the code.
Regards,
Naveen
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You would also need to initialize count=0 before the loop.