Solved
Contributor
Posts: 49

How to count the frequency like this?

Dear All,

I'm doing one text mining project.One observation is one string and one indicator. I need to count the frequency of each word in the string and I will get one new dataset with many columns. The name of one column is the word and the content is the the frequency of the word in one observation. The indicator remains unchanged. How to implent this using sas if I have hundreds of words in one string? Thanks.

Have

 string indicator a b c 1 a c d d a 1 a b c 2 f g g h 2 e f e a m 1

Want

 a b c e d f g h m indicator 1 1 1 0 0 0 0 0 0 1 2 0 1 0 2 0 0 0 0 1 1 1 1 0 0 0 0 0 0 2 0 0 0 0 0 1 2 1 0 2 1 0 0 2 0 1 0 0 1 1

Tomi

Accepted Solutions
Solution
‎08-05-2013 08:45 AM
Super User
Posts: 8,114

Re: How to count the frequency like this?

You will have trouble if your "words" are not valid SAS variable names.

data tall / view=tall ;

row + 1;

set have ;

do i=1 to countw(string,' ');

word = scan(string,i,' ');

output ;

end;

keep row indicator word ;

run;

proc summary data=tall nway ;

by row indicator ;

class word ;

output out=tall_count;

run;

proc transpose data=tall_count out=want (drop=_name_);

by row indicator ;

var _freq_ ;

id word ;

run;

proc print; run;

All Replies
Super Contributor
Posts: 312

Re: How to count the frequency like this?

DATA STUFF;

SET HAVE;

A = 0; B = 0; C = 0; D = 0; E = 0; F = 0; G = 0; H = 0; M = 0;

ARRAY VARS A B C D E F G H M;

DO J = 1 TO DIM(VARS);

DO I = 1 TO LENGTH(COMPRESS(STRING));

IF VNAME(VARS{J}) = SCAN(STRING,I) THEN VARS{J} = VARS{J}+INDICATOR;

END;

END;

DROP I J;

RUN;

Contributor
Posts: 49

Re: How to count the frequency like this?

Thanks. But it will not work if if you have hundreds of words in one string................

Super Contributor
Posts: 312

Re: How to count the frequency like this?

You probably should provide a sample that is representative of the data you will be working with.

It appears to work for me without issue for 200 characters.

Posts: 3,852

Re: How to count the frequency like this?

As long as the words are "SAS Names".

data ind;
input string &\$32. indicator;
id + 1;

cards;
a b c  1
a c d d a  1
a b c  2
f g g h  2
e f e a m  1
;;;;
run;
data ind2;
set ind;
length word \$32;

do i = 1 by 1;
word = scan(string,i,
' ');
if missing(word) then leave;
output;

end;

run;
proc summary data=ind2 nway;

class id indicator word;
output out=sum(drop=_type_);
run;
proc transpose data=sum out=final;
by id indicator;
var _freq_;
id word;
run;
proc stdize reponly out=poked missing=0;

var _numeric_;
run;
Solution
‎08-05-2013 08:45 AM
Super User
Posts: 8,114

Re: How to count the frequency like this?

You will have trouble if your "words" are not valid SAS variable names.

data tall / view=tall ;

row + 1;

set have ;

do i=1 to countw(string,' ');

word = scan(string,i,' ');

output ;

end;

keep row indicator word ;

run;

proc summary data=tall nway ;

by row indicator ;

class word ;

output out=tall_count;

run;

proc transpose data=tall_count out=want (drop=_name_);

by row indicator ;

var _freq_ ;

id word ;

run;

proc print; run;

Posts: 3,167

Re: How to count the frequency like this?

Here is another option using Hash objects.

data have;

input string :&\$10.     indicator;

cards;

a b c 1

a c d d a   1

a b c       2

f g g h         2

e f e a m   1

;

data _null_;

if _n_=1 then do;

length mvar \$ 200;

retain var ' ';/*This would be your maximum length of your individual word*/

declare hash h();

h.definekey('var');

h.definedata('var');

h.definedone();

declare hiter hi('h');

end;

set have end=last;

do i=1 to countw(string);

var=scan(string,i);

rc=h.ref();

end;

if last then do;

do rc=hi.first() by 0 while (rc=0);

mvar=catx(' ',mvar,var);

rc=hi.next();

end;

call symputx('vname',mvar);

end;

run;

data want;

array _var &vname;

set have;

do over _var;

do i=1 to countw(string);

_var+(vname(_var)=scan(string,i));

end;

end;

output;

call missing(of _var(*));

drop string i;

run;

Several things to keep in mind:

1. the max length of macro variable is 64K (65,534) in SAS 9.3, the it works if the total length of all of your distinct words(plus blanks in between) is less than that.

2. the max length of any individual word is 32 char.

3. In current code, case is NOT ignored. It can be easily flipped.

Good luck,

Haikuo

Contributor
Posts: 49

Re: How to count the frequency like this?

Thanks a lot for all kind reply. Greatly appreciated.

Mike

Regular Contributor
Posts: 195

Re: How to count the frequency like this?

I know it has been marked as ANSWERED...but still here is my try if you like...

data want;

set have;

i = 1;

str = scan(string,i);

do while(str NE " ");

j + 1;

_str_ = str;

i = i + 1;

str = scan(string,i);

output;

end;

run;

proc sort data = want;

by string indicator;

run;

proc freq data = want noprint;

tables _str_ / out = freq(drop = percent);

by string  indicator;

run;

proc transpose data = freq out = want(drop = _name_ _label_);

by string indicator;

var count;

id _str_;

run;

data want(drop = i);

set want;

array temp(*) _numeric_;

do i = 1 to dim(temp);

if temp(i) = . then temp(i) = 0;

end;

run;

-Urvish

🔒 This topic is solved and locked.