Dear All,
I'm doing one text mining project.One observation is one string and one indicator. I need to count the frequency of each word in the string and I will get one new dataset with many columns. The name of one column is the word and the content is the the frequency of the word in one observation. The indicator remains unchanged. How to implent this using sas if I have hundreds of words in one string? Thanks.
Have
string | indicator |
a b c | 1 |
a c d d a | 1 |
a b c | 2 |
f g g h | 2 |
e f e a m | 1 |
Want
a | b | c | e | d | f | g | h | m | indicator |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 2 |
1 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 1 |
Tomi
You will have trouble if your "words" are not valid SAS variable names.
data tall / view=tall ;
row + 1;
set have ;
do i=1 to countw(string,' ');
word = scan(string,i,' ');
output ;
end;
keep row indicator word ;
run;
proc summary data=tall nway ;
by row indicator ;
class word ;
output out=tall_count;
run;
proc transpose data=tall_count out=want (drop=_name_);
by row indicator ;
var _freq_ ;
id word ;
run;
proc print; run;
How about this?
DATA STUFF;
SET HAVE;
A = 0; B = 0; C = 0; D = 0; E = 0; F = 0; G = 0; H = 0; M = 0;
ARRAY VARS A B C D E F G H M;
DO J = 1 TO DIM(VARS);
DO I = 1 TO LENGTH(COMPRESS(STRING));
IF VNAME(VARS{J}) = SCAN(STRING,I) THEN VARS{J} = VARS{J}+INDICATOR;
END;
END;
DROP I J;
RUN;
Thanks. But it will not work if if you have hundreds of words in one string................
You probably should provide a sample that is representative of the data you will be working with.
It appears to work for me without issue for 200 characters.
As long as the words are "SAS Names".
You will have trouble if your "words" are not valid SAS variable names.
data tall / view=tall ;
row + 1;
set have ;
do i=1 to countw(string,' ');
word = scan(string,i,' ');
output ;
end;
keep row indicator word ;
run;
proc summary data=tall nway ;
by row indicator ;
class word ;
output out=tall_count;
run;
proc transpose data=tall_count out=want (drop=_name_);
by row indicator ;
var _freq_ ;
id word ;
run;
proc print; run;
Here is another option using Hash objects.
data have;
input string :&$10. indicator;
cards;
a b c 1
a c d d a 1
a b c 2
f g g h 2
e f e a m 1
;
data _null_;
if _n_=1 then do;
length mvar $ 200;
retain var ' ';/*This would be your maximum length of your individual word*/
declare hash h();
h.definekey('var');
h.definedata('var');
h.definedone();
declare hiter hi('h');
end;
set have end=last;
do i=1 to countw(string);
var=scan(string,i);
rc=h.ref();
end;
if last then do;
do rc=hi.first() by 0 while (rc=0);
mvar=catx(' ',mvar,var);
rc=hi.next();
end;
call symputx('vname',mvar);
end;
run;
data want;
array _var &vname;
set have;
do over _var;
do i=1 to countw(string);
_var+(vname(_var)=scan(string,i));
end;
end;
output;
call missing(of _var(*));
drop string i;
run;
Several things to keep in mind:
1. the max length of macro variable is 64K (65,534) in SAS 9.3, the it works if the total length of all of your distinct words(plus blanks in between) is less than that.
2. the max length of any individual word is 32 char.
3. In current code, case is NOT ignored. It can be easily flipped.
Good luck,
Haikuo
Thanks a lot for all kind reply. Greatly appreciated.
Mike
I know it has been marked as ANSWERED...but still here is my try if you like...
data want;
set have;
i = 1;
str = scan(string,i);
do while(str NE " ");
j + 1;
_str_ = str;
i = i + 1;
str = scan(string,i);
output;
end;
run;
proc sort data = want;
by string indicator;
run;
proc freq data = want noprint;
tables _str_ / out = freq(drop = percent);
by string indicator;
run;
proc transpose data = freq out = want(drop = _name_ _label_);
by string indicator;
var count;
id _str_;
run;
data want(drop = i);
set want;
array temp(*) _numeric_;
do i = 1 to dim(temp);
if temp(i) = . then temp(i) = 0;
end;
run;
-Urvish
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.