BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MikeTurner
Calcite | Level 5

Dear All,

I'm doing one text mining project.One observation is one string and one indicator. I need to count the frequency of each word in the string and I will get one new dataset with many columns. The name of one column is the word and the content is the the frequency of the word in one observation. The indicator remains unchanged. How to implent this using sas if I have hundreds of words in one string? Thanks.

Have

stringindicator
a b c1
a c d d a1
a b c2
f g g h2
e f e a m1

Want

abcedfghmindicator
1110000001
2010200001
1110000002
0000012102
1002010011

Tomi


test.JPG
1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You will have trouble if your "words" are not valid SAS variable names.

data tall / view=tall ;

  row + 1;

  set have ;

  do i=1 to countw(string,' ');

    word = scan(string,i,' ');

    output ;

  end;

  keep row indicator word ;

run;

proc summary data=tall nway ;

  by row indicator ;

  class word ;

  output out=tall_count;

run;

proc transpose data=tall_count out=want (drop=_name_);

  by row indicator ;

  var _freq_ ;

  id word ;

run;

proc print; run;

View solution in original post

8 REPLIES 8
Scott_Mitchell
Quartz | Level 8

How about this?

DATA STUFF;

  SET HAVE;

  A = 0; B = 0; C = 0; D = 0; E = 0; F = 0; G = 0; H = 0; M = 0;

  ARRAY VARS A B C D E F G H M;

  DO J = 1 TO DIM(VARS);

  DO I = 1 TO LENGTH(COMPRESS(STRING));

  IF VNAME(VARS{J}) = SCAN(STRING,I) THEN VARS{J} = VARS{J}+INDICATOR;

  END;

  END;

  DROP I J;

RUN;

MikeTurner
Calcite | Level 5

Thanks. But it will not work if if you have hundreds of words in one string................

Scott_Mitchell
Quartz | Level 8

You probably should provide a sample that is representative of the data you will be working with.

It appears to work for me without issue for 200 characters.

data_null__
Jade | Level 19

As long as the words are "SAS Names".

data ind;
   input string &$32. indicator;
   id + 1;
  
cards;
a b c  1
a c d d a  1
a b c  2
f g g h  2
e f e a m  1
;;;;
   run;
data ind2;
   set ind;
   length word $32;
  
do i = 1 by 1;
      word = scan(string,i,
' ');
      if missing(word) then leave;
      output;
     
end;
  
run;
proc summary data=ind2 nway;
  
class id indicator word;
   output out=sum(drop=_type_);
   run;
proc transpose data=sum out=final;
   by id indicator;
   var _freq_;
   id word;
   run;
proc stdize reponly out=poked missing=0;
  
var _numeric_;
   run;
Tom
Super User Tom
Super User

You will have trouble if your "words" are not valid SAS variable names.

data tall / view=tall ;

  row + 1;

  set have ;

  do i=1 to countw(string,' ');

    word = scan(string,i,' ');

    output ;

  end;

  keep row indicator word ;

run;

proc summary data=tall nway ;

  by row indicator ;

  class word ;

  output out=tall_count;

run;

proc transpose data=tall_count out=want (drop=_name_);

  by row indicator ;

  var _freq_ ;

  id word ;

run;

proc print; run;

Haikuo
Onyx | Level 15

Here is another option using Hash objects.

data have;

input string :&$10.     indicator;

cards;

a b c 1

a c d d a   1

a b c       2

f g g h         2

e f e a m   1

;

data _null_;

  if _n_=1 then do;

  length mvar $ 200;

  retain var ' ';/*This would be your maximum length of your individual word*/

   declare hash h();

   h.definekey('var');

   h.definedata('var');

   h.definedone();

   declare hiter hi('h');

end;

set have end=last;

   do i=1 to countw(string);

     var=scan(string,i);

      rc=h.ref();

      end;

if last then do;

   do rc=hi.first() by 0 while (rc=0);

    mvar=catx(' ',mvar,var);

     rc=hi.next();

   end;

   call symputx('vname',mvar);

end;

 

run;

data want;

  array _var &vname;

  set have;

   do over _var;

     do i=1 to countw(string);

     _var+(vname(_var)=scan(string,i));

       end;

   end;

   output;

   call missing(of _var(*));

   drop string i;

run;


Several things to keep in mind:

1. the max length of macro variable is 64K (65,534) in SAS 9.3, the it works if the total length of all of your distinct words(plus blanks in between) is less than that.

2. the max length of any individual word is 32 char.

3. In current code, case is NOT ignored. It can be easily flipped.

Good luck,

Haikuo

MikeTurner
Calcite | Level 5

Thanks a lot for all kind reply. Greatly appreciated.

Mike

UrvishShah
Fluorite | Level 6

I know it has been marked as ANSWERED...but still here is my try if you like...

data want;

   set have;

   i = 1;

   str = scan(string,i);

   do while(str NE " ");

      j + 1;

      _str_ = str;

        i = i + 1;

        str = scan(string,i);

        output;

   end;

run;

proc sort data = want;

   by string indicator;

run;

proc freq data = want noprint;

   tables _str_ / out = freq(drop = percent);

   by string  indicator;

run;

proc transpose data = freq out = want(drop = _name_ _label_);

   by string indicator;

   var count;

   id _str_;

run;

data want(drop = i);

   set want;

   array temp(*) _numeric_;

   do i = 1 to dim(temp);

     if temp(i) = . then temp(i) = 0;

   end;

run;

-Urvish

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 8824 views
  • 6 likes
  • 6 in conversation