Grumbler
Obsidian | Level 7

searched around but couldn't find what i need.

 

example,

 

string="lksadfjlkjthisiswhatineed thisiswhat ineedlkaflkasfdlkj";

 

i can use prxmatch("m/this|what|need/oi",string);

 

but it only returns the position of the first word.

 

how do i count all of the words in this string?

 

thanks.
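(Editor's note: PRXMATCH only reports the position of the first match, but CALL PRXNEXT can walk through the whole string and find every match. A minimal sketch using the pattern from the question; the expected count is a hand-worked estimate, not a tested result:)

```sas
/* Count every match of the pattern, not just the first,
   by repeatedly calling CALL PRXNEXT until no match is found. */
data _null_;
  string = "lksadfjlkjthisiswhatineed thisiswhat ineedlkaflkasfdlkj";
  prxid = prxparse("/this|what|need/i");   /* case-insensitive pattern */
  start = 1;
  stop  = length(string);
  n = 0;
  call prxnext(prxid, start, stop, string, position, length);
  do while (position > 0);
    n + 1;
    call prxnext(prxid, start, stop, string, position, length);
  end;
  put n=;   /* 6 non-overlapping matches in this example */
run;
```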

8 REPLIES
ChrisBrooks
Ammonite | Level 13

I'm normally a big advocate of regular expressions, but this is simpler:

 

data _null_;
	string="lksadfjlkjthisiswhatineed thisiswhat ineedlkaflkasfdlkj";
	count=count(string,'this')+count(string,'what')+count(string,'need');
	put count=;

run;
Grumbler
Obsidian | Level 7

but this only works for 3 words.  i have hundreds of keywords that i would like to count.  can't type it all like this.  it's all in macro.

 

thanks.

RW9
Diamond | Level 26

" it's all in macro"

There is your problem right there.  Data should be in datasets - that is what they are for.  Once data is in datasets, then you use Base SAS code to analyze that data.  For example, if I had a string in a dataset, I could achieve a count of all words quite simply with two steps:

1) a data step outputs each word of any number of strings, one observation per word

2) proc freq the resulting dataset to get a dataset with unique words and their counts within the data

 

Macro is not the place to be doing data processing, it is nothing more than a find/replace system for generating text.

ChrisBrooks
Ammonite | Level 13

In that case you'll need to give us a sample of your keywords, input and output in the form of have and want data sets, because (as @RW9 says) this really should be done in data step.

Ksharp
Super User
data k;
input k $;
cards;
this 
what 
need
;
run;
data have;
string="lksadfjlkjthisiswhatineed thisiswhat ineedlkaflkasfdlkj";
output;
run;

proc sql;
select string,sum(count(string,strip(k),'i')) as n
 from have,k
  group by string;
quit;
Grumbler
Obsidian | Level 7

thanks everyone for the tips.  i guess i should clarify a bit more.  what i have is millions of records of "strings" in one variable.  i also have maybe 10 or 20 lists of key words.  i would like to count each list of key words in the millions of "strings" and see which list occurs most frequently.  then i will decide how to categorize these strings.  was just wondering if there is a fast way to do that.  thanks.  🙂

RW9
Diamond | Level 26

Well, with no test data to run with I am guessing here but something like:

data biglist;
  length string $2000;
  string="a big dog walks around"; output;
  string="something happened other wise"; output;
  string="this is a wise old string with big connotations"; output;
run;

data words;
  length word $2000;
  word="dog"; output;
  word="big"; output;
  word="wise"; output;
run;

data inter (drop=i string);
  set biglist;
  do i=1 to countw(string," ");
    wrd=scan(string,i," ");
    output;
  end;
run;

proc sql;
  delete from inter 
  where wrd not in (select word from words);
quit;

proc freq data=inter;
  tables wrd / out=want;
run;

You can drop the SQL delete and run the freq over all the data, then filter the results; that might use fewer resources - you will need to try it.

Ksharp
Super User
You could try my SQL. Maybe it is not too slow. A faster way I can think of is using a hash table.
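(Editor's note: the hash-table idea could be sketched as follows - load the keyword list into a hash object once, then for each string iterate over the keywords and sum substring counts. This assumes the `have` and `k` datasets from the earlier SQL reply; it is an untested illustration, not Ksharp's actual code:)

```sas
/* Hash-object sketch: keywords are loaded into memory once,
   then each incoming string is scored against all of them. */
data want;
  set have;
  if _n_ = 1 then do;
    declare hash h(dataset: "k");   /* keyword list from dataset K */
    h.defineKey("k");
    h.defineData("k");
    h.defineDone();
    declare hiter hi("h");
  end;
  length k $200;
  call missing(k);
  n = 0;
  rc = hi.first();                  /* walk every keyword in the hash */
  do while (rc = 0);
    n + count(string, strip(k), "i");   /* case-insensitive substring count */
    rc = hi.next();
  end;
  drop rc k;
run;
```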


Discussion stats
  • 8 replies
  • 2439 views
  • 0 likes
  • 4 in conversation