Find counts of multiple substrings across dataset.

Reply
Super Contributor
Posts: 418

Find counts of multiple substrings across dataset.

Hello everyone. I have a dataset that is just one variable called comment (it's comment data across 200-250 rows).  I was tasked with finding the number of times (counts) that a specific subset of words occures across the dataset.

Ex: How many times do the words "loss", "death", "happiness","girlfriend" occur across the data.


I thought this would be something I could automate in a do loop, similar to below. However it does not work. I am actually a bit stuck on how to get this one working in a loop (within a dataset, I could always wrap the entire thing in a Macro loop, but i'd like to write it into a dataset loop for efficiencies sake).

I have one dataset that has the variable names  like so.

data person;

  infile datalines delimiter=',';

  input variablestosearch $ ;

  datalines; 

LOSS,

DEATH,

HAPPINESS,

GIRLFRIEND

;

I know I can write a macro loop to loop across this variable, and then create a dataset for each variable and then append them together, similar to below.

%do i=1 %to 4;

data _null_;

set person;

if _n_=&i then call symputx('variable',variablestosearch);

run;

data answer1(keep=type newvar);

set answer end=eof;

variable="&variable";

countvar=count(upcase(comments),"&variable");

newvar+countvar;

type="&Variable";

if eof=1 then output;

run;

proc datasets;

append base=FINALdataset data=answer1;

run;

%end;

However this has the problem of reading from the dataset 4 times. I was curious if there was a way to make a macro string around the variable name, and then pull that string into ONE dataset read in, and output one file from one readin... Please let me know if this question does not make sense.

Super User
Posts: 10,460

Re: Find counts of multiple substrings across dataset.

Assuming that the words your are interested do not occur as part of a longer word then the COUNT function looks likely:

From the documentation:

xyz='This is a thistle? Yes, this is a thistle.';

howmanythis=count(xyz,'this');

put howmanythis;

So you could either use array or litterals;

DeathCount = count(variablestosearch,'DEATH','i'); /* i says to ignore case in comparisons*/

Assuming you don't have too many literals to search for add an additional line of code, otherwise Array of litterals and result variables may be the way to go.

Super Contributor
Posts: 418

Re: Find counts of multiple substrings across dataset.

Hello Ballard!  Thanks again for all your help as always you are extremely friendly and Helpful. This time  I actually probably didn't state my question very well. I actually know HOW to get the count and then sum across rows (the step above in the macro loop should do just that). i am just looking for a way to make it more efficient if I have to loop across say, 100 different words....

The problem is if I have 100 different words, the above loop I wrote is reading the dataset 100 times, and this is simply too slow. So I was wondering if there was a way to sum across rows multiple different variables in a loop, while reading in the dataset only once?

Does that make more sense?

Super User
Posts: 10,460

Re: Find counts of multiple substrings across dataset.

Array processing is probably the quicker way.

length w1 - w4 $ 15; /* make the length that of the longest word your going to search for */

Array words w1 w2 w3 w4 ('Loss', 'Death','Happiness','Girlfriend'); /* Initialize array with specified values, yes it gets long with 100

or use 100 w1=' '; type statements and just have array words w1-w100*/

Array wcount wc1 -wc4;

do _i_ = 1 to dim(words);

     wc[_i_] = count(variablestosearch,w[_i_],'i');

end;

recommend a suitable label statement for the wc variables OR descriptive names.

Super User
Posts: 17,737

Re: Find counts of multiple substrings across dataset.

Split your sentence into words, identified by the sentence and then run a proc freq, and put your words in the where clause.

data comments;

    comment="I broke up with my girlfriend. This makes me sad";output;

    comment="I just got a girlfriend. Yay!"; output;

    comment="I lost my job, this makes me sad"; output;

run;

data comments_flipped;

    set comments;

    i=1;

    str=lowcase(scan(comment, i, " .!,"));

    sentence=_n_;

    do while (str ne "");

        output;

        i+1;

        str=lowcase(scan(comment, i, " .!,"));

    end;

run;

proc freq data=comments_flipped;

    table str;

    where str in ("girlfriend", "sad");

run;

Ask a Question
Discussion stats
  • 4 replies
  • 842 views
  • 0 likes
  • 3 in conversation