Solved: Re: How to count number of times a word appears in a series of string ...

Adubhai · Posted 08-14-2023 08:45 PM

I have a dataset with 43 string variables named VAR1 till VAR43. There are various string phrases within each value of those variables. I want a new column which counts the number of times the word "Daugther" is written within the 43 variables for each row. Can someone help me with it?

ballardw · Posted 08-15-2023 02:26 AM

The clarification, especially including the : is very helpful.

Brute force go through each variable and each word. Increment a counter each time the desired word is encountered.

data want;
   set have;
   array w(*) var1-var43;
   do i=1 to dim(w);
      if not missing(w[i]) then do j=1 to countw(w[i]);
         if scan(w[i],j)='Daughter:' then count=sum(count,1);
      end;
   end;
run;

This will count if the desired word occurs more than once in a single variable, which I have not seen ruled out as a possibility.

If you have other than default delimiters between words you may have to add them to the SCAN function call.

If your computer uses ASCII characters, the default delimiters are as follows:

blank ! $ % & ( ) * + , - . / ; < ^ |

In ASCII environments that do not contain the ^ character, the SCAN function uses the ~ character instead.

If your computer uses EBCDIC characters, then the default delimiters are as follows:

blank ! $ % & ( ) * + , - . / ; < ¬ | ¢

So if running in an ASCII environment and you need to find the word delimited with @ characters for example: @Dauthter; then scan wouldn't treat the @ as a delimiter unless you use

if scan(w[i],j,' !$%&()*+,-./;<^|@') ='Daughter:'

to add the @ to the existing list.

Caution: if you place that string inside double quotes instead of single as shown you will likely some macro processor errors because the % and & characters will be treated as part of the macro language when inside double quotes.

If you string values are UTF-8 and have unicode characters present then you may have other issues if special characters are adjacent to your word.

View solution in original post

Patrick · Posted 08-14-2023 09:17 PM

One way could be to just concatenate all these variables using catx() and then use countw(). This should work as long as the sum of your strings won't exceed 32767 characters.

What's the length of your variables var1 to var43?

Adubhai · Posted 08-14-2023 09:31 PM

Hi Patrick,

Thanks for your reply. The variables have random and multiple character lengths.

mkeintz · Posted 08-14-2023 10:33 PM

@Adubhai wrote:
Hi Patrick,

Thanks for your reply. The variables have random and multiple character lengths.

So if the longest of the 43 variables is no more than floor(32767/43) in length, you don't need to worry that concatenating the random-length variables will result in a string too long to be captured in a single character variable. In that case the suggestion by @Patrick will achieve your goal.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Adubhai · Posted 08-14-2023 10:48 PM

Ok so assuming none of the variables are no more than (32767/43) in length and I am able to concatenate the variables. Then how do I count the number of times the word "Daughter" is written in the new concatenated variable?

ballardw · Posted 08-14-2023 10:50 PM

Some incomplete elements in your description. You say count the word "Daugther" (or more likely for English speakers "Daughter").

Does case matter? Does "daugther", ""DAUGTHER", "dAughTER" or similar differences in capitalization count?

How about presence in compound words (assuming correct spelling now) such as "Daughter-in-law" or "daughtering"?

Plurals "Daughters"?

Plurals with different case?

Provide example data showing the differences I have described above and the expected resulting count for the examples.

Should only need 4 or 5 variables though instead of all 43. Provide the data as a working data step.

Adubhai · Posted 08-14-2023 10:52 PM

Thanks for pointing it out. It was a typo from my end. The word is specifically "Daughter:" with the ":" and no other variation of it.

ballardw · Posted 08-15-2023 02:26 AM

The clarification, especially including the : is very helpful.

Brute force go through each variable and each word. Increment a counter each time the desired word is encountered.

data want;
   set have;
   array w(*) var1-var43;
   do i=1 to dim(w);
      if not missing(w[i]) then do j=1 to countw(w[i]);
         if scan(w[i],j)='Daughter:' then count=sum(count,1);
      end;
   end;
run;

This will count if the desired word occurs more than once in a single variable, which I have not seen ruled out as a possibility.

If you have other than default delimiters between words you may have to add them to the SCAN function call.

If your computer uses ASCII characters, the default delimiters are as follows:

blank ! $ % & ( ) * + , - . / ; < ^ |

In ASCII environments that do not contain the ^ character, the SCAN function uses the ~ character instead.

If your computer uses EBCDIC characters, then the default delimiters are as follows:

blank ! $ % & ( ) * + , - . / ; < ¬ | ¢

So if running in an ASCII environment and you need to find the word delimited with @ characters for example: @Dauthter; then scan wouldn't treat the @ as a delimiter unless you use

if scan(w[i],j,' !$%&()*+,-./;<^|@') ='Daughter:'

to add the @ to the existing list.

Caution: if you place that string inside double quotes instead of single as shown you will likely some macro processor errors because the % and & characters will be treated as part of the macro language when inside double quotes.

If you string values are UTF-8 and have unicode characters present then you may have other issues if special characters are adjacent to your word.

Adubhai · Posted 08-15-2023 02:55 AM

Thanks ballardw. So did I undestand correctly that the solution you provided does not require concatenating the variables?

Patrick · Posted 08-15-2023 03:01 AM

@Adubhai It would help if you could post some representative sample data. If below data have doesn't create such data then please amend it and show us "reality".

Given your term includes a column (:) it's eventually possible to use the count() function and just search for substrings in a string. If your actual data supports this then the code could be as simple as below.

data have;
  infile datalines truncover dsd;
  input (var1-var5)(:$30.);
  datalines;
Daughter:,Daughter & Son,Daughter:And:Son:,dAughter:,Son and Daughter:
Daughter:,Daughter: & Son:,,dAughter:,Son and Daughter
;

data want;
  set have;
  Daughter_Cnt_1=count(cats(of var1-var5),'Daughter:','i');
  Daughter_Cnt_2=count(cats(of var1-var5),'Daughter:');
run;

proc print data=want;
run;

You can of course also create an array over your variables and then loop over it as proposed by @ballardw

Adubhai · Posted 08-15-2023 04:02 AM

Thanks Patrick, but I have gone with ballardw's solution and it works. Maybe yours work too but I didn't need to try it.

How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Re: How to count number of times a word appears in a series of string variables in each row?

Classroom Training Available!