BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Adubhai
Obsidian | Level 7

I have a dataset with 43 string variables named VAR1 till VAR43. There are various string phrases within each value of those variables. I want a new column which counts the number of times the word "Daugther" is written within the 43 variables for each row. Can someone help me with it? 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

The clarification, especially including the : is very helpful.

 

Brute force go through each variable and each word. Increment a counter each time the desired word is encountered.

data want;
   set have;
   array w(*) var1-var43;
   do i=1 to dim(w);
      if not missing(w[i]) then do j=1 to countw(w[i]);
         if scan(w[i],j)='Daughter:' then count=sum(count,1);
      end;
   end;
run;

This will count if the desired word occurs more than once in a single variable, which I have not seen ruled out as a possibility.

If you have other than default delimiters between words you may have to add them to the SCAN function call.

  • If your computer uses ASCII characters, the default delimiters are as follows:
    blank ! $ % & ( ) * + , - . / ; < ^ |
    In ASCII environments that do not contain the ^ character, the SCAN function uses the ~ character instead.
  • If your computer uses EBCDIC characters, then the default delimiters are as follows:
    blank ! $ % & ( ) * + , - . / ; < ¬ | ¢

So if running in an ASCII environment and you need to find the word delimited with @ characters for example: @Dauthter; then scan wouldn't treat the @ as a delimiter unless you use

if scan(w[i],j,' !$%&()*+,-./;<^|@') ='Daughter:'

to add the @ to the existing list.

Caution: if you place that string inside double quotes instead of single as shown you will likely some macro processor errors because the % and & characters will be treated as part of the macro language when inside double quotes.

 

If you string values are UTF-8 and have unicode characters present then you may have other issues if special characters are adjacent to your word.

 

 

 

View solution in original post

10 REPLIES 10
Patrick
Opal | Level 21

One way could be to just concatenate all these variables using catx() and then use countw(). This should work as long as the sum of your strings won't exceed 32767 characters. 

What's the length of your variables var1 to var43?

Adubhai
Obsidian | Level 7
Hi Patrick,

Thanks for your reply. The variables have random and multiple character lengths.
mkeintz
PROC Star

@Adubhai wrote:
Hi Patrick,

Thanks for your reply. The variables have random and multiple character lengths.

So if the longest of the 43 variables is no more than floor(32767/43) in length, you don't need to worry that concatenating the random-length variables will result in a string too long to be captured in a single character variable.    In that case the suggestion by @Patrick will achieve your goal.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Adubhai
Obsidian | Level 7

Ok so assuming none of the variables are no more than (32767/43) in length and I am able to concatenate the variables. Then how do I count the number of times the word "Daughter" is written in the new concatenated variable? 

ballardw
Super User

Some incomplete elements in your description. You say count the word "Daugther" (or more likely for English speakers "Daughter").

Does case matter? Does "daugther", ""DAUGTHER", "dAughTER" or similar differences in capitalization count?

How about presence in compound words (assuming correct spelling now) such as "Daughter-in-law" or "daughtering"?

Plurals "Daughters"?

Plurals with different case?

 

Provide example data showing the differences I have described above and the expected resulting count for the examples.

Should only need 4 or 5 variables though instead of all 43. Provide the data as a working data step.

Adubhai
Obsidian | Level 7
Thanks for pointing it out. It was a typo from my end. The word is specifically "Daughter:" with the ":" and no other variation of it.
ballardw
Super User

The clarification, especially including the : is very helpful.

 

Brute force go through each variable and each word. Increment a counter each time the desired word is encountered.

data want;
   set have;
   array w(*) var1-var43;
   do i=1 to dim(w);
      if not missing(w[i]) then do j=1 to countw(w[i]);
         if scan(w[i],j)='Daughter:' then count=sum(count,1);
      end;
   end;
run;

This will count if the desired word occurs more than once in a single variable, which I have not seen ruled out as a possibility.

If you have other than default delimiters between words you may have to add them to the SCAN function call.

  • If your computer uses ASCII characters, the default delimiters are as follows:
    blank ! $ % & ( ) * + , - . / ; < ^ |
    In ASCII environments that do not contain the ^ character, the SCAN function uses the ~ character instead.
  • If your computer uses EBCDIC characters, then the default delimiters are as follows:
    blank ! $ % & ( ) * + , - . / ; < ¬ | ¢

So if running in an ASCII environment and you need to find the word delimited with @ characters for example: @Dauthter; then scan wouldn't treat the @ as a delimiter unless you use

if scan(w[i],j,' !$%&()*+,-./;<^|@') ='Daughter:'

to add the @ to the existing list.

Caution: if you place that string inside double quotes instead of single as shown you will likely some macro processor errors because the % and & characters will be treated as part of the macro language when inside double quotes.

 

If you string values are UTF-8 and have unicode characters present then you may have other issues if special characters are adjacent to your word.

 

 

 

Adubhai
Obsidian | Level 7
Thanks ballardw. So did I undestand correctly that the solution you provided does not require concatenating the variables?
Patrick
Opal | Level 21

@Adubhai It would help if you could post some representative sample data. If below data have doesn't create such data then please amend it and show us "reality".

Given your term includes a column (:) it's eventually possible to use the count() function and just search for substrings in a string. If your actual data supports this then the code could be as simple as below.

data have;
  infile datalines truncover dsd;
  input (var1-var5)(:$30.);
  datalines;
Daughter:,Daughter & Son,Daughter:And:Son:,dAughter:,Son and Daughter:
Daughter:,Daughter: & Son:,,dAughter:,Son and Daughter
;

data want;
  set have;
  Daughter_Cnt_1=count(cats(of var1-var5),'Daughter:','i');
  Daughter_Cnt_2=count(cats(of var1-var5),'Daughter:');
run;

proc print data=want;
run;

 You can of course also create an array over your variables and then loop over it as proposed by @ballardw 

Adubhai
Obsidian | Level 7
Thanks Patrick, but I have gone with ballardw's solution and it works. Maybe yours work too but I didn't need to try it.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 579 views
  • 0 likes
  • 4 in conversation