I have a dataset with 43 string variables named VAR1 till VAR43. There are various string phrases within each value of those variables. I want a new column which counts the number of times the word "Daugther" is written within the 43 variables for each row. Can someone help me with it?
The clarification, especially including the : is very helpful.
Brute force go through each variable and each word. Increment a counter each time the desired word is encountered.
data want; set have; array w(*) var1-var43; do i=1 to dim(w); if not missing(w[i]) then do j=1 to countw(w[i]); if scan(w[i],j)='Daughter:' then count=sum(count,1); end; end; run;
This will count if the desired word occurs more than once in a single variable, which I have not seen ruled out as a possibility.
If you have other than default delimiters between words you may have to add them to the SCAN function call.
If your computer uses ASCII characters, the default delimiters are as follows:blank ! $ % & ( ) * + , - . / ; < ^ |In ASCII environments that do not contain the ^ character, the SCAN function uses the ~ character instead. If your computer uses EBCDIC characters, then the default delimiters are as follows:blank ! $ % & ( ) * + , - . / ; < ¬ | ¢
So if running in an ASCII environment and you need to find the word delimited with @ characters for example: @Dauthter; then scan wouldn't treat the @ as a delimiter unless you use
if scan(w[i],j,' !$%&()*+,-./;<^|@') ='Daughter:'
to add the @ to the existing list.
Caution: if you place that string inside double quotes instead of single as shown you will likely some macro processor errors because the % and & characters will be treated as part of the macro language when inside double quotes.
If you string values are UTF-8 and have unicode characters present then you may have other issues if special characters are adjacent to your word.
One way could be to just concatenate all these variables using catx() and then use countw(). This should work as long as the sum of your strings won't exceed 32767 characters.
What's the length of your variables var1 to var43?
@Adubhai wrote:
Hi Patrick,
Thanks for your reply. The variables have random and multiple character lengths.
So if the longest of the 43 variables is no more than floor(32767/43) in length, you don't need to worry that concatenating the random-length variables will result in a string too long to be captured in a single character variable. In that case the suggestion by @Patrick will achieve your goal.
Ok so assuming none of the variables are no more than (32767/43) in length and I am able to concatenate the variables. Then how do I count the number of times the word "Daughter" is written in the new concatenated variable?
Some incomplete elements in your description. You say count the word "Daugther" (or more likely for English speakers "Daughter").
Does case matter? Does "daugther", ""DAUGTHER", "dAughTER" or similar differences in capitalization count?
How about presence in compound words (assuming correct spelling now) such as "Daughter-in-law" or "daughtering"?
Plurals "Daughters"?
Plurals with different case?
Provide example data showing the differences I have described above and the expected resulting count for the examples.
Should only need 4 or 5 variables though instead of all 43. Provide the data as a working data step.
The clarification, especially including the : is very helpful.
Brute force go through each variable and each word. Increment a counter each time the desired word is encountered.
data want; set have; array w(*) var1-var43; do i=1 to dim(w); if not missing(w[i]) then do j=1 to countw(w[i]); if scan(w[i],j)='Daughter:' then count=sum(count,1); end; end; run;
This will count if the desired word occurs more than once in a single variable, which I have not seen ruled out as a possibility.
If you have other than default delimiters between words you may have to add them to the SCAN function call.
If your computer uses ASCII characters, the default delimiters are as follows:blank ! $ % & ( ) * + , - . / ; < ^ |In ASCII environments that do not contain the ^ character, the SCAN function uses the ~ character instead. If your computer uses EBCDIC characters, then the default delimiters are as follows:blank ! $ % & ( ) * + , - . / ; < ¬ | ¢
So if running in an ASCII environment and you need to find the word delimited with @ characters for example: @Dauthter; then scan wouldn't treat the @ as a delimiter unless you use
if scan(w[i],j,' !$%&()*+,-./;<^|@') ='Daughter:'
to add the @ to the existing list.
Caution: if you place that string inside double quotes instead of single as shown you will likely some macro processor errors because the % and & characters will be treated as part of the macro language when inside double quotes.
If you string values are UTF-8 and have unicode characters present then you may have other issues if special characters are adjacent to your word.
@Adubhai It would help if you could post some representative sample data. If below data have doesn't create such data then please amend it and show us "reality".
Given your term includes a column (:) it's eventually possible to use the count() function and just search for substrings in a string. If your actual data supports this then the code could be as simple as below.
data have;
infile datalines truncover dsd;
input (var1-var5)(:$30.);
datalines;
Daughter:,Daughter & Son,Daughter:And:Son:,dAughter:,Son and Daughter:
Daughter:,Daughter: & Son:,,dAughter:,Son and Daughter
;
data want;
set have;
Daughter_Cnt_1=count(cats(of var1-var5),'Daughter:','i');
Daughter_Cnt_2=count(cats(of var1-var5),'Daughter:');
run;
proc print data=want;
run;
You can of course also create an array over your variables and then loop over it as proposed by @ballardw
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.