Summing categorical variables

wj2 · Posted 12-07-2018 04:31 PM

Hello,

I have a set of 21 different binary (yes vs. no) variables and I would ultimately like to determine the average number of total "Yes" responses among my sample. That is, what is the average number of "Yes" responses out of the total of 21 possible "Yes" responses among the sample? Can someone please suggest an efficient way of coding a variable to do this? Thanks!

Reeza · Posted 12-07-2018 04:33 PM

Are you doing this per line or for the whole data set?

In general, you can just take the mean for a whole line.

new_var = mean(of var1-var21);

If all the variables need to be considered together across the whole data set, it becomes a bit harder. Please clarify; some sample input and output data would help as well.


wj2 · Posted 12-07-2018 05:33 PM

Thank you both for the prompt reply. Basically, I am working with a large survey data set (>4,000 subjects). The 21 variables correspond to 21 different medications used (yes (1) vs. no (0)). Among the total sample, I would like to know the average number of different medications used. So far, I have tried something like this but I'm not sure if this is correct:

new_var= (drug1=1)+(drug2=1)+(drug3=1)+(drug4=1)+(drug5=1)+(drug6=1)+(drug7=1)+(drug8=1)+ (drug9=1)+(drug10=1)+(drug11=1)+(drug12=1)+(drug13=1)+(drug14=1)+(drug15=1)+(drug16=1)+ (drug17=1)+(drug18=1)+(drug19=1)+(drug20+1)+(drug21=1);

To find the mean, I have just used the proc means procedure for the new variable:

proc means data=X;

var new_var;

run;

However, I am not sure if this is correct? Is there a better or more efficient way of doing this?

Please let me know if I can clarify further.

Thanks.

ballardw · Posted 12-07-2018 06:38 PM


Note that your code may create some 0 values for the sum that don't exist in your data if you have missing values for any of those drug variables. That may be your desire but be aware of the difference.
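The difference described above can be seen on a tiny made-up data set. This is only a sketch: the data set names demo and demo2, the variables count_bool and count_sum, and the three drug variables are hypothetical stand-ins for the real 21 variables.

```sas
/* Hypothetical three-row data set; drug1-drug3 stand in for drug1-drug21 */
data demo;
   input drug1 drug2 drug3;
   datalines;
1 0 1
1 . 0
. . .
;
run;

data demo2;
   set demo;
   /* Each comparison is 1 when true and 0 otherwise, so a missing value counts as 0 */
   count_bool = (drug1=1) + (drug2=1) + (drug3=1);
   /* SUM ignores missing arguments and returns missing only when all are missing */
   count_sum  = sum(of drug1-drug3);
run;

proc print data=demo2;
run;
```

For the third row, count_bool is 0 while count_sum is missing, so the two approaches can give different means when some subjects have no usable answers.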

If your drug variable is coded 0/1 and is numeric then you can get that sum as :

new_var = sum(of drug:);

That works if those are ALL of the variables whose names start with drug. If there are others, such as DRUG_date, they would also be included in the shorthand list created with the colon.

Or declare an array:

array d drug1-drug21;

new_var = sum(of d(*));

For added entertainment try adding:

new_var2 = mean(of d(*));

and then include new_var2 in your proc means.

With your current approach, the overall mean from PROC MEANS is the mean number of drugs per subject.
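Putting the pieces of this reply together, a minimal sketch might look like the following. The data set names have and want and the variable names n_drugs and prop_used are assumptions for illustration, not names from the original post.

```sas
data want;
   set have;                     /* 'have' is an assumed input data set name */
   array d drug1-drug21;         /* the 21 numeric 0/1 drug variables */
   n_drugs   = sum(of d(*));     /* number of drugs used by this subject */
   prop_used = mean(of d(*));    /* share of non-missing drug items coded 1 */
run;

proc means data=want n nmiss mean min max;
   var n_drugs prop_used;
run;
```

If no drug variable is missing, n_drugs equals 21 times prop_used on every row, so the two means reported by PROC MEANS differ only by that factor. The MIN and MAX columns also give a quick check that n_drugs stays between 0 and 21.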

wj2 · Posted 12-07-2018 07:27 PM

Hi ballardw,

Thank you for the suggestions. I ran the code I mentioned in my previous reply and I got the output shown below. I'm not sure why 22 values are showing when there are only 21 variables. Any feedback on this would be much appreciated.

new_var	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1	52	1.18	52	1.18
2	439	9.95	491	11.13
3	466	10.56	957	21.69
4	464	10.52	1421	32.21
5	510	11.56	1931	43.77
6	413	9.36	2344	53.13
7	463	10.49	2807	63.62
8	368	8.34	3175	71.96
9	352	7.98	3527	79.94
10	291	6.60	3818	86.54
11	225	5.10	4043	91.64
12	159	3.60	4202	95.24
13	92	2.09	4294	97.33
14	51	1.16	4345	98.48
15	24	0.54	4369	99.03
16	15	0.34	4384	99.37
17	11	0.25	4395	99.61
18	4	0.09	4399	99.71
19	4	0.09	4403	99.80
20	3	0.07	4406	99.86
21	1	0.02	4407	99.89
22	5	0.11	4412	100.00
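One thing worth checking, as a reading of the posted code rather than something confirmed in the thread: the expression contains the term (drug20+1) where the other terms use the pattern (drugN=1). If drug20 is coded 0/1, that term contributes 1 or 2 instead of 0 or 1, which would shift the possible range from 0-21 to 1-22. That is consistent with a table that has a row for 22 and none for 0. The small sketch below uses a hypothetical data set typo_demo to show the difference.

```sas
data typo_demo;
   drug20 = 0; term = (drug20 + 1); output;  /* term = 1, although the drug was not used */
   drug20 = 1; term = (drug20 + 1); output;  /* term = 2 */
   drug20 = 1; term = (drug20 = 1); output;  /* term = 1, the intended comparison */
run;

proc print data=typo_demo;
run;
```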

mkeintz · Posted 12-08-2018 01:42 PM

Well, this problem is begging you to look at the data!!!

data problem;

set have;

if sum(of drug:) > 21; /* subsetting IF; a WHERE expression does not accept OF variable lists */

run;

Or better yet (in case an observation has both a -1 and a +2 - which wouldn't produce a sum over 21):

data problem;

set have;

if max(of drug:) > 1 or min(of drug:) < 0; /* keep rows with any value outside 0/1 */

run;

If Socrates were alive now, perhaps he would issue a corollary to his best-known dictum:

Know Thy Data
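A complementary way to look at the data, sketched here on the assumption that the data set is named have as in the steps above, is a one-way frequency table for each drug variable. Any value other than 0 or 1, and any missing values, would show up directly.

```sas
proc freq data=have;
   tables drug1-drug21 / missing;  /* MISSING lists missing values as their own level */
run;
```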


PeterClemmensen · Posted 12-07-2018 04:34 PM

Please be more specific. If possible, provide some example data and what you want your desired outcome to look like 🙂
