Solved: Re: Analyzing Multiple Response Variables

_maldini_ · Posted 08-08-2018 03:38 PM

I am seeking help on different approaches to analyzing multiple response variables (I have a dataset from a survey with many questions with responses that are checkboxes ("Check all that apply"). Sample dataset attached.

As a first approach, I am using PROC TABULATE and trying to follow these instructions. I am running into a problem, however. All my counts are the same and the percent is 100%.

PROC TABULATE DATA=mchps.data_working;
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE 	pref_which___1="Inhalation - Smoking"
			pref_which___2="Inhalation - Vaporizing"
			pref_which___3="Oral - Ingestion"
			pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
			pref_which___5="Topical or transdermal"
			pref_which___6="Rectal, vaginal"
			pref_which___7="Other",
			(n='Count'*f=7. pctn<record_id>='Percent'*f=pctfmt9.) ;
	TITLE "Which methods of administration do they prefer";
RUN;

What am I doing wrong?

Also, I am not very familiar w/ PROC TABULATE, are there other PROCS that would be better suited for this task? Ultimately, I would like to be able to investigate statistically significant differences in these multiple response variables across groups, say gender for example.

Many thanks!

ballardw · Posted 08-08-2018 03:59 PM

N with VAR variables reports how many records have values. Since your example data shows no missing values then expect N to have the number of records.

I suspect that you may want to use SUM instead. The sum of 1/0 coded variables would be the count of YES values.

By the same token the mean would be a percentage. You may want to use a format such as percent8.1 or similar with mean done this way.

Since your record_id is unique for each record you probably don't want to use for much in proc tabulate.

The table statement will support variable lists as well.

I do a lot of tables with Proc Tabulate and 1/0 coded variables like this.

You may want to consider adding labels to the variables in the dataset or using a label statement in the proc so the table syntax is a bit cleaner.

Depending on needs you might also use the box option to place the text of the title into the upper left box in lieu of a title statement.

PROC TABULATE DATA=mchps.data_working;
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE pref_which___1 - pref_which___7 ,
			(Sum='Count'*f=7. mean='Percent'*f=percent8.1) 
	     /box= "Which methods of administration do they prefer";
   label
      pref_which___1="Inhalation - Smoking"
      pref_which___2="Inhalation - Vaporizing"
      pref_which___3="Oral - Ingestion"
      pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
      pref_which___5="Topical or transdermal"
      pref_which___6="Rectal, vaginal"
      pref_which___7="Other"
  ;
RUN;

View solution in original post

ballardw · Posted 08-08-2018 03:59 PM

N with VAR variables reports how many records have values. Since your example data shows no missing values then expect N to have the number of records.

I suspect that you may want to use SUM instead. The sum of 1/0 coded variables would be the count of YES values.

By the same token the mean would be a percentage. You may want to use a format such as percent8.1 or similar with mean done this way.

Since your record_id is unique for each record you probably don't want to use for much in proc tabulate.

The table statement will support variable lists as well.

I do a lot of tables with Proc Tabulate and 1/0 coded variables like this.

You may want to consider adding labels to the variables in the dataset or using a label statement in the proc so the table syntax is a bit cleaner.

Depending on needs you might also use the box option to place the text of the title into the upper left box in lieu of a title statement.

PROC TABULATE DATA=mchps.data_working;
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE pref_which___1 - pref_which___7 ,
			(Sum='Count'*f=7. mean='Percent'*f=percent8.1) 
	     /box= "Which methods of administration do they prefer";
   label
      pref_which___1="Inhalation - Smoking"
      pref_which___2="Inhalation - Vaporizing"
      pref_which___3="Oral - Ingestion"
      pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
      pref_which___5="Topical or transdermal"
      pref_which___6="Rectal, vaginal"
      pref_which___7="Other"
  ;
RUN;

_maldini_ · Posted 08-08-2018 05:33 PM

@ballardw Many thanks. Very helpful!

I see a DESCENDING option can be used in the CLASS statement, but do I need to use PROC SORT if I want the VAR variables listed in descending order (Frequency) in the absence of a CLASS variable?

Also, when I put a dichotomous variable in the CLASS statement, the output is unchanged and there are no errors in the log. Any clues?

ballardw · Posted 08-08-2018 06:11 PM

@_maldini_ wrote:

@ballardw Many thanks. Very helpful!

I see a DESCENDING option can be used in the CLASS statement, but do I need to use PROC SORT if I want the VAR variables listed in descending order (Frequency) in the absence of a CLASS variable?

Also, when I put a dichotomous variable in the CLASS statement, the output is unchanged and there are no errors in the log. Any clues?

I would expect your dichotomous variables as class to create two rows of values, one for the 0 values and another for the 1. Then the N and PCTN may work but Sum and mean shouldn't. The CLASS option Order=Freq will sort responses by descending order of N within each class variable. That just means that the order of the 0 and 1 value rows may change for each of the variables. Show the code with the class variable that doesn't change the appearance.

There are different approaches depending on exactly what you want the table to look like.

They pretty much all involve summarizing the existing data in some way and then sorting that result on the N or Sum and then using the ORDER=DATA option.

If you are doing that for a large number of these variables then that might be worth it. If there is only the one set look at the result and then put the variables in the table statement in the order you would like and rerun the proc tabulate.

_maldini_ · Posted 08-09-2018 01:23 PM

@ballardw

While the counts in the table are correct, the percentages are not. For example, in row 1 (See embedded image) 31 is 12.8% of the total, 242, not 24.2%. Is this because I am using the MEAN function?

Also, I think I figured out the situation w/ the CLASS variable, but I still would love for the table to be sorted in descending order and have a Total as the final row in each column (See uploaded image). Yes, I have a lot of these variables!

As far as the sorting goes, I see where you say I could sort "on the N or Sum" but those are not variables in the dataset, correct? Would I have to add them first before I can sort on them?

PROC TABULATE DATA=mchps.data_working;
	CLASS gender; 
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE (pref_which___1 - pref_which___7)*(Sum='Count'*f=7. mean='Percent'*f=percent8.1) , (gender ALL) ;
	     
   LABEL
      pref_which___1="Inhalation - Smoking"
      pref_which___2="Inhalation - Vaporizing"
      pref_which___3="Oral - Ingestion"
      pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
      pref_which___5="Topical or transdermal"
      pref_which___6="Rectal, vaginal"
      pref_which___7="Other";
      
    TITLE "Which methods of administration do they prefer?";
    FORMAT gender gender_fmt.;
    LABEL gender = "Gender";
RUN;

Thank you!!

ballardw · Posted 08-09-2018 02:13 PM

@_maldini_ wrote:

@ballardw

While the counts in the table are correct, the percentages are not. For example, in row 1 (See embedded image) 31 is 12.8% of the total, 242, not 24.2%. Is this because I am using the MEAN function?

Also, I think I figured out the situation w/ the CLASS variable, but I still would love for the table to be sorted in descending order and have a Total as the final row in each column (See uploaded image). Yes, I have a lot of these variables!

As far as the sorting goes, I see where you say I could sort "on the N or Sum" but those are not variables in the dataset, correct? Would I have to add them first before I can sort on them?

Thank you!!

The approach I showed calculates a row percentage. You did not specify which percentage you might be interested in: Row, Column, Page or table. It was not possible from your original code which you were attempting. Given the example data I picked what made the most sense to me.

Also since your initial post indicated 170 records (the N value) I am not sure where your are getting 242 as a denominator. The example CSV you posted only had 172 also.

Is gender missing for any of your records? Tabulate will remove any record with missing values of class variables unless the "/ missing" option is used on the class statement. And if the actual response is missing (does not have a 0 or 1 value) the calculation will exclude them as well just as proc means does for calculating a mean.

Perhaps you need to provide some example data in the form of a data step so we have a common data set to run code against. It does not need to be very large maybe about 15 rows or so, with 2 or 3 of the analysis variables. Something like:

data have;
   input gender $ q1 q2;
datalines;
Male  0  0
Female 0 1
Female 1 0
Male 1 0
;
run;

The sort order means you need to summarize your data and have a single variable that contains the row header. And if you are creating columns based on gender and total which order do you want? The All, female or male?

_maldini_ · Posted 08-09-2018 03:15 PM

@ballardw Thanks for staying w/ me. Sorry for the confusion!

I see. Apologies for not being more clear. Is there a way to specify which percentages to display? I may want to be flexible down the road.

Yes, 172 observations, but each variable is a response/answer to a question. The responses to the questions are checkboxes ("Check all that apply" See image below). The percentages I am seeking are percentages of responses, as opposed to the percentage of respondents. The number of responses will be greater than the number of respondents.

I would like to be able to make statements about the frequency of responses, like: The most preferred method of administration was "Oral - Sublingual or intra-oral (mucosal) absorption)", 60 responses out of a total 249 of responses (242 in my last post. Sorry!), 24.1%.

I would then like to be able to compare these frequencies across different CLASS variables, like gender, and ultimately perform hypotheses tests.

Yes.

I have uploaded a new dataset with gender.

I want the order to apply to the rows. In other words, I would like pref_which___1 - pref_which___7 to be sorted by count/sum (i.e. pref_which___4, then pref_which___3, etc.). And, ideally, add a Total as a row as shown in the image of the table above.

Thank you again for your time and expertise here.

_maldini_ · Posted 08-15-2018 01:21 PM

@ballardw

I am still trying to:

1. Sort the PROC TABULATE results by descending frequency.

2. Sum the individual row counts into a final row count representing the total (See image).

You said: "The sort order means you need to summarize your data and have a single variable that contains the row header."

I'm not sure I follow. A single variable that contains the row header? Are you referring to creating a COUNT variable and outputting the total count to a new variable and then merging that back into the dataset?

ballardw · Posted 08-15-2018 04:19 PM

Tabulate will not sum multiple variables. It isn't designed to. You have to create single variable to sum or count to get a total count.

One way:

data example;
   input obs x1 x2 x3;
datalines;
1 1 0 0
2 0 1 1
3 1 1 0
4 0 1 0
5 1 0 1
6 0 1 1
;
run;
Proc transpose data=example
   out=trans;
by obs;
run;

proc tabulate data=trans;
  where col1=1;
  class _name_ /order=freq;
  table _name_ all='Total',
        n='n'*f=best6. 
      /
   ;
run;

But if there are other variables to consider the "where" may not be useable but is needed to reduce the counts for the variables.

Create format for the _name_ variable to get better table results than the variable name. Or assign a better label to the variable and use the _label_ instead of _name_ variable.

data example;
   input obs x1 x2 x3;
   label x1='First'
         x2='Second'
         x3='Third'
   ;
datalines;
1 1 0 0
2 0 1 1
3 1 1 0
4 0 1 0
5 1 0 1
6 0 1 1
;
run;
Proc transpose data=example
   out=trans;
by obs;
run;

proc tabulate data=trans;
  where col1=1;
  class _label_ /order=freq;
  label _label_='Response';
  table _label_ all='Total',
        n='n'*f=best6.      /
   ;
run;

Or use col1 as a var and use sum on col1. HOWEVER since you are using the filter on the value to get the N in order the percents can't come out correct.

This may help for some things:

data example;
   input obs x1 x2 x3;
   label x1='First'
         x2='Second'
         x3='Third'
   ;
datalines;
1 1 0 0
2 0 1 1
3 1 1 0
4 0 1 0
5 1 0 1
6 0 1 1
;
run;
Proc transpose data=example
   out=trans;
by obs;
run;

proc tabulate data=trans;
  where col1=1;
  class _label_ /order=freq;
  label _label_='Response';
  table _label_ all='Total',
        n='n'*f=best6. colpctn='%'
      /
   ;
run;

proc summary data=trans;
   class _label_;
   var col1;
   output out=transsum n= sum=/autoname;
run;
proc sort data=transsum;
   by descending _type_ descending col1_sum ;
run;

data display;
  set transsum;
  if missing(_label_) then do;
      _label_ ='Total';
      percent= 100;
  end;
  else percent = 100*(col1_sum/col1_n);
run;

proc print data=display noobs label;
   var _label_ col1_sum percent;
   label _label_='Response'
         col1_sum='Number selected'
         percent='Percent of Respondents'
   ;
   format col1_sum best6. percent f5.1;
run;

but if you want a column percent you'll need to merge the to total sum with all of the records in display to calculate the summary.

rangarat · Posted 02-07-2019 09:22 AM

On a related note, how would you handle the following notes when running the above code?

NOTE: Character values have been converted to numeric values at the places given by:
(Line):(Column).
133:9 134:9 135:9 136:9 137:9 138:9 139:9 140:9 141:9 142:9 143:9
144:9 145:9 146:9 147:9 148:9 149:9 150:9 151:9 152:9 153:9 154:9
155:9 156:9 157:9 158:9 159:9 160:9 161:9 162:9 163:9 164:9
NOTE: Invalid numeric data, Symptoms_prior_to_dx='15 20 22' , at line 133 column 9

WARNING: Limit set by ERRORS= option reached. Further errors of this type will not be printed

Ron_MacroMaven · Posted 08-08-2018 04:07 PM

I wrote a couple of papers about my solutions to Check-all-that-apply Q in surveys.

Macro_CheckAll_for_Check_All_That_Apply_Data

Macro_ShowComb_Combinations_from_Check_All_That_Apply_Data

Processing_Check-All-That-Apply

hmmm, well I had a page about converting vertical data to check-all but it is still on my to.do list.

for the above macros you'll need one or the other my macro array v1 (call execute) or v2 (sql)

Macro_Array_v1

Macro_Array_v2

Check-all returns a data set with rows of the positive response to the items in the Q.

if you have 5 response (e.g. A--E) then you ought to have 5 rows

but if no one selected one item D then you'll have only rows A--C,E

The sum of percentages may be greater than 100%! <---<<<

Show-Combination returns a data set with rows with each combination of the items:

A,B ...

B,C,D ...

C,E ...

and this data is an output from proc frequency, the sum of percentages is equal to 100%

This problem is non-trivial; expect to spend a lot of time coming to understand the answers.

And, oh, by the way, sometimes, the order in which the choices are laid out may determine the responses!

There are a lot of papers written about this topic

Multiple-Choice-Questions (MCQ)

Multiple-Response-Questions (MRQ)

Check-All-That-Apply (CATA)

I suggest you get these two routines running for your data.

They are a great way to begin your data review!

hth

Ron Fehd macro maven

Registration is open

SAS Training: Just a Click Away