DATA Step, Macro, Functions and more

Analyzing Multiple Response Variables

Accepted Solution Solved
Reply
Regular Contributor
Posts: 207
Accepted Solution

Analyzing Multiple Response Variables

I am seeking help on different approaches to analyzing multiple response variables (I have a dataset from a survey with many questions with responses that are checkboxes ("Check all that apply"). Sample dataset attached.

 

As a first approach, I am using PROC TABULATE and trying to follow these instructions. I am running into a problem, however. All my counts are the same and the percent is 100%.SAS_Studio.jpg

PROC TABULATE DATA=mchps.data_working;
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE 	pref_which___1="Inhalation - Smoking"
			pref_which___2="Inhalation - Vaporizing"
			pref_which___3="Oral - Ingestion"
			pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
			pref_which___5="Topical or transdermal"
			pref_which___6="Rectal, vaginal"
			pref_which___7="Other",
			(n='Count'*f=7. pctn<record_id>='Percent'*f=pctfmt9.) ;
	TITLE "Which methods of administration do they prefer";
RUN;

What am I doing wrong?

 

Also, I am not very familiar w/ PROC TABULATE, are there other PROCS that would be better suited for this task? Ultimately, I would like to be able to investigate statistically significant differences in these multiple response variables across groups, say gender for example.

 

Many thanks!

 


Accepted Solutions
Solution
5 hours ago
Super User
Posts: 13,886

Re: Analyzing Multiple Response Variables

N with VAR variables reports how many records have values. Since your example data shows no missing values then expect N to have the number of records.

 

I suspect that you may want to use SUM instead. The sum of 1/0 coded variables would be the count of YES values.

By the  same token the mean would be a percentage. You may want to use a format such as percent8.1 or similar with mean done this way.

 

Since your record_id is unique for each record you probably don't want to use for much in proc tabulate.

The table statement will support variable lists as well.

 

I do a lot of tables with Proc Tabulate and 1/0 coded variables like this.

You may want to consider adding labels to the variables in the dataset or using a label statement in the proc so the table syntax is a bit cleaner.

Depending on needs you might also use the box option to place the text of the title into the upper left box in lieu of a title statement.

 

PROC TABULATE DATA=mchps.data_working;
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE pref_which___1 - pref_which___7 ,
			(Sum='Count'*f=7. mean='Percent'*f=percent8.1) 
	     /box= "Which methods of administration do they prefer";
   label
      pref_which___1="Inhalation - Smoking"
      pref_which___2="Inhalation - Vaporizing"
      pref_which___3="Oral - Ingestion"
      pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
      pref_which___5="Topical or transdermal"
      pref_which___6="Rectal, vaginal"
      pref_which___7="Other"
  ;
RUN;

View solution in original post


All Replies
Solution
5 hours ago
Super User
Posts: 13,886

Re: Analyzing Multiple Response Variables

N with VAR variables reports how many records have values. Since your example data shows no missing values then expect N to have the number of records.

 

I suspect that you may want to use SUM instead. The sum of 1/0 coded variables would be the count of YES values.

By the  same token the mean would be a percentage. You may want to use a format such as percent8.1 or similar with mean done this way.

 

Since your record_id is unique for each record you probably don't want to use for much in proc tabulate.

The table statement will support variable lists as well.

 

I do a lot of tables with Proc Tabulate and 1/0 coded variables like this.

You may want to consider adding labels to the variables in the dataset or using a label statement in the proc so the table syntax is a bit cleaner.

Depending on needs you might also use the box option to place the text of the title into the upper left box in lieu of a title statement.

 

PROC TABULATE DATA=mchps.data_working;
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE pref_which___1 - pref_which___7 ,
			(Sum='Count'*f=7. mean='Percent'*f=percent8.1) 
	     /box= "Which methods of administration do they prefer";
   label
      pref_which___1="Inhalation - Smoking"
      pref_which___2="Inhalation - Vaporizing"
      pref_which___3="Oral - Ingestion"
      pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
      pref_which___5="Topical or transdermal"
      pref_which___6="Rectal, vaginal"
      pref_which___7="Other"
  ;
RUN;
Regular Contributor
Posts: 207

Re: Analyzing Multiple Response Variables

@ballardw Many thanks. Very helpful!

 

I see a DESCENDING option can be used in the CLASS statement, but do I need to use PROC SORT if I want the VAR variables listed in descending order (Frequency) in the absence of a CLASS variable?

 

Also, when I put a dichotomous variable in the CLASS statement, the output is unchanged and there are no errors in the log. Any clues?

Super User
Posts: 13,886

Re: Analyzing Multiple Response Variables


@jcorroon wrote:

@ballardw Many thanks. Very helpful!

 

I see a DESCENDING option can be used in the CLASS statement, but do I need to use PROC SORT if I want the VAR variables listed in descending order (Frequency) in the absence of a CLASS variable?

 

Also, when I put a dichotomous variable in the CLASS statement, the output is unchanged and there are no errors in the log. Any clues?


I would expect your dichotomous variables as class to create two rows of values, one for the 0 values and another for the 1. Then the N and PCTN may work but Sum and mean shouldn't. The CLASS option Order=Freq will sort responses by descending order of N within each class variable. That just means that the order of the 0 and 1 value rows may change for each of the variables. Show the code with the class variable that doesn't change the appearance.

 

There are different approaches depending on exactly what you want the table to look like.

They pretty much all involve summarizing the existing data in some way and then sorting that result on the N or Sum and then using the ORDER=DATA option.

If you are doing that for a large number of these variables then that might be worth it. If there is only the one set look at the result and then put the variables in the table statement in the order you would like and rerun the proc tabulate.

Regular Contributor
Posts: 207

Re: Analyzing Multiple Response Variables

@ballardw

 

While the counts in the table are correct, the percentages are not. For example, in row 1 (See embedded image) 31 is 12.8% of the total, 242, not 24.2%. Is this because I am using the MEAN function? 

 

Also, I think I figured out the situation w/ the CLASS variable, but I still would love for the table to be sorted in descending order and have a Total as the final row in each column (See uploaded image). Yes, I have a lot of these variables!

 

As far as the sorting goes, I see where you say I could sort "on the N or Sum" but those are not variables in the dataset, correct? Would I have to add them first before I can sort on them? 

 

PROC TABULATE DATA=mchps.data_working;
	CLASS gender; 
	VAR pref_which___1-pref_which___7 record_id;
	
	TABLE (pref_which___1 - pref_which___7)*(Sum='Count'*f=7. mean='Percent'*f=percent8.1) , (gender ALL) ;
	     
   LABEL
      pref_which___1="Inhalation - Smoking"
      pref_which___2="Inhalation - Vaporizing"
      pref_which___3="Oral - Ingestion"
      pref_which___4="Oral - Sublingual or intra-oral (mucosal) absorption"
      pref_which___5="Topical or transdermal"
      pref_which___6="Rectal, vaginal"
      pref_which___7="Other";
      
    TITLE "Which methods of administration do they prefer?";
    FORMAT gender gender_fmt.;
    LABEL gender = "Gender";
RUN;

 

 table.jpg

 

 

Thank you!!

Super User
Posts: 13,886

Re: Analyzing Multiple Response Variables

[ Edited ]

@jcorroon wrote:

@ballardw

 

While the counts in the table are correct, the percentages are not. For example, in row 1 (See embedded image) 31 is 12.8% of the total, 242, not 24.2%. Is this because I am using the MEAN function? 

 

Also, I think I figured out the situation w/ the CLASS variable, but I still would love for the table to be sorted in descending order and have a Total as the final row in each column (See uploaded image). Yes, I have a lot of these variables!

 

As far as the sorting goes, I see where you say I could sort "on the N or Sum" but those are not variables in the dataset, correct? Would I have to add them first before I can sort on them? 

 

  

 

Thank you!!


The approach I showed calculates a row percentage. You did not specify which percentage you might be interested in: Row, Column, Page or table. It was not possible from your original code which you were attempting. Given the example data I picked what made the most sense to me.

Also since your initial post indicated 170 records (the N value) I am not sure where your are getting 242 as a denominator. The example CSV you posted only had 172 also.

Is gender missing for any of your records? Tabulate will remove any record with missing values of class variables unless the "/ missing" option is used on the class statement. And if the actual response is missing (does not have a 0 or 1 value) the calculation will exclude them as well just as proc means does for calculating a mean.

 

Perhaps you need to provide some example data in the form of a data step so we have a common data set to run code against. It does not need to be very large maybe about 15 rows or so, with 2 or 3 of the analysis variables. Something like:

data have;
   input gender $ q1 q2;
datalines;
Male  0  0
Female 0 1
Female 1 0
Male 1 0
;
run;

 

 The sort order means you need to summarize your data and have a single variable that contains the row header. And if you are creating columns based on gender and total which order do you want? The All, female or male?

Regular Contributor
Posts: 207

Re: Analyzing Multiple Response Variables

[ Edited ]

@ballardw Thanks for staying w/ me. Sorry for the confusion!

 

<The approach I showed calculates a row percentage. You did not specify which percentage you might be interested>

 

I see. Apologies for not being more clear. Is there a way to specify which percentages to display? I may want to be flexible down the road.

 

<The example CSV you posted only  had 172 also.>

 

Yes, 172 observations, but each variable is a response/answer to a question. The responses to the questions are checkboxes ("Check all that apply" See image below). The percentages I am seeking are percentages of responses, as opposed to the percentage of respondents. The number of responses will be greater than the number of respondents. 

pref checkbox.jpg

 

I would like to be able to make statements about the frequency of responses, like: The most preferred method of administration was "Oral - Sublingual or intra-oral (mucosal) absorption)", 60 responses out of a total 249 of responses (242 in my last post. Sorry!), 24.1%.

 

I would then like to be able to compare these frequencies across different CLASS variables, like gender, and ultimately perform hypotheses tests.  

 

<Is gender missing for any of your records?>

 

Yes. 

 

<Perhaps you need to provide some example data>

 

I have uploaded a new dataset with gender.

 

<which order do you want?>

 

I want the order to apply to the rows. In other words, I would like pref_which___1 - pref_which___7 to be sorted by count/sum (i.e. pref_which___4, then pref_which___3, etc.). And, ideally, add a Total as a row as shown in the image of the table above.

 

Thank you again for your time and expertise here.

 

 

Regular Contributor
Posts: 235

Re: Analyzing Multiple Response Variables

I wrote a couple of papers about my solutions to Check-all-that-apply Q in surveys.

 

Macro_CheckAll_for_Check_All_That_Apply_Data

 

Macro_ShowComb_Combinations_from_Check_All_That_Apply_Data

 

Processing_Check-All-That-Apply

 

hmmm, well I had a page about converting vertical data to check-all but it is still on my to.do list.

 

for the above macros you'll need one or the other my macro array v1 (call execute) or v2 (sql)

Macro_Array_v1

 

Macro_Array_v2

 

Check-all returns a data set with rows of the positive response to the items in the Q.

if you have 5 response (e.g. A--E) then you ought to have 5 rows

but if no one selected one item D then you'll have only rows A--C,E 

 

The sum of percentages may be greater than 100%!     <---<<<

 

Show-Combination returns a data set with rows with each combination of the items:

A,B  ...

B,C,D ... 

C,E   ...

 

and this data is an output from proc frequency, the sum of percentages is equal to 100%

 

This problem is non-trivial; expect to spend a lot of time coming to understand the answers.

 

And, oh, by the way, sometimes, the order in which the choices are laid out may determine the responses!

 

There are a lot of papers written about this topic

Multiple-Choice-Questions (MCQ)

Multiple-Response-Questions (MRQ)

Check-All-That-Apply (CATA)

 

I suggest you get these two routines running for your data.

They are a great way to begin your data review!

 

hth

Ron Fehd  macro maven

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 90 views
  • 4 likes
  • 3 in conversation