Ming
Calcite | Level 5

Hi,

I have a relatively confusing experiment that tries to quantify the fitness level of an individual based on different types of exercises.

50 subjects total; each subject is categorized into one of 3 groups (control/mov/fit) and did 20 different types of exercise (vertical jump, hand grip strength, ....), which are to be used to explain the variables in experiment 2.

Once each subject has completed the 20 initial exercises, they proceed to experiment 2, where each subject performs 8 different tasks (symmetric pull, asymmetric pull,....) and 3 spine measurements (compression and 2 shears) are recorded.  What I want to ask is: how do the 20 exercises explain the 3 spine measurements?

I'm unsure what methods to use.  I've been told PCA (not sure what this is), multiple regression (stepwise), or multivariate regression (gets tricky with the multi-level tasks), or maybe other methods?!?  I'm not even sure whether the number of relevant IVs (out of the 20 exercises) differs between tasks for the DVs (compression, shears).  For example, for symmetric pull, there may be 7 exercises that explain the 3 DVs well, while for asymmetric pull there might be only 3 exercises.  Or maybe there are 6 exercises that explain the 3 DVs for all 8 tasks.

Any suggestions on how to tackle this problem will be greatly appreciated.

Attached is a sample data set.

thanks.

ming

1zmm
Quartz | Level 8

If the three dependent variables are in the same units, first consider a multivariate analysis of variance (MANOVA) on these three dependent variables of the eight different tasks by each of the twenty exercises using PROC GLM (see the documentation):

  proc sort;

    by exercise;

  run;

  proc glm;

    by exercise;

    class task;

    model compression shear1 shear2=task / solution;

    manova h=_all_ / printe printh summary;

  run;

  quit;

This would look at the "effects" of the eight tasks on the three dependent variables, taking into account the correlation among these dependent variables, for each of the twenty exercises taken separately.

You can specify, using PROC FORMAT, the task that is the "reference" category among all the eight tasks.  For example, to specify the "asymmetric pull" as the reference category of tasks, use the following (or similar syntax):

  proc format;

    value taskfm 1="Symmetric pull"

                 2="} Asymmetric pull"

                 /* etc. */

                 8="Last task";

  run;

Then use the option, ORDER=FORMATTED, on the PROC GLM statement, and add the following statement,

  format task taskfm.;

to the PROC GLM "paragraph".  The leading "}" in the label is presumably deliberate: with ORDER=FORMATTED, PROC GLM orders the levels by their formatted values and treats the last level as the reference, and "}" sorts after the letters in ASCII.

Another possibility would be to "nest" task within exercise instead of looking at the effects of the tasks on the dependent variables separately.  This nesting assumes that each combination of task and exercise (up to 160 of them) is considered separately and similarly:

  proc glm order=formatted;

    class task exercise;

    model compression shear1 shear2=task(exercise) / solution;

    manova h=_all_ / printe printh summary;

    format task taskfm. exercise exercisef.;

  run;

  quit;

By the way, PCA is the abbreviation for principal components analysis, which tries to identify linear combinations of numerical, interval-ratio level variables that best "explain" the combined variance of these variables.  PCA attempts to reduce the number of variables by using fewer principal components as independent variables in your analysis.  Since your independent variables, exercise and task, are nominal variables, PCA is not relevant for them.  However, you could use PCA to develop a single component to represent your three interval-ratio level dependent variables; whether this is worthwhile, you'd have to decide.
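
The idea of a single component for the three dependent variables could be sketched with PROC PRINCOMP.  This is only a minimal sketch, assuming the DVs are named comp, shear1, and shear2 in a data set called wide (the names used in later posts in this thread); the output data set name wide_pc is hypothetical.

```sas
/* Extract the first principal component of the three DVs.       */
/* wide_pc is a hypothetical output data set name; Prin1 is the  */
/* default name PROC PRINCOMP gives to the first component score. */
proc princomp data=wide n=1 out=wide_pc;
    var comp shear1 shear2;
run;
```

By default PROC PRINCOMP works from the correlation matrix, so the differing units of the DVs do not dominate the component.  The score Prin1 in wide_pc could then serve as a single DV, at the cost of a less direct interpretation.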

Ming
Calcite | Level 5

Hi,

Thanks for the suggestion!  Just a question: if the 3 DVs do not have the same units (the unit for compression is not the same as for shear), does that mean I cannot use MANOVA?

thanks.

ming

1zmm
Quartz | Level 8

Unfortunately, you should use the same units when you use MANOVA.  One of the advantages of MANOVA over that of separate ANOVAs is that MANOVA accounts for the correlation among related DVs.  One subterfuge you can try is to create standardized values for your DVs by subtracting the mean DV from the individual value of the DV and dividing this difference by the standard deviation of the DV.  Then, for all the DVs, these standardized DVs are unit-less with a mean of zero and a variance of 1.  The problem with this "solution" is the interpretation of the results in terms of standardized units.  An alternative solution is only to analyze multiple DVs with the same units using MANOVA and to analyze single DVs with different units using ANOVA.
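
The standardization described above could be done with PROC STANDARD.  This is a minimal sketch, assuming the unstandardized DVs are named comp, shear1, and shear2 in a data set called wide (names taken from later posts in this thread); the output data set name wide_std is hypothetical.

```sas
/* Standardize the three DVs to mean 0 and standard deviation 1. */
/* wide_std is a hypothetical output data set name.              */
proc standard data=wide mean=0 std=1 out=wide_std;
    var comp shear1 shear2;
run;
```

PROC STANDARD replaces the values of the listed variables in the output data set rather than creating new ones, and here the mean and standard deviation are computed over all rows pooled (all subjects and tasks together).  Whether pooling is appropriate for this design is a judgment call.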

Ming
Calcite | Level 5

Hi,

So here is my code according to your first suggestion (manova with each of the exercise separately) with standardized DV (comp and 2 shear variables).

data wide;

   input task$ subject$ group$ time$ height weight std_comp std_shr1 std_shr2 exerc1-exerc20;

data long; set wide;

   array e(20) exerc1-exerc20;

   do i = 1 to 20;

      exercise = i;

      exerc = e(i);

      output;

   end;

   drop i exerc1-exerc20;

proc sort data=long;

   by exercise;

proc glm data=long; by exercise;

   class task group;

   model std_comp std_shr1 std_shr2 = task /solution;

   manova h=_all_ /printe printh summary;

run;

quit;

Does that look right?  How do I do the same thing but with all the exercises included together (i.e., not doing a separate MANOVA for each exercise)?  How about doing doubly multivariate repeated measures using PROC MIXED?  I attempted that, but I got the error message "did not converge" and it stopped.

here is my proc mixed code: (I left the DVs as their usual values, not standardized)  should I standardize them?

data wide;

   input task$ subject$ group$ time$ height weight comp shear1 shear2 exerc1-exerc20;

data long; set wide;

   length var $12.;

   response = comp; var = 'comp'; output;

   response = shear1; var = 'shear1'; output;

   response = shear2; var = 'shear2'; output;

   drop comp shear1 shear2;

proc mixed data=long;

   class task var subject group;

   model response = task group height weight exerc1-exerc20;

   repeated var task / type=un@ar(1) subject=subject;

run;

quit;



thanks for your help!


ming

1zmm
Quartz | Level 8

Your PROC GLM code looks all right, except that I've forgotten what the variable, GROUP, in the PROC GLM CLASS statement refers to.  For each of the tasks within each kind of exercise, this PROC GLM code will perform multivariate analysis of variance (MANOVA).

To have all the exercises included together, I'd change the syntax of the PROC GLM "paragraph" to the following:

  proc glm data=long;

      class task exercise;

      model std_comp std_shr1 std_shr2 = task(exercise) / solution;

      manova h=_all_ / printe printh summary;

  run;

  quit;

This will perform MANOVAs for tasks nested within each exercise.  This is equivalent to having an interaction term, TASK*EXERCISE, without any "main-effect" terms for TASK and EXERCISE.  This so-called "cell-means" model assumes that the "effect" of a specific TASK on the three dependent variables may differ by the kind of EXERCISE and that the "effect" of a specific EXERCISE on the three dependent variables may differ by the kind of TASK.

An alternative model would include main-effect independent variable terms for TASK and EXERCISE in the MODEL statement as well as their interaction.  Then, the interaction term, TASK*EXERCISE, would be interpreted as showing additional effects on the three dependent variables after the effects of the main-effect terms were accounted for:

   model std_comp std_shr1 std_shr2=task exercise task*exercise / solution;

However, the original description of your study seemed to fit the nested cell-means model more closely than this latter "saturated" model.

You could do doubly multivariate repeated measures using PROC MIXED.  I don't know why you received the error message about non-convergence, because there are several reasons why this may have happened (see the PROC MIXED documentation).  For such an analysis, you do not need to standardize the dependent variables.

I would change your SAS syntax to the following:

* For each kind of exercise and each dependent variable, create a separate dependent variable;

*  with the same name, RESPONSE, but indexed by the variable, VAR.;

data long;

    length var $ 12;

    set wide;

    array e{20} exerc1-exerc20;

    array dv{3} comp shear1 shear2;

    array varlist{3} $ 12 _temporary_ ("comp" "shear1" "shear2");

    do i=1 to 20;

         exercise=e{i};

         do j=1 to 3;

              response=dv{j};

              var=varlist{j};

              output long;

         end;

     end;      

     drop i j comp shear1 shear2 exerc1-exerc20;

run;

proc sort data=long;

     by subject exercise task var;

run;

  proc mixed data=long;

      class subject exercise task var;

      model response = var task exercise var*task var*exercise task*exercise var*task*exercise

           / solution ddfm=kenwardroger;

      repeated var task / type=un@un subject=subject rcorr;

  run;

  quit;

This saturated proc mixed model may not converge either.  You may have to reduce the number of independent variable terms and to simplify the variance-covariance structure.

Ming
Calcite | Level 5

Hi,

Thanks for the suggestion on the nested cell-means model.  I think that is the proper model based on your description.  The ANOVA table for each DV (comp, shear1, shear2) showed p<.0001.  Looking at the parameter estimates for each combination of task(exercise) for each DV, there are some with p>.05 and many with p<.05.  Is there a way to determine which exercise(s) can significantly explain the variance of the DVs?  Since the parameter estimates show the effect of each task(exercise) over and above the other IVs, and the F-test showed significance, doesn't that mean I need to have all IVs (task(exercise)) in the model?  The MANOVA test results (Wilks, Pillai...) all showed p<.0001, so what does this mean?  That all task(exercise) combinations are important in the model, so I can't trim down the 20 exercises to a smaller number?

thanks a lot for your help!

ming

1zmm
Quartz | Level 8

The ANOVA models for each DV show that some TASK(EXERCISE) combinations statistically significantly predicted individual DVs.  The significance probabilities for the parameter estimates for these combinations indicate which of these combinations statistically significantly predicted these DVs.  These combinations are the likely ones to analyze further.  However, it is possible that some exercises "affected" some DVs only through some but not all tasks and that other exercises affected the same DVs through the same tasks or through other tasks.  Thus, one exercise may not be better than another in predicting a specific DV.

The R-squared statistic is printed with each ANOVA table.  Sometimes the quantity, 1 - Wilks' lambda, is interpreted as a multivariate counterpart for the R-squared statistic:  In this example, this quantity would "explain" the proportion of the variance of the three DVs "due to" the TASK(EXERCISE) combination.  However, in small samples, this quantity may be biased and misleading.

The statistically significant MANOVA results for the TASK(EXERCISE) combinations indicate that the pattern of means of the three DVs differed statistically significantly across those specific TASK(EXERCISE) combinations.  You might try plotting the means of these DVs by the different combinations to show where they might differ.

The partial correlation matrix from the error SSCP matrix shows the magnitude and the significance probability of the correlations between pairs of the dependent variables after accounting for the TASK(EXERCISE) combination.

Since I have not seen your program's output, it may be that some exercises significantly predict some of your DVs only when combined with specific tasks but not other DVs and that other exercises significantly predict the same DVs when combined with either the same tasks or with other tasks.  So, you may not be able to trim down the number of exercises.

Ming
Calcite | Level 5

Hi,

After sorting out the significance probabilities for the parameter estimates for these combinations, the results showed that the following task(exercise) combinations are not significant for the DVs:

For comp, ASYM(all 20 exercises), CHOP(all 20 exercises), and SYMM(all 20 exercises)

For shear1, HOH(all 20 exercises) and SYMM(all 20 exercises)

For shear2, PUSH(all 20 exercises) and SYMM(all 20 exercises)

It looks like it is very task dependent: for task SYMM, no exercise is useful for explaining the variance in the 3 DVs, and, depending on the DV, certain other tasks do not help in explaining the variance either.  So there is no eliminating any of the 20 exercises.  Am I on the right track?

Here are the interaction plots of the 3 DVs.  I'm not sure what to make out of these.

comp.png

shear1.png

shear2.png

As for the MANOVA results, Wilks' lambda is 0.03786, so this would mean that task(exercise) explains ~4% of the variance of the 3 DVs.  That's not very much.

manova.png

thanks for your help!

ming

1zmm
Quartz | Level 8

Actually, the multivariate statistic corresponding to the R-squared statistic ("proportion of variance explained") is 1 - Wilks' lambda, so that the 180 [=9 tasks * 20 exercises] TASK(EXERCISE) combinations explain slightly more than 96% of the variability in the dependent variables.

However, you are correct that it is difficult to understand what is going on with the results in this experiment.  You are also correct that none of the exercises in the SYMM task appear to affect any of the three dependent variables.  None of the exercises for a few of the other tasks appear to affect specific dependent variables.

Can you generate interaction plots with the dependent variable on the Y-axis and the different exercises (instead of the tasks) on the X-axis?  Perhaps such plots will provide a different perspective on the relationships between the TASK(EXERCISE) combinations and these dependent variables.
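
As a worked check of the arithmetic, using the Wilks' lambda of 0.03786 reported earlier in the thread and the standard definition in terms of the error SSCP matrix E and the hypothesis SSCP matrix H (the symbols E and H are notation introduced here, not taken from the posted output):

```latex
\Lambda = \frac{\det(\mathbf{E})}{\det(\mathbf{H} + \mathbf{E})},
\qquad
1 - \Lambda = 1 - 0.03786 = 0.96214 \approx 96.2\%
```

A small lambda therefore means that the error variation is small relative to the total, which is why a value near 0.04 corresponds to roughly 96% "explained" rather than 4%.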

Ming
Calcite | Level 5

The interaction plots with exercise on the X-axis will give all zeros for every exercise (i.e., a horizontal line at 0), because the exercise doesn't change the DVs; it's the tasks that change the DVs.

Ming
Calcite | Level 5

Hi,

I'm also trying the PROC MIXED doubly multivariate approach.  If I use your code above, SAS runs for over 30 min and then reports "insufficient memory".  So I deleted "exercise" from the CLASS statement; it runs much faster, but with this message...

NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive definite estimated R matrix for subject S02.

here are the only changes I made...

proc sort data=long;

   by subject task var;

proc mixed data=long;

   class subject task var;

   model response = var task|exercise /solution ddfm=kenwardroger;

   repeated var task / type=un@un subject=subject rcorr;

run;

thanks for your help!

ming

1zmm
Quartz | Level 8

Deleting EXERCISE from the CLASS statement means that SAS interprets the TASK|EXERCISE "independent variable" in the MODEL statement as consisting of a nominal variable, TASK; a continuous variable, EXERCISE; and an interaction term between a nominal variable, TASK, and a continuous variable, EXERCISE.  Since EXERCISE is not a continuous variable but a nominal variable, this model does not make any sense even if it runs much faster than the original model with EXERCISE as a nominal variable.

The PROC MIXED documentation describes several techniques to reduce the running time.  One possibility is to change the variance-covariance matrix TYPE in the REPEATED statement from TYPE=UN@UN to TYPE=UN@CS, which will reduce the number of parameters to estimate for this matrix at the expense of assuming a single compound symmetry (CS) parameter to model the variability in the TASKs.  A second possibility is to analyze your data in pieces using a BY-variable (for example, BY EXERCISE), though this has its own problems.
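
The two suggestions could look roughly as follows.  This is a sketch only, not a tested analysis: it reuses the data set and variable names from the earlier code in this thread, it assumes EXERCISE holds the exercise index (1 to 20) rather than the measured score, and the reduced MODEL in the BY-group version is an illustrative assumption.

```sas
/* Option 1: same model, but UN@CS instead of UN@UN.               */
/* UN applies to the first REPEATED effect (var), CS to the second */
/* (task), so fewer covariance parameters are estimated.           */
proc mixed data=long;
    class subject exercise task var;
    model response = var task exercise var*task var*exercise
                     task*exercise var*task*exercise
          / solution ddfm=kenwardroger;
    repeated var task / type=un@cs subject=subject rcorr;
run;

/* Option 2: analyze one exercise at a time with a BY variable.    */
/* Assumes exercise is the index 1-20, not the measured value.     */
proc sort data=long;
    by exercise subject task var;
run;

proc mixed data=long;
    by exercise;
    class subject task var;
    model response = var task var*task / solution ddfm=kenwardroger;
    repeated var task / type=un@cs subject=subject;
run;
```

Neither version is guaranteed to converge or to fit in memory; they only reduce the size of the problem PROC MIXED has to solve.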

Ming
Calcite | Level 5

I see what you mean.  But "exercise", as it is defined right now ("exercise = e(i)"), is a continuous variable (i.e., the measured values of 20 exercises * 50 subjects).  Maybe I could assign another variable, "exerclist", to label the exercises (similar to varlist) and put exerclist in the CLASS statement.  But then that wouldn't use the exercise values...

any thoughts around that?

thanks.
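
One way to keep both pieces of information is to store the exercise index and the measured score in separate variables.  The sketch below adapts the DATA step posted earlier in the thread; the variable name exerc for the score is borrowed from the first PROC GLM attempt, and how (or whether) to use it as a covariate in the model is left open.

```sas
data long;
    length var $ 12;
    set wide;
    array e{20} exerc1-exerc20;
    array dv{3} comp shear1 shear2;
    array varlist{3} $ 12 _temporary_ ("comp" "shear1" "shear2");
    do i = 1 to 20;
        exercise = i;      /* nominal index 1-20, suitable for CLASS */
        exerc    = e{i};   /* measured score for that exercise       */
        do j = 1 to 3;
            response = dv{j};
            var      = varlist{j};
            output;
        end;
    end;
    drop i j comp shear1 shear2 exerc1-exerc20;
run;
```

With this layout, exercise can go in the CLASS statement as a 20-level nominal variable while exerc remains available as a continuous variable.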

Ming
Calcite | Level 5

This is the output from PROC MIXED with "exercise" in the CLASS statement.

I changed the code in the DATA step for long to use exercise=i instead of exercise=e(i).  The output is the same, except that exercise=i gives a nominal variable with 20 levels, whereas with exercise=e(i) the nominal variable shows 526 values.

error message is still the same,

ERROR: Unable to allocate sufficient memory: a request for 2760K bytes exceeded the 1254K

       available. Note that the deficit amount may not be the amount of memory needed for a

       successful run, since it does not reflect subsequent allocations by this or other

       processes.

ERROR: The SAS System stopped processing this step because of insufficient memory.

mixed.png

thanks.

ming
