SAS Support Communities

1zmm · ‎01-07-2020

1zmm · ‎09-25-2013

The SAS code you provided two weeks ago plots the residuals (Y-axis) against the predicted/fitted values. If the residuals show a downward slope against each of the independent variables, I wonder whether the ordinal nature of some of these independent variables accounts for this: when modelled as interval-ratio variables, these ordinal variables may not fully capture their effect on the dependent variable. Perhaps you should model these ordinal variables not as interval-ratio variables but as indicator variables created in a prior DATA step, or as nominal variables listed in the PROC SURVEYREG CLASS statement. On another issue, can you explain why the residuals are so left-skewed?
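
A minimal sketch of the CLASS-statement alternative mentioned above. All names (MYSURVEY, STRAT, PSU, WT, Y, EDUC, AGE) are hypothetical placeholders, not names from the original thread:

```sas
* Hypothetical names: MYSURVEY, STRAT, PSU, WT, Y, EDUC (ordinal), AGE (continuous);
proc surveyreg data=mysurvey;
   strata strat;
   cluster psu;
   weight wt;
   * Treat the ordinal variable as nominal: one coefficient per level;
   class educ;
   * SOLUTION prints the estimates for the individual EDUC levels;
   model y = educ age / solution;
run;
```

Comparing these level-specific coefficients with the single slope from the interval-ratio coding is one way to see whether the assumption of equal spacing between levels is reasonable.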

1zmm · ‎09-24-2013

I don't know the answer to your question. Another article deals directly with it (Statistics in Medicine 24:3089-3110). However, with PROC LIFETEST, you might consider two things: the FREQ statement option, NOTRUNCATE, which prevents truncation of non-integer frequency values (your weights), and the use of Wilcoxon scores (=n_i) instead of the default log-rank test scores (=1), so that these weights are incorporated in the estimation.
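
A sketch of how these two suggestions might look together. The data set and variable names (SURV, T, STATUS, GRP, W) are assumptions for illustration only:

```sas
* Hypothetical names: SURV, T (time), STATUS (0 = censored), GRP, W (non-integer weight);
proc lifetest data=surv;
   time t*status(0);
   * Request both tests so that the Wilcoxon result can be compared with the log-rank result;
   strata grp / test=(logrank wilcoxon);
   * Keep fractional frequency values instead of truncating them to integers;
   freq w / notruncate;
run;
```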

1zmm · ‎09-24-2013

You might also use PROC FASTCLUS:
1. Use the SEED= option of the PROC FASTCLUS statement to include a data set of observations around which you want other "new" observations to cluster;
2. Use the DATA= option of the PROC FASTCLUS statement to include a data set of the "new" observations to be clustered;
3. Set the MAXITER option of the PROC FASTCLUS statement to zero (MAXITER=0) to prevent the procedure from changing the original central "seed" observations (see #1 above).
Since PROC FASTCLUS is designed for interval/ratio variables, you can only incorporate nominal variables by creating separate clusters of observations within each category of the nominal variables. To do this, sort by the nominal variables beforehand, and use the BY statement in PROC FASTCLUS.
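
The three steps above can be sketched as follows. The names CENTERS, NEWOBS, X1-X3, REGION, and ASSIGNED are hypothetical, MAXCLUSTERS=5 assumes five seed observations, and REPLACE=NONE is my addition (not part of the original suggestion) to stop the procedure from swapping seeds:

```sas
* Hypothetical names: CENTERS (seed observations), NEWOBS (observations to assign);
* X1-X3 are interval/ratio variables and REGION is a nominal variable;
proc sort data=newobs;
   by region;
run;

* MAXCLUSTERS= should be at least the number of seed observations (5 is assumed here);
proc fastclus data=newobs seed=centers maxclusters=5
              maxiter=0 replace=none out=assigned;
   var x1-x3;
   by region;
run;
```

The OUT= data set should contain a CLUSTER variable giving the nearest seed for each new observation and a DISTANCE variable giving the distance to it. If CENTERS does not contain REGION, the same seeds are used within every BY group.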

1zmm · ‎09-24-2013

Removing a lower order "main effect" term that does not statistically significantly affect a dependent variable in a model when this term's interaction with another main effect term does statistically significantly affect the dependent variable is usually not recommended. Doing so assumes that the regression coefficient term of the non-significant main effect term equals zero, which usually one does not have prior evidence of. The following reference includes arguments about this issue and cites earlier references on this topic: Nelder JA. The selection of terms in response-surface models--How strong is the weak heredity principle? The American Statistician 1998 Nov;52(4):315-318. These arguments hold true in the situation where you are trying to select/identify explanatory independent variables. They may be irrelevant in the situation of prediction, where the selection of explanatory variables is less relevant. For example, a model containing only the highest order interaction term ("cell means" model) may be perfectly suitable because the interest is in the effect on the dependent variable of the multiplicative combinations of terms rather than in this effect of the individual terms comprising the combinations.

1zmm · ‎09-14-2013

For collinearity diagnostics among the independent variables, use PROC REG's MODEL statement option, COLLIN. This will write out condition indexes for a set of independent variables and the corresponding variance proportions for each of the independent variables. Condition indexes exceeding 30 (or 10, if you do not include an intercept term) identify sets of independent variables that may be collinear if their corresponding variance proportions are close to 1.000 (say, above 0.5). These condition indexes and variance proportions are preferable to the VIF statistics in identifying collinearity. PROC VARCLUS also allows you to cluster correlated variables. Either way, you can select one of the variables in a set identified as collinear or correlated as an independent variable for your model using either statistical criteria or subject-matter knowledge. In the PROC REG "paragraph", you can include the respondent sampling weight in a WEIGHT statement, but you need not consider the sample design variables, the stratum variable or the cluster variable. Be sure to sort your observations by the stratum variable and the cluster variable before you run PROC SURVEYREG. The usual recommendation to "normalize" left-skewed variables is to raise them to a power greater than one (squaring, cubing, etc.). However, the main concern is skewing of the residuals, not of the original dependent variable. If the residuals are skewed, then this may indicate outliers, influential data points, or an inadequately specified model. A linear trend of the residuals against an independent variable indicates that the model does not account for a linear effect in that independent variable.
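
A rough sketch of the two checks described above: the collinearity diagnostics in PROC REG, and a look at the skewness of the PROC SURVEYREG residuals. The names MYSURVEY, WT, STRAT, PSU, Y, and X1-X5 are placeholders, not names from the thread:

```sas
* Hypothetical names: MYSURVEY, WT, STRAT, PSU, Y, X1-X5;
* Collinearity diagnostics: only the sampling weight is used here;
proc reg data=mysurvey;
   weight wt;
   model y = x1-x5 / collin;
run;
quit;

* Sort by the design variables before the survey analysis;
proc sort data=mysurvey;
   by strat psu;
run;

* Save residuals and predicted values from the design-based model;
proc surveyreg data=mysurvey;
   strata strat;
   cluster psu;
   weight wt;
   model y = x1-x5;
   output out=resid residual=r predicted=p;
run;

* Inspect the distribution of the residuals for skewness and outliers;
proc univariate data=resid;
   var r;
   histogram r;
run;
```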

1zmm · ‎09-08-2013

In a 2-by-2 table where a=the number where observer #1 records a positive value and observer #2 records a positive value; b=the number where observer #1 records a positive value and observer #2 records a negative value; c=the number where observer #1 records a negative value and observer #2 records a positive value; d=the number where both observers #1 and #2 record a negative value; and N=a + b + c + d, then prevalence index = (a - d)/N, bias index = (b - c)/N, proportion of agreement = (a + d)/N, and PABAK = 2*(a + d)/N - 1. The prevalence index measures how much the proportion of positive results differs from 0.50. The large negative prevalence index for your data implies that this proportion of positive results does differ substantially from 0.50, specifically, that d is much larger than a. The bias index describes how much the two observers differ on the proportion of positive results. The bias index for your data, -0.04, is close to zero, indicating that the two observers do NOT differ very much on the proportion of positive results. The large negative value of the prevalence index implies that your observers rated a large proportion of the results as negative, and the large proportion of agreement on these negative results, 0.94, indicates that both observers agreed on these negative results. However, the small proportion of agreement on the positive results, 0.00, indicates that the observers did NOT agree at all on the positive results. Therefore, the original kappa statistic probably does summarize your results best of all by averaging the overall agreement on positive and negative results, correcting for chance agreement: The kappa statistic indicates no agreement between the observers on their ratings better than chance. The prevalence index shows why this is so: You don't have enough situations with positive responses where both raters can provide a rating. Enrich your sample with situations that increase the proportion of positive responses. 
Finally, I don't think the PABAK statistic is informative here because it adjusts both for the bias and the prevalence without indicating how to remedy the problem. The PABAK statistic is a function of the percentage of observed agreement and does not provide any further information than that. To answer your question, the risk communication between physician and patient was NOT present, statistically significantly more than one might expect by chance. Further studies would require more situations in which positive responses by the raters could be evaluated.
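
For illustration, the indices can be computed in a short DATA step. The cell counts below are hypothetical: they were chosen only to be consistent with the values quoted above (bias index -0.04, proportion of agreement 0.94, no agreement on positive results) and are not the original poster's data. PABAK is computed as twice the observed agreement minus one:

```sas
* Hypothetical counts, chosen only to be consistent with the indices quoted above;
data indices;
   a = 0;    * both observers positive;
   b = 1;    * observer 1 positive, observer 2 negative;
   c = 5;    * observer 1 negative, observer 2 positive;
   d = 94;   * both observers negative;
   n = a + b + c + d;
   prevalence_index = (a - d) / n;
   bias_index       = (b - c) / n;
   p_observed       = (a + d) / n;
   pabak            = 2 * p_observed - 1;
   * Chance-expected agreement from the marginal totals;
   p_expected = ((a + b)*(a + c) + (c + d)*(b + d)) / n**2;
   kappa      = (p_observed - p_expected) / (1 - p_expected);
run;

proc print data=indices noobs;
run;
```

With these counts the observed agreement (0.94) is almost identical to the chance-expected agreement (0.941), so kappa comes out slightly below zero even though PABAK is 0.88, which illustrates why PABAK adds little beyond the observed agreement.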

1zmm · ‎09-07-2013

I agree with Reeza. See the following paper from the 2009 SAS User Group Proceedings: http://support.sas.com/resources/papers/proceedings09/242-2009.pdf.

1zmm · ‎09-05-2013

This output shows that the results for only 32 study subjects and only 855 out of 1,455 observations were used in the PROC MIXED analysis. This loss seems somewhat excessive. In previous abbreviated lists of data on this Internet site, a few variables had missing values. Do you think that this is why so many observations were excluded from the analysis? It may be worthwhile to check the pattern of missing dependent variables and independent variables to determine whether you can exclude some variables with large percentages of missing values from the analysis. For example, you can display the pattern of missing values by using the procedure, PROC MI, as follows:
proc mi nimpute=0;
var y var task height weight exerc1-exerc20;
run;
quit;
I followed the PROC MIXED documentation for the MODEL statement syntax including the VAR*TASK interaction as an independent variable (see the documentation for the REPEATED statement option, TYPE, for multivariate repeated measures). However, if the model that includes this VAR*TASK interaction does NOT converge, and if the model that excludes this interaction DOES converge, then the latter model may be the way to go. Another alternative model would be to use the MODEL statement that includes only the VAR*TASK interaction without including the main-effect terms, VAR or TASK, as independent variables (the "cell means" model):
model response=var*task height weight . . . .;
I prefer to use the condition number rather than the VIF statistic in PROC REG to identify groups of highly correlated independent variables. A condition number of 30 or more (10 or more in models without intercept terms) and independent variables with relatively large variance proportions will identify such group(s). 
Then, from subject matter knowledge or from the use of PROC VARCLUS to group highly correlated variables, you can select one or a few variables within each group to represent all the highly correlated variables in the group for use as independent variables in subsequent modelling. Since only one of the 20 exercises (exerc7) remained statistically significantly associated with the response after adjusting for VAR and TASK, this specific exercise may or may not be substantively associated with the response (perhaps a multiple comparison problem); perhaps subject-matter knowledge about what this exercise represents will help you to decide whether to keep it in the model or not. However, to decide whether or not to include it in your assessment of the value of these exercises in predicting the response by comparing the information criteria of different models, I would compare the model having only VAR, TASK, WEIGHT, and HEIGHT as independent variables with the model having VAR, TASK, WEIGHT, HEIGHT, and all 20 of the exercise variables (not just EXERC7) because you didn't have any a priori reason to select only EXERC7.
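
One possible way to set up the comparison described above is to capture the information criteria from both models with ODS. This is only a sketch: the data set LONG and the variable names follow the code posted elsewhere in this thread, and METHOD=ML is my assumption, added because likelihood-based criteria from the default REML fit are not comparable between models with different fixed effects:

```sas
* Sketch only: LONG, RESPONSE, VAR, TASK, SUBJECT are assumed from earlier posts;
* Model without the exercise variables;
ods output InfoCrit=ic_base;
proc mixed data=long ic method=ml;
   class subject task var;
   model response = var task var*task height weight;
   repeated var task / type=un@cs subject=subject;
run;

* Model with all 20 exercise variables;
ods output InfoCrit=ic_full;
proc mixed data=long ic method=ml;
   class subject task var;
   model response = var task var*task height weight exerc1-exerc20;
   repeated var task / type=un@cs subject=subject;
run;

* Stack the two tables of information criteria for a side-by-side look;
data ic_compare;
   length model $ 8;
   set ic_base(in=inbase) ic_full;
   if inbase then model = "base";
   else model = "full";
run;

proc print data=ic_compare noobs;
run;
```

The comparison is only meaningful if both models are fitted to the same observations, so observations with missing exercise values may need to be excluded from both fits.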

1zmm · ‎09-04-2013

Your experiment does not really have 934 subjects, only 50. Back on August 22nd, I sent some code that might be modified (see below) to work without error messages about infinite likelihoods. Although PROC MIXED does not print an R-squared statistic as PROC GLM does, the IC option on the PROC MIXED statement will have SAS print various information criteria that allow you to compare the two approximately nested models (without and with the values for the twenty exercise variables). I also include a PROC REG paragraph to check for collinearity among the exercise variables and a PROC VARCLUS paragraph to see if the exercise variables group into correlated variable clusters.
Matthew Zack
=========================================================================================================================
* The first two of these procedures assume that only one observation per study subject is selected;
* and that this observation contains the values for all 20 exercise variables.;
* If this is false, then select only one observation per study subject in prior code (see below).;
* Sort the original, WIDE data set;
* by study subject;
proc sort data=wide;
   by subject;
run;
* Select only one observation per study subject into a new data set;
* for analysis of the collinearity statistics and variable clustering;
* of the exercise variables.;
data wide2;
   set wide;
   by subject;
   if (first.subject eq 1) then output wide2;
run;
* Check the exercise variables for collinearity.;
* Because these collinearity diagnostics are relevant only to the independent variables,;
* the value of the dependent variable is not relevant.;
proc reg data=wide2;
   model comp=exerc1-exerc20 / collin;
   title10 "Collinearity diagnostics for the 20 exercise variables";
run;
quit;
* Check if the exercise variables group into clusters of correlated variables;
proc varclus data=wide2;
   var exerc1-exerc20;
   title10 "Check if the exercise variables group into clusters of correlated variables";
run;
* For each dependent variable, use the original, WIDE data set to create a separate dependent variable;
* with the same name, RESPONSE, but indexed by the variable, VAR.;
data long(drop=j comp shear1 shear2);
   length var $ 12;
   set wide;
   array dv{3} comp shear1 shear2;
   array varlist{3} $ 12 _temporary_ ("comp" "shear1" "shear2");
   do j=1 to 3;
      response=dv{j};
      var=varlist{j};
      output long;
   end;
run;
* Sort data;
* by subject, task, and distinct dependent variable;
proc sort data=long;
   by subject task var;
run;
* Model dependent variables as multivariate repeated measures;
* Determine the effects of the task on the response;
* May vary the REPEATED statement variance-covariance matrix from TYPE=UN@CS to TYPE=UN@UN;
proc mixed data=long ic;
   class subject task var;
   model response = var task var*task / solution ddfm=kenwardroger;
   repeated var task / type=un@cs subject=subject rcorr;
run;
quit;
* Model dependent variables as multivariate repeated measures;
* Determine the effects of the task and the exercise values on the response;
* May vary the REPEATED statement variance-covariance matrix from TYPE=UN@CS to TYPE=UN@UN;
proc mixed data=long ic;
   class subject task var;
   model response = var task var*task exerc1-exerc20 / solution ddfm=kenwardroger;
   repeated var task / type=un@cs subject=subject rcorr;
run;
quit;

1zmm · ‎09-02-2013

Since the values for each of the exercises and for each of the tasks are connected to the three dependent variables only through their occurring in the same study subject, the tasks are not really nested within the exercises or vice-versa. If you return to the original "short-but-wide" (few observations but many variables) data set with each of the values of the 20 exercises as a separate variable, you could include these exercises as independent variables in your model:
* Simple MANOVA model for each task alone;
proc glm;
class task;
model comp shear1 shear2=task / solution;
manova h=_all_ / printh printe summary;
run;
quit;
* Add in the 20 exercise variables;
proc glm;
class task;
model comp shear1 shear2=task exerc1-exerc20 / solution;
manova h=_all_ / printh printe summary;
run;
quit;
You can compare the r-squared statistic for each task alone on each of the three dependent variables or the statistic, 1 - Wilks' lambda, for the r-squared statistic for each task alone on all three of the dependent variables. Then see how much these r-squared statistics change when the 20 exercise variables are added to the model. One problem with this approach is that you have very many independent variables (an intercept term, seven indicator variables for the 8-category task variable, and 20 exercise variables) for relatively few study subjects (N=50); the regression coefficients for these independent variables may be very imprecise. PROC GLM will also delete observations when any of the dependent or independent variables are missing so that these analyses may be based on even fewer observations. This approach also does not account for the fact that the observations for the tasks are correlated observations within each study subject; you'd have to use PROC MIXED again with the doubly multivariate approach to account for these within-subject correlations (see previous discussions). 
A final problem with this approach is that it does not account for possible collinearity among the twenty exercise values. You could check this out by using the collinearity diagnostics available in PROC REG. Select only one observation from each study subject, and include all the exercises as independent variables in a model with any one of the three dependent variables as the dependent variable:
proc reg;
model comp=exerc1-exerc20 / collin;
run;
quit;
If any of the condition indexes in the collinearity diagnostics are 30 or more, then variables with "large" variance proportions are probably close to collinear. Then, only one of these collinear exercise variables needs to be in the model. PROC VARCLUS also "clusters" correlated variables, of which you can select one to represent all variables in a given cluster.

1zmm · ‎08-31-2013

Steve Denham's suggestion about using the "best-subset" selection algorithm for independent variables in PROC LOGISTIC would give you a good clue about "important" independent variables. Also consider PROC GLMSELECT that selects "good" sets of independent variables for models that are less affected by the biases in the usual forward and backwards stepwise selection methods. Given that you have more than 30 independent variables, this implies more than one billion possible models; thus, using exhaustive searches through macros that successively select sets of independent variables is probably less feasible than the above two alternatives. You may consider reducing the number of independent variables by using a method like PROC VARCLUS to "cluster" the independent variables and by then selecting one or a few of these variables to represent a given variable cluster. Finally, you have the problem of selecting an appropriate variance-covariance/correlation matrix among the repeated measures. This compounds the selection problem you have.
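
A minimal sketch of the best-subset and variable-clustering suggestions. The names MYDATA, Y, and X1-X30 are hypothetical, the outcome is assumed to be binary and coded 0/1, and this screening step ignores the correlation among the repeated measures:

```sas
* Hypothetical names: MYDATA, Y (binary outcome coded 0/1), X1-X30;
* Best-subset selection by the score statistic: the best 3 models of each size from 1 to 10;
proc logistic data=mydata;
   model y(event='1') = x1-x30 / selection=score best=3 start=1 stop=10;
run;

* Group the independent variables into clusters of correlated variables;
proc varclus data=mydata maxeigen=0.7 short;
   var x1-x30;
run;
```

With 30 candidate variables there are 2**30 = 1,073,741,824 possible subsets, which is where the "more than one billion" figure above comes from.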

1zmm · ‎08-31-2013

I really don't understand your data. Within a particular subject at the same degree of RESPONSE, every variable except EXERCISE has the same value (for example, in subject S02). Because EXERCISE can vary by almost 100-fold (for example, from 8.73 to 831), although the value of the RESPONSE does not change (for example, COMP=6,641), EXERCISE should have no predictive power as an independent variable on this RESPONSE (and other responses as well). Perhaps you should model the RESPONSEs (COMP, SHEAR1, and SHEAR2) on the other independent variables that do NOT change--TASK, GROUP, TIME, HEIGHT, and WEIGHT. I do not see how the values of EXERCISE affect RESPONSE. I also don't understand what these values of EXERCISE represent since they appear to change while the values of the variables that do NOT change remain the same.

1zmm · ‎08-28-2013

After all of this discussion, I did not realize that exercise was a continuous variable. Your first message describing your question (August 14, 2013 at 4:41 pm) describes your 50 subjects as performing "20 different TYPES of exercise (vertical jump, . . .)", implying that exercise is a categorical/nominal variable with 20 different values. Now, it appears that exercise is a continuous value that measures the performance (in some units) on these different types of exercise. This distinction between a nominal and a continuous value is important because it would make a difference in the model. Obviously, continuous variables should NOT be placed in a CLASS statement but should be placed as a covariate/independent variable in the MODEL statement. Since you have already excluded EXERCISE from the CLASS statement (yesterday afternoon's message), that appears to be the appropriate approach to take. The error message/note that you received about an "infinite likelihood. . . because of a nonpositive definite estimated R matrix" is, according to the PROC MIXED documentation, "usually no cause for concern if the iterations continue". It may indicate that "observations from the same subject are producing identical rows" in the R matrix; that is, the same subject has duplicate values of the repeated effects (VAR and TASK) in the PROC MIXED REPEATED statement. Check if and why this might be so. You might also use the R and the RCORR options of the REPEATED statement to print/display the first two blocks of the repeated-measures variance-covariance matrix and its corresponding correlation matrix (see the PROC MIXED REPEATED statement documentation: for example, R=1,2 and RCORR=1,2). 
To reduce memory requirements for PROC MIXED, you can also use the two other tips I suggested for reducing the time the procedure takes: Change the multivariate variance-covariance TYPE in the REPEATED statement from TYPE=UN@UN to TYPE=UN@CS, or make the categorical version of EXERCISE into a BY-variable and analyze the effect on the DVs through different TASKs of each kind of EXERCISE separately.
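
The duplicate check and the R/RCORR display suggested above might be sketched as follows. The data set LONG and the variables SUBJECT, TASK, VAR, RESPONSE, and EXERCISE are assumed from the earlier messages in this thread:

```sas
* Sketch only: LONG, SUBJECT, TASK, VAR, RESPONSE, EXERCISE are assumed names;
* List any subject-by-task-by-variable combination that occurs more than once;
proc freq data=long noprint;
   tables subject*task*var / out=combos(where=(count > 1));
run;

proc print data=combos;
run;

* Print the first two blocks of the R matrix and of its correlation matrix;
proc mixed data=long;
   class subject task var;
   model response = var task var*task exercise / solution;
   repeated var task / type=un@cs subject=subject r=1,2 rcorr=1,2;
run;
```

If the COMBOS data set is not empty, those combinations are the likely source of the identical rows mentioned in the note.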

1zmm · ‎08-27-2013

Deleting EXERCISE from the CLASS statement means that SAS interprets the TASK|EXERCISE "independent variable" in the MODEL statement as consisting of a nominal variable, TASK; a continuous variable, EXERCISE; and an interaction term between the nominal variable, TASK, and the continuous variable, EXERCISE. Since EXERCISE is not a continuous variable but a nominal variable, this model does not make any sense even if it runs much faster than the original model with EXERCISE as a nominal variable. The PROC MIXED documentation describes several techniques to reduce the running time. One possibility is to change the variance-covariance matrix TYPE in the REPEATED statement from TYPE=UN@UN to TYPE=UN@CS, which will reduce the number of parameters to estimate for this matrix at the expense of assuming a single compound symmetry (CS) parameter to model the variability in the TASKs. The second possibility is to analyze your data in pieces using a BY-variable (for example, BY EXERCISE), though this has its own problems.



Re: 2 dimensional table with control of the order of categories

Re: PROC SURVEYREG questions

Re: Creating Adjusted Survival Curves

Re: Nearest neighbour between two datasets

Re: Significant interaction but one main effect not sig

Re: PROC SURVEYREG questions

Re: Kappa statistics for inter-rater reliability

Re: Kappa statistics for inter-rater reliability

Re: help with trimming independent variables in a confusing experiment...

Re: help with trimming independent variables in a confusing experiment...

Re: Sample Size for Survival using Historical Control

Re: PROC SURVEYREG questions

Re: PROC SURVEYREG questions

Re: determine significant cut-off points

Re: Sample size for a rare event study

Re: 2 dimensional table with control of the order of categories

Re: PROC SURVEYREG questions

Re: Creating Adjusted Survival Curves

Re: Nearest neighbour between two datasets

Re: Significant interaction but one main effect not sig

Re: PROC SURVEYREG questions

Re: Kappa statistics for inter-rater reliability

Re: Kappa statistics for inter-rater reliability

Re: help with trimming independent variables in a confusing experiment...

Re: help with trimming independent variables in a confusing experiment...

Re: help with trimming independent variables in a confusing experiment...

Re: Model selection using proc genmod

Re: help with trimming independent variables in a confusing experiment...

Re: help with trimming independent variables in a confusing experiment...

Re: help with trimming independent variables in a confusing experiment...
