09-10-2013 02:49 PM
I am using PROC SURVEYREG for the first time. I believe I have the WEIGHT, CLUSTER, and STRATA statements/variables figured out correctly, but am struggling with a couple of things:
It seems that SURVEYREG has limited capability compared to, say, PROC REG. It appears that I cannot get VIF values. Is this the case? If so, how should I assess collinearity without a great deal of effort?
Regarding residual diagnostics, I have a couple of issues. One is left-skewed residuals, and the other is a downward sloping trend in the residuals vs. predicted values plot. FYI, the Y data is left-skewed and some of the X variables are discrete, although certainly ordinal. I have tried several transformations of Y to no avail. I am most concerned with the residual plot. The only guidance I can find for this issue assumes that my data were collected over time, but this is not the case. Does it have anything to do with how the SURVEYREG procedure accounts for the sampling scheme?
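proc surveyreg data=edat.ECLSKclean;
   /* WEIGHT and CLUSTER statements are not shown in this excerpt */
   model C7R4RSCL = C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL
         / anova adjrsq clparm deff inverse xpx;
   strata C7TCWPSU / list;
   /* Save predicted values and residuals for the diagnostics below */
   output out=edat.ECLSKresids predicted=fitted residual=resids;
run;
/* Distribution of the residuals */
proc univariate data=edat.ECLSKresids plots;
   histogram resids / normal;
   probplot resids / normal(mu=est sigma=est);
run;
/* Residuals (Y-axis) vs. predicted values */
proc gplot data=edat.ECLSKresids;
   plot resids*fitted;
run;
quit;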
09-14-2013 01:32 PM
For collinearity diagnostics among the independent variables, use the COLLIN option on PROC REG's MODEL statement. It prints condition indexes for the set of independent variables, along with the variance proportions associated with each independent variable. Condition indexes exceeding 30 (or 10, if you do not include an intercept term) identify sets of independent variables that may be collinear if their corresponding variance proportions are close to 1 (say, above 0.5). These condition indexes and variance proportions are preferable to the VIF statistics for identifying collinearity. PROC VARCLUS also allows you to cluster correlated variables. Either way, you can then keep one variable from each set identified as collinear or correlated as an independent variable in your model, choosing it by statistical criteria or subject-matter knowledge. In the PROC REG step, you can include the respondent sampling weight in a WEIGHT statement, but you need not consider the sample design variables (the stratum variable and the cluster variable).
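A minimal sketch of that collinearity check, assuming the same data set and model as in the original post. The weight variable name samp_wt is hypothetical, since the actual weight variable is not shown in this thread; VIF is requested alongside COLLIN only so the two can be compared.

```sas
/* Collinearity diagnostics with PROC REG; design variables are not needed here */
proc reg data=edat.ECLSKclean;
   weight samp_wt;   /* hypothetical name: substitute your sampling weight */
   model C7R4RSCL = C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL
         / collin vif;
run;
quit;

/* Alternative: group correlated predictors into clusters */
proc varclus data=edat.ECLSKclean;
   var C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL;
run;
```

In the COLLIN output, look for rows with a large condition index in which two or more variables have variance proportions above roughly 0.5.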
Be sure to sort your observations by the stratum variable and the cluster variable before you run PROC SURVEYREG. The usual recommendation to "normalize" left-skewed variables is to raise them to a positive power (squaring or cubing, etc.). However, the main concern is skewing of the residuals, not the original dependent variable. If the residuals are skewed, then this may indicate outliers, influential data points, or an inadequately specified model. A linear trend of the residuals against an independent variable indicates that the model does not account for a linear effect in that independent variable.
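As a rough illustration of the two points above, the sketch below squares the response in a DATA step and then plots the residuals against a single independent variable. The data set name work.eclsk_sq and the variable name C7R4RSCL_sq are made up for this example, and W8SESL is used only because it appears in the posted model; the second step assumes the OUTPUT data set from the original post already exists.

```sas
/* Power transformation of a left-skewed response (squaring shown here) */
data work.eclsk_sq;                /* hypothetical data set name */
   set edat.ECLSKclean;
   C7R4RSCL_sq = C7R4RSCL**2;      /* hypothetical new variable */
run;

/* Residuals against one independent variable, with a smoother to reveal any trend */
proc sgplot data=edat.ECLSKresids;
   scatter x=W8SESL y=resids;
   loess   x=W8SESL y=resids / nomarkers;
   refline 0 / axis=y;
run;
```

Repeating the second step for each independent variable shows whether a trend is tied to one predictor in particular.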
09-25-2013 02:47 PM
Thank you, that is helpful. Following your advice using COLLIN, I do not have any collinearity issues. However, I still have left-skewed residuals and a downward sloping residual plot. The residual plot slopes downward for the full model with 5 independent variables. I was not able to blame any particular X variable, as running simple regression models with each X on its own produces a downward sloping residual plot. Could you, or anyone reading this, explain further the sentence "A linear trend of the residuals against an independent variable indicates that the model does not account for a linear effect in that independent variable" ? I have read this elsewhere as well, but don't know how to make my model account for any missing linear effects.
09-25-2013 04:02 PM
The SAS code you provided two weeks ago plots residuals (Y-axis) against the predicted/fitted values. If the residuals show a downward slope against each of the independent variables, I wonder if the ordinal nature of some of these independent variables may account for this: Perhaps when modelled as interval-ratio variables, these ordinal variables do not account for the linear effect of these variables. Perhaps you should model these ordinal variables not as interval-ratio variables but instead using indicator variables created in a prior DATA step or as nominal variables using the PROC SURVEYREG CLASS statement.
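A sketch of the CLASS statement approach, assuming for illustration only that C7SDQRDC is one of the ordinal variables (the thread has not yet said which ones are). The WEIGHT and CLUSTER statements are left out because their variables are not shown in the posted code; they would need to be added as in the actual analysis.

```sas
/* Treat an ordinal predictor as nominal via the CLASS statement */
proc surveyreg data=edat.ECLSKclean;
   class C7SDQRDC;                 /* assumption: this is an ordinal predictor */
   model C7R4RSCL = C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL
         / solution;               /* print estimates for each CLASS level */
   strata C7TCWPSU;
   output out=work.class_resids predicted=fitted residual=resids;  /* hypothetical output name */
run;
```

Comparing the residual plot from this fit with the original one would show whether the interval-scale coding is behind the trend.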
On another issue, can you explain why the residuals are so left-skewed?
10-01-2013 03:35 PM
I wasn't very clear about the variables. The Y variable is a reading achievement score, which is continuous. These scores are left-skewed, which probably explains the left-skewed residuals. Frankly, the graphs don't look terrible. The outliers are not extreme. The formal hypothesis tests reject the null hypothesis of normality, though.
The first X variable is reading interest/competence, which is somewhat discrete. The first column contains the reading interest/competence scores, and the second column shows the frequencies. As discrete as it is, it would not be appropriate to create dummy variables or treat it as a CLASS variable.
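For anyone who wants to reproduce a two-column table of this kind, one possible way is PROC FREQ, assuming the reading interest/competence variable is C7SDQRDC (the first X variable in the posted MODEL statement). The frequencies here are unweighted.

```sas
/* Distinct score values and how often each occurs (unweighted counts) */
proc freq data=edat.ECLSKclean;
   tables C7SDQRDC / nocum;
run;
```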
The other 4 X variables are essentially continuous. The data is from the ECLS-K study, if you happen to be familiar with it.