kirkwanderson
Fluorite | Level 6

Hello,

I am using PROC SURVEYREG for the first time.  I believe I have the WEIGHT, CLUSTER, and STRATA statements/variables figured out correctly, but am struggling with a couple of things:

It seems that SURVEYREG has limited capability compared to, say, PROC REG.  It appears that I cannot get VIF values.  Is this the case?  If so, how should I assess collinearity without a great deal of effort?

Regarding residual diagnostics, I have a couple of issues.  One is left-skewed residuals, and the other is a downward sloping trend in the residuals vs. predicted values plot.  FYI, the Y data is left-skewed and some of the X variables are discrete, although certainly ordinal.  I have tried several transformations of Y to no avail.  I am most concerned with the residual plot.  The only guidance I can find for this issue assumes that my data were collected over time, but this is not the case.  Does it have anything to do with how the SURVEYREG procedure accounts for the sampling scheme? 

Thanks,

Kirk

/* Design-based regression; residuals and fitted values are saved for diagnostics */
proc surveyreg data=edat.ECLSKclean;
   model C7R4RSCL = C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL
         / anova adjrsq clparm deff inverse xpx;
   cluster C67CSTR;
   strata C7TCWPSU / list;
   weight C7CW0;
   output out=edat.ECLSKresids predicted=fitted residual=resids;
run;

/* Distribution of the residuals: histogram and normal probability plot */
proc univariate data=edat.ECLSKresids plots;
   var resids;
   histogram resids / normal;
   probplot resids / normal(mu=est sigma=est);
run;

/* Residuals versus predicted values */
proc gplot data=edat.ECLSKresids;
   plot resids*fitted;
run;
quit;

6 REPLIES
1zmm
Quartz | Level 8

For collinearity diagnostics among the independent variables, use the COLLIN option on PROC REG's MODEL statement.  This writes out condition indexes for the set of independent variables, along with the variance proportions for each independent variable.  Condition indexes exceeding 30 (or 10, if you do not include an intercept term) identify sets of independent variables that may be collinear if their corresponding variance proportions are close to 1.000 (say, above 0.5).  These condition indexes and variance proportions are preferable to the VIF statistics for identifying collinearity.  PROC VARCLUS also allows you to cluster correlated variables.  Either way, from each set identified as collinear or correlated you can select one variable to keep as an independent variable in your model, using either statistical criteria or subject-matter knowledge.  In the PROC REG step, you can include the respondent sampling weight in a WEIGHT statement, but you need not consider the sample design variables (the stratum variable or the cluster variable).
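A minimal sketch of that PROC REG step, reusing the data set, weight, and model variables from the original post. This is only an illustration of the COLLIN option with a WEIGHT statement, not a tested run on the poster's data.

```sas
/* Collinearity diagnostics for the predictors, ignoring strata and clusters. */
/* Data set and variable names are taken from the original post.              */
proc reg data=edat.ECLSKclean;
   weight C7CW0;                       /* respondent sampling weight          */
   model C7R4RSCL = C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL
         / collin;                     /* condition indexes and variance proportions */
run;
quit;
```

In the "Collinearity Diagnostics" table, look at the rows with the largest condition indexes and check whether two or more variables have variance proportions above roughly 0.5 in the same row.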

Be sure to sort your observations  by the stratum variable and the cluster variable before you run PROC SURVEYREG.  The usual recommendation to "normalize" left-skewed variables is to raise them to a positive power (squaring or cubing, etc.).  However, the main concern is skewing of the residuals, not the original dependent variable.  If the residuals are skewed, then this may indicate outliers, influential data points, or an inadequately specified model.  A linear trend of the residuals against an independent variable indicates that the model does not account for a linear effect in that independent variable.
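One way to look at the residuals against an individual independent variable is sketched below. It assumes the OUT= data set from the original post (edat.ECLSKresids) still carries the model variables alongside resids, and it uses C7SDQRDC only as an example; the same step would be repeated for each of the other predictors.

```sas
/* Residuals versus one predictor, with a zero reference line.              */
/* Assumes edat.ECLSKresids contains both resids and the model variables.   */
proc sgplot data=edat.ECLSKresids;
   scatter x=C7SDQRDC y=resids / transparency=0.8;  /* fade overplotted points */
   refline 0 / axis=y;                              /* residual = 0            */
run;
```

A band of points that tilts or bends away from the reference line suggests the functional form for that variable may need attention.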

kirkwanderson
Fluorite | Level 6

Thank you, that is helpful.  Following your advice using COLLIN, I do not have any collinearity issues.  However, I still have left-skewed residuals and a downward sloping residual plot.  The residual plot slopes downward for the full model with 5 independent variables.  I was not able to blame any particular X variable, as running simple regression models with each X on its own also produces a downward sloping residual plot.  Could you, or anyone reading this, explain further the sentence "A linear trend of the residuals against an independent variable indicates that the model does not account for a linear effect in that independent variable"?  I have read this elsewhere as well, but I don't know how to make my model account for any missing linear effects.

1zmm
Quartz | Level 8

The SAS code you provided two weeks ago plots residuals (Y-axis) against the predicted/fitted values.  If the residuals show a downward slope against each of the independent variables, I wonder if the ordinal nature of some of these independent variables may account for this:  Perhaps when modelled as interval-ratio variables, these ordinal variables do not account for the linear effect of these variables.  Perhaps you should model these ordinal variables not as interval-ratio variables but instead using indicator variables created in a prior DATA step or as nominal variables using the PROC SURVEYREG CLASS statement.
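As a rough sketch of the CLASS-statement route, the step below treats one predictor as nominal. Which predictors are actually ordinal is not stated at this point in the thread, so the choice of C7SDQRDC here is purely an assumption for illustration; the design statements are copied from the original post.

```sas
/* Same model as in the original post, with one predictor entered as a     */
/* classification (nominal) variable. C7SDQRDC is an illustrative choice.  */
proc surveyreg data=edat.ECLSKclean;
   class C7SDQRDC;
   model C7R4RSCL = C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL
         / solution;                  /* print estimates for each class level */
   cluster C67CSTR;
   strata C7TCWPSU;
   weight C7CW0;
run;
```

Comparing the level-by-level estimates with a straight line through the level values gives a sense of whether a single linear term is adequate for that variable.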

On another issue, can you explain why the residuals are so left-skewed?

kirkwanderson
Fluorite | Level 6

I wasn't very clear about the variables.  The Y variable is a reading achievement score, which is continuous.  These scores are left-skewed, which probably explains the left-skewed residuals.  Frankly, the graphs don't look terrible.  The outliers are not extreme.  The formal hypothesis tests reject the null hypothesis of normality, though.

The first X variable is reading interest/competence, which is somewhat discrete.  In the table below, the first column contains the reading interest/competence scores and the second column shows the frequencies.  As discrete as it is, it would not be appropriate to create dummy variables or treat it as a CLASS variable.

Score   Frequency
1         194
1.25      306
1.33        9
1.5       561
1.67       17
1.75      737
2        1028
2.25     1061
2.33       23
2.5      1215
2.67       28
2.75      914
3         971
3.25      727
3.33       19
3.5       596
3.67       14
3.75      420
4         405
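A frequency table like the one above can be produced with PROC FREQ. The sketch assumes that the reading interest/competence score is C7SDQRDC, the first predictor in the posted MODEL statement; the thread does not name the variable explicitly.

```sas
/* Unweighted frequency of each distinct score value.                      */
/* C7SDQRDC is assumed to be the reading interest/competence variable.     */
proc freq data=edat.ECLSKclean;
   tables C7SDQRDC / nocum;           /* suppress cumulative columns        */
run;
```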

The other 4 X variables are essentially continuous.  The data is from the ECLS-K study, if you happen to be familiar with it.

_maldini_
Barite | Level 11

Which options evaluate collinearity using PROC SURVEYLOGISTIC? 

SAS_Rob
SAS Employee

There really isn't a consensus as to how to compute collinearity diagnostics for complex survey data.  Some suggest that you simply use Proc REG with a WEIGHT statement and the COLLIN and VIF options on the MODEL statement since collinearity affects only the independent variables.  Others suggest that you compute a model-based VIF, etc. that includes the design effect.  One such paper that discusses this, and that you could program yourself, is linked below.

Variance inflation factors in the analysis of complex survey data (statcan.gc.ca)
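The first approach described above (PROC REG with a WEIGHT statement) might look like the following sketch, which reuses the variables from the original post. These diagnostics are computed from the weighted predictors only, so the response named in the MODEL statement does not change them; the same step can therefore be used to screen the predictors of a PROC SURVEYLOGISTIC model. The design-effect-adjusted VIF from the linked paper is not implemented here.

```sas
/* Weighted VIF, tolerance, and condition indexes for the predictors.      */
/* The response variable is only a placeholder for these diagnostics.      */
proc reg data=edat.ECLSKclean;
   weight C7CW0;
   model C7R4RSCL = C7SDQRDC C7SDQINT C7LOCUS C7CONCPT W8SESL
         / vif tol collinoint;        /* COLLINOINT: intercept adjusted out */
run;
quit;
```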
