Hi,
This is a fairly naive question: I am trying to create a regression model for somewhat skewed and clustered survey data. A professor suggested I use maximum likelihood estimation with GLS, rather than OLS, to account for some of the heteroskedasticity and autocorrelation in my data. As far as I am aware, PROC REG uses OLS, PROC GLM uses ML, and PROC MIXED uses REML. However, these methods (see code below) all seem to yield the same estimates. Why is this? Shouldn't changing the estimation method change my estimates and standard errors?
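/* Weighted least squares with PROC REG (no DATA= option, so the most recently created data set is used) */
proc reg;
   model y = x1 x2 x3 x4 x5;
   weight weights;
run;
/* Same model in PROC GLM; SOLUTION prints the parameter estimates */
proc glm;
   model y = x1 x2 x3 x4 x5 / solution;
   weight weights;
run;
/* Same fixed-effects model in PROC MIXED, estimated by REML */
proc mixed method=reml;
   model y = x1 x2 x3 x4 x5 / solution;
   weight weights;
run;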
PROC REG and PROC GLM use OLS. PROC MIXED uses maximum likelihood or REML. If you are really interested in a regression for non-normal error structures, you might want to look into PROC GLIMMIX instead of PROC MIXED.
There are cases where OLS and ML should give the same result, namely if the errors in the regression model are i.i.d. normal.
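To spell out why the two coincide, here is the standard derivation under the assumption that the model is $y_i = x_i^\top\beta + \varepsilon_i$ with $\varepsilon_i$ i.i.d. $N(0,\sigma^2)$. The symbols ($n$ observations, $p$ regression parameters including the intercept, SSE for the residual sum of squares) are my own notation, not anything taken from the posted code.

```latex
\ell(\beta,\sigma^2)
  = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\beta\right)^2

\hat\beta_{\mathrm{ML}}
  = \arg\max_{\beta}\,\ell(\beta,\sigma^2)
  = \arg\min_{\beta}\sum_{i=1}^{n}\left(y_i - x_i^\top\beta\right)^2
  = \hat\beta_{\mathrm{OLS}}

\hat\sigma^2_{\mathrm{ML}} = \frac{\mathrm{SSE}}{n},
\qquad
\hat\sigma^2_{\mathrm{OLS}} = \hat\sigma^2_{\mathrm{REML}} = \frac{\mathrm{SSE}}{n-p}
```

So the coefficient estimates are identical, and only the estimate of the error variance differs: ML divides by $n$, while OLS and REML (in a model with no random effects) divide by $n-p$.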
In your case, you say your data is skewed and clustered ... well, if the errors are i.i.d. normal, you can still use OLS. If there is heteroskedasticity, this too can be handled in PROC REG or PROC GLM using weighted least squares, which is a feature of these two PROCs.
The claim of autocorrelation makes me somewhat confused, as autocorrelation usually arises in time series data and not survey data. In survey data (as I understand the term) there is no natural ordering (like there is in time series data) so the idea of autocorrelation in survey data makes no sense to me. Can anyone explain further?
So anyway, nothing you have said to me rules out the use of OLS, although clearly there may be issues that haven't been mentioned that would indeed rule out OLS. Of course if the distribution of the errors is not normal, then maybe you want PROC GLIMMIX and not PROC MIXED, or maybe you just need to transform the data (if possible) so that the errors are normally distributed and then use OLS.
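A rough sketch of those two routes, in case it helps. Everything named here beyond the variables in the original post is an assumption for illustration: the data set name work.survey, the new variable log_y, and the choice of a gamma distribution with a log link (which only makes sense if y is strictly positive).

```sas
/* Route 1: log-transform the response, then fit by least squares.   */
/* work.survey and log_y are hypothetical names.                     */
data work.survey_log;
   set work.survey;
   if y > 0 then log_y = log(y);   /* rows with y <= 0 keep a missing log_y */
run;

proc reg data=work.survey_log;
   model log_y = x1 x2 x3 x4 x5;
   weight weights;
run;
quit;

/* Route 2: keep y on its original scale and assume a skewed         */
/* distribution. Gamma with a log link is an assumed choice, not     */
/* something the data have been shown to support.                    */
proc glimmix data=work.survey;
   model y = x1 x2 x3 x4 x5 / dist=gamma link=log solution;
run;
```

Either way, it is worth checking the residuals afterwards to see whether the chosen transformation or distribution actually fits.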
Of course, if your professor expects you to use PROC MIXED, then maybe you should ...
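As for why the posted PROC MIXED step matches PROC REG and PROC GLM: with no RANDOM or REPEATED statement, it fits the same fixed-effects model with a single residual variance, so the coefficients cannot differ. The results would only be expected to change once a covariance structure is specified. A minimal sketch, assuming a hypothetical data set work.survey and a hypothetical cluster identifier cluster_id (neither appears in the original post):

```sas
/* work.survey and cluster_id are assumed names for illustration */
proc mixed data=work.survey method=reml;
   class cluster_id;
   model y = x1 x2 x3 x4 x5 / solution;
   random intercept / subject=cluster_id;   /* one random intercept per cluster */
   weight weights;
run;
```

A REPEATED statement with a structure such as TYPE=AR(1) is the analogous route for serial correlation, but it presumes a meaningful ordering of observations within each subject.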
Thanks. My data is cross-sectional, but the survey was collected over a long period of time, so I thought time might cluster my data in some meaningful way. I will use weighted least squares to get rid of some of the clustering, and take the log of my dependent variable to account for the heteroskedasticity.
so I thought time might cluster my data in some meaningful way. I will use weighted least squares to get rid of some of the clustering
So you are going to perform clustering on your data and then eliminate the clustering?
If your data is survey data, then take a look at PROC SURVEYREG.
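Something along these lines, as a sketch only. The data set name work.survey and the cluster variable cluster_id are hypothetical; substitute whatever identifies the sampling clusters in your design.

```sas
/* work.survey and cluster_id are assumed names for illustration */
proc surveyreg data=work.survey;
   cluster cluster_id;                     /* sampling cluster identifier */
   weight weights;                         /* survey weights from the original post */
   model y = x1 x2 x3 x4 x5 / solution;
run;
```

The coefficient estimates should match a weighted least squares fit with the same weights; what changes is the standard errors, which are computed to reflect the clustering.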
How big is your sample? If it is small, I would recommend OLS, because it gives unbiased estimators, whereas the ML estimator (of the error variance in particular) is biased in small samples.
ML is recommended only when you have a lot of observations.
Xia Keshan