12-19-2016 02:43 PM
I tried to use PROC GLM to fit a model without an intercept term but with a weight variable. The SAS statements look like this: PROC GLM; Model _dependent variable_ = list of independent variables/noint; weight _weight variable_. The model fitting output shows the usual statistics: SSE, MSE, and R square. I then tried to calculate R square myself after outputting the actuals and fitted values, but I got a different value from the one in the SAS output. To calculate R square, I used the simple formula: R square = 1 - (residual sum of squares/total sum of squares). Since there was a weight variable, both squared terms were weighted by the weight variable for each observation before summing up, i.e., weight*(actual - fitted)^2 and weight*(actual - average of actuals)^2. Is there anything incorrect in this manual derivation of R square? Could anyone help clear it up? Thanks!
12-19-2016 02:59 PM
There is no need to guess. The SAS documentation includes a chapter that shows the basic statistics that are computed in regression procedures.
Your formulas for R-squared and SSE seem to match the formulas in the documentation. For the total sum of squares, did you use the weighted mean?
12-19-2016 03:50 PM
I just tried replacing the average of the actuals with the average of the weighted actuals in the total sum of squares calculation. This time R square becomes much smaller and even further from the R square in the SAS output.
12-20-2016 08:57 AM
Here is how to reproduce the numbers. Since you didn't provide data, I will use the following model:
proc glm data=sashelp.class plots=none;
   weight weight;
   model height = age;
   output out=Out Residual=Resid;
   ods select OverallAnova FitStatistics;
quit;
As you say, the R-squared value should be formed by the values in the "Sum of Squares" column in the OverallANOVA table. The following DATA _NULL_ step verifies the calculation:
data _null_;
   SS_Total = 43699.97089;
   SS_Error = 16000.45958;
   RSquared = 1 - SS_Error / SS_Total;
   put RSquared=;
run;
OK, so we know that R-squared is correct. How can we verify the SS_Total and SS_Error calculations? Well, SS_Total doesn't even use the model; it is just the corrected sum of squares for the response variable, weighted and centered at the weighted mean. Calling PROC MEANS reproduces the SS_Total:
proc means data=Sashelp.class CSS;
   weight weight;
   var height;
run;
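For reference, here is what the CSS statistic computes when a WEIGHT statement is present, written out as a formula. The symbols (w_i for the weight, y_i for the response, \bar{y}_w for the weighted mean) are my own notation, not something from the SAS output. Note that the mean being subtracted is the weighted mean, which divides by the sum of the weights rather than by the number of observations:

```latex
\bar{y}_w = \frac{\sum_i w_i\, y_i}{\sum_i w_i},
\qquad
SS_{\text{Total}} = \sum_i w_i \,\bigl(y_i - \bar{y}_w\bigr)^2
```

If the "average of the weighted actuals" was computed as the plain mean of w_i*y_i, it would not match this definition.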
What about the SS_Error? Well, that's just the weighted sum of the squared residuals. I output the residuals into the OUT data set. The following PROC MEANS verifies the SS_Error as the (uncorrected) weighted SS of the residuals:
proc means data=Out USS;
   weight weight;
   var Resid;
run;
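The example above fits an intercept, whereas the original question used NOINT. Here is a sketch of the same check for a no-intercept fit. As far as I know, PROC GLM reports an uncorrected total sum of squares when NOINT is specified, so the statistic to compare against SS_Total would be USS rather than CSS. If that is right, subtracting any mean (weighted or not) in the manual calculation would give a different R-squared. The data set name OutNoInt is just a name I made up for this sketch.

```sas
/* Sketch only: no-intercept version of the fit above, using sashelp.class */
proc glm data=sashelp.class plots=none;
   weight weight;
   model height = age / noint;
   output out=OutNoInt Residual=Resid;   /* OutNoInt is a hypothetical name */
   ods select OverallAnova FitStatistics;
quit;

/* Uncorrected weighted SS of the response: sum of w*y**2 (no mean subtracted) */
proc means data=sashelp.class USS;
   weight weight;
   var height;
run;

/* Weighted SS of the residuals: sum of w*resid**2 */
proc means data=OutNoInt USS;
   weight weight;
   var Resid;
run;
```

Compare the two USS values with the "Sum of Squares" column of the OverallANOVA table, then form 1 - SS_Error / SS_Total as in the DATA _NULL_ step.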