08-17-2013 05:28 AM
help on SAS Output
Hi everyone, new to the community and looking for some help.
I'm currently writing my thesis and am having difficulty interpreting my regression analysis. My mentor has just sent me the results of the regression analysis, and I have no idea where I am supposed to begin. I have some idea about p-values and how one can interpret collinearity diagnostics. However, I have no idea how to go further with the F-statistic, and I don't know whether I really need it. I have attached the results and would appreciate any feedback and advice you're able to offer. You could also suggest books or papers that are useful for interpretation.
thanks in advance
08-17-2013 10:47 AM
You appear to be showing the result of a multiple linear regression using the divm1 and ci1 as predictors and with MODEL1 as the dependent outcome variable.
The F - Statistic is given by the Mean Square Model (average variability given by model) divided by the Mean Square Error (average variability unexplained by the model).
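As a rough sketch of that ratio in Python: the sums of squares below are hypothetical placeholders (the real ones are in the Analysis of Variance table of the attached output), and the function name is my own. Only the sample size of 384 and the two predictors come from this thread.

```python
def f_statistic(ss_model, ss_error, df_model, df_error):
    """F = Mean Square Model / Mean Square Error."""
    ms_model = ss_model / df_model   # average variability explained per model df
    ms_error = ss_error / df_error   # average variability left unexplained per error df
    return ms_model / ms_error

n_obs = 384                            # sample size mentioned in the thread
n_predictors = 2                       # divm1 and ci1
df_model = n_predictors
df_error = n_obs - n_predictors - 1    # 381 error degrees of freedom

# Hypothetical sums of squares, NOT taken from the attached output.
f_value = f_statistic(ss_model=70.0, ss_error=30.0,
                      df_model=df_model, df_error=df_error)
print(f"F = {f_value:.1f} on {df_model} and {df_error} degrees of freedom")
```

The larger the model mean square is relative to the error mean square, the larger F becomes.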
If the F statistic is much larger than 1 (yours is much larger at 434), there is a low probability of this happening by accident, and we can say with more confidence that the model built on your predictors (divm1 and ci1) explains the variability in your dependent variable MODEL1 better than just guessing randomly.
As your F-statistic is 434, SAS has calculated that the chances of this happening randomly when there is no actual relation between divm1, ci1 and your dependent variable are less than 0.001%. This will pass pretty much any statistical threshold out there: your inputs do explain at least some of the variability seen in your output, and are probably better than guessing randomly.
The adjusted R-Square is at 69% so we can say that the predictors divm1 and ci1 account for roughly 69% of the variability seen within your dependent variable.
In the parameter section you can see a breakdown for each component of your model.
Intercept - This is the baseline prediction assigned to all cases, i.e. the predicted value of the dependent variable when divm1 and ci1 are both zero.
divm1 - The parameter estimate is 0.71799, meaning that when the divm1 characteristic for a case increases by 1 (with ci1 held constant), our guess for the value of the dependent variable increases by 0.71799. The t-test is similar to the F-test in that it tests the hypothesis of whether using the estimate of 0.71799 for divm1 is statistically superior to just guessing (using the sample average only). In this case we can see that we can say with more than 99.99% certainty that using divm1 as a predictor with the value 0.71799 is better than nothing.
ci1 - Same as above: for each unit increase in ci1 we can expect a 0.02188 increase in our outcome variable. We can say that this relationship actually exists with a certainty of 99.16%.
It is a good idea to set your significance threshold prior to the regression. It is common practice to give SAS a set alpha value so that it knows which predictors are significant. The default is alpha=0.05, meaning that when we are 95%+ sure of something we can keep it in our model.
08-17-2013 12:54 PM
I appreciate you taking the time to answer. But could you please explain how you reach 99.99% certainty by saying "in this case we can see that we can say with more than 99.99% certainty", referring to the t-test? Actually, in this case the t-value is 28.39.
Do you have any idea how to interpret the scatter plots of residuals vs. predicted values?
Indeed, the analysis was done by my mentor, so I don't know which significance threshold she used. But I will ask her before I write my report.
08-17-2013 02:49 PM
If there were truly no association between the independent variable, DIVM1, and the dependent variable, DIV1, and you repeated this study 1000's of times, the probability that the estimated association in this model would be as large as or larger than the observed regression coefficient of 0.71799 is less than 0.0001 [=the probability of a t-statistic at least as extreme as the one for this regression coefficient, 28.39 (=0.71799/0.02529), when the true coefficient equals 0.00]. Murray_Court just subtracted 0.0001 from 1.0000 to describe his interpretation of "99.99% certainty", which I don't endorse. Note that any conclusion about the association and the size of the association between DIVM1 and DIV1 depends on the model [that is, the other independent variables in the model, the form of those other independent variables (for example, the inclusion of squared terms or interaction terms), and the form of the model (=multiple linear or ordinary least-squares regression in your example)].
With respect to other statistics listed in your output, the last (third) "Bedingungs-index" (condition index) under "Collinearity Diagnostics" is less than 30 for all your models, indicating that your independent variables are not so highly correlated ("collinear") with one another that their presence together would make it difficult to estimate precisely the size of their associations with the dependent variable. In your first model, the Durbin-Watson statistic of 1.65 at a significance level of 0.05 and a sample size of 384 is inconclusive, and the estimated autocorrelation is not large enough to indicate serial correlation due to the order of the observations in your data set; thus, you do not need to account for serial correlation in this model.
With respect to the residual plots, the plots of the residuals or the studentized residuals on the Y-axis vs. the predicted values on the X-axis should resemble a horizontal band centered at Y=0.
In your first model, these plots resemble more a fan opening to the right, indicating that the variance may change with increasing size of the predicted value (that is, heteroskedasticity). Two large predicted values have very large negative studentized residuals, and several other predicted values have studentized residuals either larger or smaller than about two standard errors from zero, indicating potential outliers. The quantile plot of the residuals of the observations shows deviations from a straight line in both tails, also indicating negative (left-tail) and positive (right-tail) outliers. The plot of studentized residuals on the Y-axis against leverage values on the X-axis identifies two observations with leverage values exceeding 0.10, which may affect the results of the regression model substantially. The plot of the Cook's D-statistic (which detects observations with both high leverage values and outlying values) against observation number on the X-axis identifies two observations with large D-statistics near observation 270. Removing these observations or determining why they are so outlying and influential may affect the results and your interpretation of the model. The distribution plot and histogram show that the residuals are approximately normally distributed, though somewhat asymmetric.
Compare the statistics and the plots for this model to those from your other two models to see if the addition of other independent variables improves the model fit and "accommodates" the outlying and influential observations.
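For reference, the Durbin-Watson statistic mentioned in this reply can be computed from the residuals taken in data-set order. The following is a minimal Python sketch; the function name is my own and the residuals are made-up illustrative values, not values from the attached output. It also uses the standard approximation DW ≈ 2(1 - r), where r is the first-order autocorrelation of the residuals.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals in observation order (illustration only).
example_residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
print("DW for the made-up residuals:", durbin_watson(example_residuals))

# A value near 2 suggests little first-order autocorrelation.
# Using DW ~ 2 * (1 - r), the 1.65 quoted above implies roughly:
implied_autocorrelation = 1 - 1.65 / 2
print("approximate autocorrelation:", round(implied_autocorrelation, 3))
```

Values well below 2 point toward positive serial correlation and values well above 2 toward negative serial correlation; whether a given value is significant still depends on the tabulated bounds for the sample size and number of predictors.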
08-17-2013 03:44 PM
thank you for the reply.
"In your first model, these plots resemble more a fan opening to the right, indicating that the variance may change with increasing size of the predicted value (that is, heteroskedasticity). Two large predicted values have very large negative studentized residuals, and several other predicted values with studentized residuals either larger or smaller than about two standard errors from zero, indicating potential outliers." Did you mean by these sentences that the plots indicate inconclusive results? Actually, the interpretation of residuals is a little bit complicated for me.
08-17-2013 03:58 PM
No, these plots and findings do NOT indicate inconclusive results. They indicate that the assumptions of ordinary least squares regression may not be fully met with your data. The outlying observations should be checked to find out why they may be outlying (for example, incorrectly recorded values). The influential observations should be checked to determine whether their omission from the analysis greatly affects the regression results. The changing variance of the residual patterns with the increasing size of the predicted value (heteroskedasticity) may indicate that the dependent variable should be transformed or that the observations should be weighted unequally (two possible "fixes" for heteroskedasticity). Most good regression textbooks describe how to interpret residual plots and how to fix possible problems indicated by such plots; read them.
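To make the "fan" pattern more concrete, here is a crude Python sketch that compares the spread of residuals for the lower and upper halves of the predicted values. This is only an informal check, not a formal heteroskedasticity test, and everything in it (the function name, the simulated data, the seed) is hypothetical rather than taken from the attached output.

```python
import random
import statistics

def spread_ratio(predicted, residuals):
    """Residual SD for the upper half of predicted values divided by the SD for the lower half."""
    pairs = sorted(zip(predicted, residuals))   # order cases by predicted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)

rng = random.Random(0)                          # fixed seed so the sketch is repeatable
predicted = [rng.uniform(1, 10) for _ in range(400)]

# Simulated residuals whose spread grows with the predicted value (a fan shape).
fan_residuals = [rng.gauss(0, 0.2 * p) for p in predicted]
# Simulated residuals with constant spread (a horizontal band).
flat_residuals = [rng.gauss(0, 1) for _ in predicted]

print("fan shape ratio:", round(spread_ratio(predicted, fan_residuals), 2))
print("flat band ratio:", round(spread_ratio(predicted, flat_residuals), 2))
```

A ratio close to 1 is what a horizontal band would give; a ratio well above 1 corresponds to a fan opening to the right.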
08-17-2013 04:10 PM
Can you give me the names of the books? Today I had a quick look at the "Little SAS Book", but it was more about SAS programming and data analysis. I think I need more of the statistical foundations of econometric modelling.
08-17-2013 06:14 PM
Look at textbooks on regression in statistics. You can find a brief discussion of residuals in the SAS documentation for PROC REG. You can search for SAS books on regression at the SAS publication catalog: https://support.sas.com/pubscat/complete.jsp. You can search for regression books from other publishers at amazon.com or other online booksellers. Finally, search on Google for articles about "regression residuals".
08-17-2013 03:35 PM
Good question, Dilschad.
The t-statistic is given by dividing the parameter estimate by the standard error.
- If the parameter estimate is high (strong predictor) and the standard error is low (consistent predictor) then our t-statistic will be large.
- However if the parameter estimate is low (weak predictor) and the standard error is high (inconsistent predictor) then our t-statistic will be small.
We can see how having strong predictors with relatively small inconsistencies will make a predictor more desirable to a statistician.
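As a quick check of that division in Python, using the estimate and standard error for divm1 quoted in this thread (the variable names are my own):

```python
estimate = 0.71799    # parameter estimate for divm1, as quoted in the thread
std_error = 0.02529   # its standard error, as quoted in the thread

t_value = estimate / std_error
print(f"t = {t_value:.2f}")
```

This reproduces the t-value of 28.39 that Dilschad mentioned.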
If we assume there is no actual connection between divm1 and MODEL1, and the value of one has no effect on the value of the other, there must still be some chance that our 384 observations would indicate a strong relation between them (parameter estimate = 0.71799) with relatively few inconsistencies (standard error = 0.02529). We would expect the probability of such a well-fitting model appearing by chance to be very small, and to get smaller if more observations with similar trends were collected.
I have tried to find an explicit equation for how the specific probability is calculated, but statistically it does not matter what the actual Pr>|t| value is, so long as it is below 0.05, the value that we set at the beginning of the process. I have found a few sites outlining the procedure for finding confidence intervals for t-values, but I cannot find anything that matches the context of parameter estimates, and my calculations appear inconsistent.
Would any other community members know exactly how the Pr>|t| is calculated or how confidence intervals are set?
08-17-2013 03:49 PM
The description of the SAS PROBT function in the SAS Language Reference shows how Pr>|t| is calculated from the calculated t-statistic and the number of error degrees of freedom. For a given statistical significance level, alpha, and a given number of error degrees of freedom, one can calculate the corresponding t-statistic using the SAS TINV function. With 384 observations, this two-tailed t-statistic is close to the corresponding z-statistic (~= 1.96). Therefore, you can use 0.71799 +/- 1.96*0.02529 to calculate the two-tailed 95% confidence interval for this regression coefficient.
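A minimal Python sketch of the same idea, for readers without SAS at hand. It does not call PROBT or TINV. Instead it assumes the large-sample normal approximation mentioned above (reasonable with 381 error degrees of freedom) and uses the standard error of 0.02529 that matches the reported t-value of 28.39. The function name is my own.

```python
import math

def two_sided_p_normal(t_value):
    """Two-sided tail probability, using the standard normal as a large-sample stand-in for Student's t."""
    return math.erfc(abs(t_value) / math.sqrt(2.0))

estimate = 0.71799    # parameter estimate for divm1, as quoted in the thread
std_error = 0.02529   # standard error consistent with t = 28.39

t_value = estimate / std_error
p_value = two_sided_p_normal(t_value)   # approximation of Pr > |t|

critical = 1.96                          # z value for a two-tailed alpha of 0.05
lower = estimate - critical * std_error
upper = estimate + critical * std_error

print(f"t = {t_value:.2f}, approximate Pr > |t| = {p_value:.3g}")
print(f"approximate 95% CI: ({lower:.4f}, {upper:.4f})")
```

With far fewer observations the t critical value would be noticeably larger than 1.96, so the exact t-distribution (PROBT and TINV in SAS) should be preferred there.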
08-18-2013 05:07 AM
Thanks for that 1zmm,
0.71799 +/- 1.96*0.02529 are our 95% confidence limits for our parameter estimate.
This corresponds to an interval of (0.668422, 0.767558). Because zero is not included in this interval, we can say that we are at least 95% confident that the real parameter is not in fact zero. An alternative wording for this is that if we repeated this study again and again with independently collected observations, we would expect 95% of the intervals calculated this way to contain the real parameter value.
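One way to get a feel for what the 95% refers to is a small simulation. The Python sketch below is only an illustration under stated assumptions: it uses a simple one-predictor regression with normally distributed noise, a hypothetical true slope of 0.7, and simulated data, none of which come from the attached output. It repeats the "study" many times and counts how often the interval estimate +/- 1.96*SE contains the true slope.

```python
import math
import random

def slope_and_se(xs, ys):
    """Ordinary least squares slope and its standard error for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_error = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_error

def coverage(true_slope=0.7, n_obs=384, n_studies=500, seed=1):
    """Share of simulated studies whose 95% interval contains the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        xs = [rng.gauss(0, 1) for _ in range(n_obs)]
        ys = [true_slope * x + rng.gauss(0, 1) for x in xs]
        slope, std_error = slope_and_se(xs, ys)
        if slope - 1.96 * std_error <= true_slope <= slope + 1.96 * std_error:
            hits += 1
    return hits / n_studies

share = coverage()
print("share of intervals containing the true slope:", share)
```

The share should land near 0.95, which is the sense in which the interval is a "95%" interval.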
08-17-2013 04:15 PM
I think as long as the p-value is below 0.05, I can reject the null hypothesis. The following link states that alpha=0.05 by saying: Pr > |t| - This column shows the 2-tailed p-values used in testing the null hypothesis that the coefficient (parameter) is 0. Using an alpha of 0.05