☑ This topic is solved.
StellaPals
Obsidian | Level 7

I am writing my thesis and wanted to make a linear regression model, but unfortunately my data is not normally distributed. The assumptions of the linear regression model are normally distributed residuals and constant variance of the residuals, and neither is satisfied in my case. My supervisor told me: "You could create a regression model. As long as you don't start discussing the significance of the parameters, the model can be used for descriptive purposes." Is that really true? How would I describe a model like this, for example:

grade = -4.7 + 0.4*(math_exam_score) + 0.1*(sex)

if the variables might not even be relevant? Can I even say how big the effect was, e.g. that if the math exam score is one point higher, the grade is 0.4 higher? Also, the R-squared is quite low (7% on some models, around 35% on others), so the model isn't even that good at describing the grade.

 

Also, if I were to create that model, I have some conflicting exams. For example, the English exam can be taken either as a native speaker, or there is a simpler exam for those learning English as a second language, so very few people (if any) took both versions. Therefore, I can't really put both in one model; I would have to make two different ones. Since the same applies to the math exam (one simpler, one harder) and to an extra exam (that only a few people took), it would in the end take 8 models (1. simpler math & native English & sex, 2. harder math & native English & sex, 3. simpler math & English as a second language & sex, ..., 8. simpler math & native English & sex & extra exam). Seems pointless....

 

Any ideas? Thank you 🙂

1 ACCEPTED SOLUTION: see Season's reply below.

9 REPLIES
StellaPals
Obsidian | Level 7
Also, if the assumptions were satisfied and I made n separate models (grade = sex, grade = math_exam, and so on), would I need to use a Bonferroni correction (0.05/n)? Or would I still compare the p-values to 0.05?
Season
Barite | Level 11

@StellaPals wrote:
Also, if the assumptions were satisfied and I made n separate models (grade = sex, grade = math_exam, and so on), would I need to use a Bonferroni correction (0.05/n)? Or would I still compare the p-values to 0.05?

This has to do with multiple comparisons and multiple tests, which are closely tied to the investigator's specific question. Briefly, if answering your question entails multiple hypothesis tests, then it is necessary to employ such methods. For instance, if I have 20 groups of students who have finished an English exam and my research question is whether the mean English exam scores differ across all of the groups, then I have C(20,2) = 20!/(2!*18!) = 190 pairwise comparisons to do. Therefore, I have to use some method to correct the "raw" P values and compare the corrected P values with the threshold of statistical significance, which is usually 0.05 (equivalently, compare the raw P values with a corrected threshold). On the other hand, if I am only interested in whether the scores of group 1 and group 5 differ, then only one hypothesis test is needed, and comparing the "raw" P value with the significance threshold suffices.

By the way, the Bonferroni method is only one way of dealing with multiple comparisons and multiple tests. You can employ it for P value correction, but other choices are also available. For a more detailed introduction to multiple comparison methods, the circumstances suitable for each of them, and how to implement them in SAS, see Amazon.com: Multiple Comparisons and Multiple Tests Using SAS, Second Edition: 9781607647836: Westfa....
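The arithmetic of the Bonferroni adjustment is simple enough to sketch outside SAS. Here is a plain-Python illustration (the group count matches the 20-group example above; the raw P value of 0.0004 is made up):

```python
from math import comb

n_groups = 20
alpha = 0.05

# Number of pairwise comparisons among 20 groups: C(20, 2)
n_comparisons = comb(n_groups, 2)  # 190

# Bonferroni: compare each raw P value with alpha / m ...
bonferroni_threshold = alpha / n_comparisons

def bonferroni_adjust(p, m):
    """... or equivalently multiply each raw P value by the number of tests m,
    capped at 1, and compare the result with alpha."""
    return min(1.0, p * m)

# A raw P value of 0.0004 looks impressive on its own, but adjusted for
# 190 comparisons it is no longer significant at the 0.05 level.
print(n_comparisons)                             # 190
print(bonferroni_adjust(0.0004, n_comparisons))  # ~0.076 > 0.05
```

Note that in the n-separate-models situation from the question, m would be n rather than 190; the same alpha/m logic applies either way.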

Season
Barite | Level 11

@StellaPals wrote:

I am writing my thesis and wanted to make a linear regression model, but unfortunately my data is not normally distributed. The assumptions of the linear regression model are normally distributed residuals and constant variance of the residuals, and neither is satisfied in my case. My supervisor told me: "You could create a regression model. As long as you don't start discussing the significance of the parameters, the model can be used for descriptive purposes." Is that really true?


As long as the residuals have means (i.e., mathematical expectations) equal to zero, the ordinary least squares (OLS) estimator of the linear regression coefficients is unbiased. Therefore, your supervisor's words are true as long as the foregoing assumption holds. But I need to point out that other nice properties of the OLS estimator are undermined if not all of the assumptions of linear regression are satisfied.

PaigeMiller
Diamond | Level 26

I don't think you have to assume that the mean of the residuals is zero. You can always fit the line using the least squares algorithm without making any assumptions, and the algorithm will always produce the line that minimizes the sum of squares of the residuals (hence the name "least squares"). You can then use this fitted line, if you want, to make predictions, or to state that it has a certain slope and intercept that minimize the sum of squared residuals; no assumptions needed. Only once you start hypothesis testing or computing confidence intervals are assumptions needed.

 

As a consequence of fitting via this algorithm, the residuals have a mean of zero (provided the model includes an intercept).
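This can be checked numerically. Below is a minimal plain-Python sketch with made-up data: it fits a simple regression by the closed-form least squares formulas and confirms that the residuals average to zero, with no distributional assumptions whatsoever.

```python
# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form simple-regression (least squares) estimates.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# With an intercept in the model, the residuals sum to zero
# (up to floating-point error) purely by construction.
print(abs(sum(residuals)) < 1e-9)  # True
```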

--
Paige Miller
Season
Barite | Level 11

Please take a second look at my message. Here is an excerpt:

As long as the residuals have means (i.e., mathematical expectations) equal zero, the ordinary least squares (OLS) estimator of linear regression coefficients is unbiased.

I was talking about the conditions needed for an OLS estimator to be unbiased. It is because of the unbiasedness of estimators that @StellaPals's supervisor's words are tenable.

You can use OLS to fit a line even if your data meet none of the assumptions of linear regression, but in that case, the OLS estimator may not be unbiased. Therefore, @StellaPals's supervisor's words may be incorrect.


PaigeMiller
Diamond | Level 26

@Season wrote:

As long as the residuals have means (i.e., mathematical expectations) equal zero ...


Do you mean "As long as the errors have means equal zero"?

 

I was talking about the conditions needed for an OLS estimator to be unbiased. It is because of the unbiasedness of estimators that @StellaPals's supervisor's words are tenable.

 

People find that biased estimates are usable and useful in the right situations. I did not interpret the supervisor's words as being about the unbiasedness of the estimators.

--
Paige Miller
Season
Barite | Level 11

@PaigeMiller wrote:

@Season wrote:

As long as the residuals have means (i.e., mathematical expectations) equal zero ...


Do you mean "As long as the errors have means equal zero"?


Yes, I do. I was using "errors" and "residuals" interchangeably; strictly speaking, the residuals are the observed estimates of the unobserved error terms.


@PaigeMiller wrote:

I was talking about the conditions needed for an OLS estimator to be unbiased. It is because of the unbiasedness of estimators that @StellaPals's supervisor's words are tenable.

 

People find that biased estimates are usable and useful in the right situations. I did not interpret the supervisor's words as being about the unbiasedness of the estimators.


It would be more appropriate to phrase your first quoted sentence as "People find that certain biased estimators are usable and useful in the right situations". Biased estimators must have some nice properties that make them outperform their unbiased counterparts. If none of the assumptions of multiple linear regression is met, then OLS is simply one particular biased estimator. There is no guarantee that it has nice properties that would persuade people to continue using it despite its biasedness. In that situation, one can still use OLS anyway, but why not resort to something else?

Therefore, simply using OLS in the presence of violations of all the assumptions is not the best way to deal with the problem. More advanced modeling methods have to be employed. In other words, @StellaPals's supervisor's words are incorrect in that scenario.

Season
Barite | Level 11

@StellaPals wrote:

How can I describe a model like this for example:

grade = -4.7 + 0.4*(math_exam_score) + 0.1*(sex)

if the variables might not even be relevant? Can I even say how big the effect was, e.g. that if the math exam score is one point higher, the grade is 0.4 higher? Also, the R-squared is quite low (7% on some models, around 35% on others), so the model isn't even that good at describing the grade.


Whether a variable is "relevant" can be examined in two ways: subject matter knowledge and statistical significance. You use subject matter knowledge to select the variables you wish to put into the model in an a priori manner; then the statistical significance of each variable is tested in SAS. You can describe the association between each independent variable and the dependent variable regardless of the value R-squared takes.
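On the R-squared point: R-squared is simply the proportion of variance in the dependent variable that the fitted model accounts for, so a low value means the model explains little of the variation in grade, not that the coefficient estimates themselves are meaningless. A quick plain-Python sketch with made-up observed and predicted values:

```python
# Hypothetical observed grades and corresponding model predictions.
ys    = [3.0, 4.0, 5.0, 6.0, 7.0]
preds = [3.5, 3.8, 5.1, 5.9, 6.7]

mean_y = sum(ys) / len(ys)
ss_tot = sum((y - mean_y) ** 2 for y in ys)            # total variation
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 2))  # 0.96
```

An R-squared of 7% would simply mean ss_res is 93% as large as ss_tot: the model leaves most of the variation in grade unexplained, even though each coefficient still describes an association.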


@StellaPals wrote:

Also, if I were to create that model, I have some conflicting exams. For example, the English exam can be taken either as a native speaker, or there is a simpler exam for those learning English as a second language, so very few people (if any) took both versions. Therefore, I can't really put both in one model; I would have to make two different ones. Since the same applies to the math exam (one simpler, one harder) and to an extra exam (that only a few people took), it would in the end take 8 models (1. simpler math & native English & sex, 2. harder math & native English & sex, 3. simpler math & English as a second language & sex, ..., 8. simpler math & native English & sex & extra exam). Seems pointless....

 

Any ideas? Thank you 🙂


I am not sure what topic you wish to investigate. For instance, does the difficulty of the exams matter? You have to figure out the question you wish to answer before building statistical models.


Discussion stats
  • 9 replies
  • 1010 views
  • 3 likes
  • 4 in conversation