This is my first time posting to the SAS community so thank you all for your help!
I have more of a statistical theory question. In a linear regression model, would it be statistically sound to use the average of the dependent variable as an independent variable?
Specifically, I am building a model to predict the number of days until project completion. I am interested in creating a variable named “average days” which is the average amount of time to complete a project by zip code. For the aggregation, a given data point’s project time would not contribute to the average for that particular observation, but would be used for other data points in the same zip code. The new variable “average days” is not strongly correlated with other variables in the model (strongest correlation is 0.4 with one other variable that determines the likelihood that a permit will be needed for the job). I attached a sample of the code if interested.
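Roughly, the aggregation would look something like the sketch below (the dataset and variable names, projects, zip, and days, are simplified placeholders rather than the actual attached code):
proc sql;
   /* Per-zip totals and counts of completed projects */
   create table zip_stats as
   select zip,
          sum(days) as sum_days,
          count(*)  as n_projects
   from projects
   group by zip;

   /* Leave-one-out average: subtract the current project's own days before
      dividing, so an observation never contributes to its own predictor value */
   create table want as
   select a.*,
          case when b.n_projects > 1
               then (b.sum_days - a.days) / (b.n_projects - 1)
               else .
          end as average_days
   from projects as a
        inner join zip_stats as b
        on a.zip = b.zip;
quit;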
Assuming there is no multicollinearity and/or model overfitting, would there be statistical or mathematical concerns with this method? If so, could you explain why?
In a linear regression model, would it be statistically sound to use the average of the dependent variable as an independent variable?
In general, no. You can't get around this by relabeling things: a function of the dependent variable does not become a legitimate independent variable just because you call it one.
Instead of doing what you described, perhaps zip code could be a categorical predictor, or you could use some other measure of the zip code, such as average income or average age, that you deem relevant.
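For instance, a minimal sketch of that alternative, with hypothetical dataset and variable names (projects, days, zip, permit_score), could be:
proc glm data=projects;
   class zip;                                   /* zip code as a categorical effect */
   model days = zip permit_score / solution;    /* other predictors enter as usual */
run;
quit;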
Thank you for your reply. Would you mind explaining, mathematically or in terms of statistical assumptions, why it wouldn't be feasible? I understand this is something that is not done often, but I am trying to understand specifically why it isn't.
The whole point of fitting a model to data is to determine which independent variables are useful in predicting the dependent variable. Including some function of the dependent variable in the model as an independent variable violates that fundamental principle.
As an extreme case, if you wanted to build a model to predict temperature in Fahrenheit, you could use a function of temperature Fahrenheit, such as temperature Celsius, as an independent variable, and you would get an R-squared of 1. Obviously, that is not a valid use of modeling. But where do you draw the line: which functions are allowed and which are not? I would say draw the line at not allowing ANY function of the dependent variable into the model as an independent variable.
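To make the extreme case concrete, here is a small simulation sketch (arbitrary made-up temperatures); regressing Fahrenheit on Celsius, which is just a deterministic function of the response, gives an R-squared of 1:
data temps;
   do day = 1 to 30;
      fahrenheit = 60 + 30*ranuni(1);          /* arbitrary simulated values */
      celsius    = (fahrenheit - 32) * 5/9;    /* exact function of the response */
      output;
   end;
run;

proc reg data=temps;
   model fahrenheit = celsius;    /* R-squared will be 1 */
run;
quit;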
For your Fahrenheit example, basically what I am asking is, what is the statistical issue with using the average of August 2022 temperature to predict August 29, 2023's temperature?
Seems like you have changed the question to use last year's temperature to predict this year's temperature. I don't have a problem with that. You are not using the dependent variable (this year's temperature) or a function of this year's temperature as an independent variable in the prediction.
@smithcl13 wrote:
For your Fahrenheit example, basically what I am asking is, what is the statistical issue with using the average of August 2022 temperature to predict August 29, 2023's temperature?
I was actually involved in analyzing the behavior of two different approaches to simulating weather variables such as daily max/min temperatures and precipitation. One model used a monthly parameterization, which was basically a monthly average plus a couple of range parameters to provide variety. When graphing a summary of simulated mean/max/min temperatures across calendar dates, there was a noticeable stair-step pattern in the generated values. The first of each month showed a marked "jump" in temperatures, and values stayed fairly similar from the start to the end of the month. When a similar summary of the historical record was superimposed, you could see where the simulation around the start/end of each month did not track the more gradual recorded values.
You can read an online version of the PDF at the link below. Page 13 shows the graphs.
https://journals.ametsoc.org/view/journals/apme/35/10/1520-0450_1996_035_1878_swsoaa_2_0_co_2.xml
One of the conclusions we reached was that, if the simulation were used to predict weather effects such as floods, the monthly parameters could shift the timing of likely events considerably toward the start/end of the calendar month. The rather abrupt changes would also affect temperature-related quantities such as power demand for heating/cooling or snow-melt run-off volumes.
Check if you have SAS/ETS licensed. It has procedures for doing TIME SERIES analysis, which is what you are proposing.
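For example, a minimal sketch with SAS/ETS (the dataset and variable names, daily_temps, temp, and date, are placeholders):
proc setinit;    /* lists licensed products, including SAS/ETS, in the log */
run;

proc arima data=daily_temps;
   identify var=temp;                         /* inspect the series */
   estimate p=1;                              /* e.g., fit an AR(1) model */
   forecast lead=30 id=date interval=day;     /* forecast the next 30 days */
run;
quit;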
@smithcl13 wrote:
This is my first time posting to the SAS community so thank you all for your help!
I have more of a statistical theory question. In a linear regression model, would it be statistically sound to use the average of the dependent variable as an independent variable?
Specifically, I am building a model to predict the number of days until project completion. I am interested in creating a variable named “average days” which is the average amount of time to complete a project by zip code. For the aggregation, a given data point’s project time would not contribute to the average for that particular observation, but would be used for other data points in the same zip code. The new variable “average days” is not strongly correlated with other variables in the model (strongest correlation is 0.4 with one other variable that determines the likelihood that a permit will be needed for the job). I attached a sample of the code if interested.
How would that aggregate be used for other data points? Is this to impute some value that is occasionally missing for some records?
I am afraid that SQL code doesn't really help this discussion.
For each observation i, you propose to use as a regressor the mean of the dependent variable y, excluding the i-th value of y.
But I believe this means y_i is really being regressed on itself. Consider the algebra below, starting with my understanding of your proposed regression model:
y_i = α + β1(mean of y excluding y_i) + β2 x_i + … + e_i
which, after a little algebra (remember that n ȳ is the sum of all the y_i), is
y_i = α + β1(n ȳ - y_i)/(n-1) + β2 x_i + … + e_i
which becomes
y_i = α + β1 n ȳ/(n-1) - β1 y_i/(n-1) + β2 x_i + … + e_i
The first two terms above, α and β1 n ȳ/(n-1), are just constants. So define a new constant term,
α2 ≡ α + β1 n ȳ/(n-1)
which means your regression is really
y_i = α2 - β1 y_i/(n-1) + β2 x_i + … + e_i
This looks like a regression of y_i on itself. In fact, I would guess it yields an R-squared of one, which is what happens when I simulated my understanding of your proposal below:
/* Simulate the proposal: use a leave-one-out mean of the dependent variable
   (weight) as a regressor.  sashelp.class has 19 observations. */
data have;
   set sashelp.class (in=firstpass) sashelp.class (in=second_pass);
   if firstpass then total_wgt + weight;                     /* first pass: accumulate the grand total */
   if second_pass;                                           /* keep only the second-pass observations */
   mean_excluding_current_wgt = (total_wgt - weight) / 18;   /* leave-one-out mean, n - 1 = 18 */
run;

proc reg data=have;
   model weight = age mean_excluding_current_wgt;
run;
quit;