Is it statistically sound to include a form of the dependent variable as an independent variable?

smithcl13 · Posted 08-28-2023 02:36 PM

This is my first time posting to the SAS community so thank you all for your help!

I have a more statistical theory question. In a linear regression model, would it be statistically sound to use the average of the dependent as an independent variable?

Specifically, I am building a model to predict the number of days until project completion. I am interested in creating a variable named “average days” which is the average amount of time to complete a project by zip code. For the aggregation, a given data point’s project time would not contribute to the average for that particular observation, but would be used for other data points in the same zip code. The new variable “average days” is not strongly correlated with other variables in the model (strongest correlation is 0.4 with one other variable that determines the likelihood that a permit will be needed for the job). I attached a sample of the code if interested.

Assuming there is no multicollinearity and/or model overfitting, would there be statistical or mathematical concerns with this method? If so, could you explain why?
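The "average days" construction described above can be sketched as follows. This is only an illustration under assumptions: the zip codes and day counts are made-up values, and `leave_one_out_means` is a hypothetical helper name, not the code attached to the post.

```python
from collections import defaultdict


def leave_one_out_means(groups, values):
    """For each observation, return the mean of `values` over the OTHER
    observations in the same group (None if the group has one member)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1

    result = []
    for g, v in zip(groups, values):
        n = counts[g]
        # Subtract the current value from the group total, divide by n - 1.
        result.append((totals[g] - v) / (n - 1) if n > 1 else None)
    return result


# Hypothetical data: zip code and days to completion for six projects.
zips = ["27601", "27601", "27601", "27513", "27513", "27999"]
days = [10.0, 20.0, 30.0, 40.0, 60.0, 15.0]

average_days = leave_one_out_means(zips, days)
for z, d, a in zip(zips, days, average_days):
    print(z, d, a)
```

Within a zip code, the larger an observation's own value, the smaller its leave-one-out average, because the group total is fixed. That mechanical link between the regressor and the response is what the replies in this thread are concerned with.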

PaigeMiller · Posted 08-28-2023 03:25 PM

In a linear regression model, would it be statistically sound to use the average of the dependent as an independent variable?

In general, no. You can't play with words and turn some function of the dependent variable into an independent variable.

Instead of doing what you described, perhaps zip code could be a categorical predictor, or you could use some other measure of the zip code, such as average income or average age or something else that you deem relevant.

--
Paige Miller

smithcl13 · Posted 08-28-2023 04:01 PM

Thank you for your reply. Would you mind explaining why, mathematically or in terms of statistical assumptions, it wouldn't be feasible? I understand this is something that does not get done often, but I am trying to understand specifically why it isn't.

PaigeMiller · Posted 08-29-2023 07:20 AM

The whole point of fitting a model to data is to determine which independent variables are useful in predicting the dependent variable. Including some function of the dependent variable in the model as an independent variable violates this fundamental principle.

In an extreme case, if you wanted to create a model to predict temperature Fahrenheit, you could use a function of temperature Fahrenheit such as temperature Celsius as an independent variable, and you would get an R-squared = 1. Obviously, this is not a valid use of modeling. But where do you draw the line, what functions are allowed and what functions are not allowed? I would say ... draw the line at not allowing ANY functions of the dependent variable into the model as independent variables.
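A minimal sketch of the extreme case above: Celsius is an exact linear function of Fahrenheit, so a simple regression of one on the other has R-squared equal to 1. The temperatures are made-up values, and `r_squared` is a hypothetical helper that uses the fact that, for a single predictor, R-squared is the squared Pearson correlation.

```python
def r_squared(x, y):
    """R-squared of the simple linear regression of y on x,
    computed as the squared Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical temperatures in Celsius and their exact Fahrenheit equivalents.
celsius = [-5.0, 0.0, 12.5, 20.0, 31.0]
fahrenheit = [c * 9.0 / 5.0 + 32.0 for c in celsius]

print(r_squared(celsius, fahrenheit))
```

The fit is perfect, yet it says nothing about what drives temperature, which is the sense in which such a model is not a valid use of modeling.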

--
Paige Miller

smithcl13 · Posted 08-29-2023 10:05 AM

For your Fahrenheit example, basically what I am asking is, what is the statistical issue with using the average of August 2022 temperature to predict August 29, 2023's temperature?

PaigeMiller · Posted 08-29-2023 10:18 AM

Seems like you have changed the question to use last year's temperature to predict this year's temperature. I don't have a problem with that. You are not using the dependent variable (this year's temperature) or a function of this year's temperature as an independent variable in the prediction.

--
Paige Miller

ballardw · Posted 08-29-2023 11:20 AM

@smithcl13 wrote:

For your Fahrenheit example, basically what I am asking is, what is the statistical issue with using the average of August 2022 temperature to predict August 29, 2023's temperature?

I was actually involved in analyzing the behavior of two different approaches to weather simulation, covering quantities such as daily max/min temperatures and precipitation. One model used a monthly parameterization, which was basically a monthly average plus a couple of range parameters to provide variety. When we graphed a summary of simulated mean/max/min temperatures across calendar dates, there was a notable stair-step in the generated values. The first of the month would show a marked "jump" in temperatures, and values were then pretty similar from the start to the end of the month. When we superimposed a historical record summarized the same way, the differences were easy to see: around the start and end of each month, the simulation did not track the more gradual recorded values.

You can read an online version of the PDF at the link below. Page 13 shows the graphs.

https://journals.ametsoc.org/view/journals/apme/35/10/1520-0450_1996_035_1878_swsoaa_2_0_co_2.xml

One of our conclusions was that, if a simulation with monthly parameters were used to predict weather effects such as floods, the timing of likely events could be shifted considerably toward the start or end of a calendar month. The rather abrupt changes would also affect temperature-related quantities such as power demand for heating/cooling or snowmelt runoff volumes.
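The stair-step effect can be illustrated with a small sketch. It assumes a synthetic, smooth annual temperature cycle (a sinusoid with made-up parameters, not data from the paper) and replaces each day by its calendar-month mean; it leaves out the range parameters the monthly model also used.

```python
import math

MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # non-leap year


def smooth_temp(day):
    """Hypothetical smooth annual cycle in degrees F; day runs 0..364."""
    return 55.0 + 25.0 * math.sin(2.0 * math.pi * (day - 105) / 365.0)


daily = [smooth_temp(d) for d in range(365)]

# Monthly parameterization: every day in a month gets that month's mean.
monthly_param = []
start = 0
for length in MONTH_LENGTHS:
    month_mean = sum(daily[start:start + length]) / length
    monthly_param.extend([month_mean] * length)
    start += length

# Largest day-to-day change in each series.
largest_daily_step = max(abs(daily[d + 1] - daily[d]) for d in range(364))
largest_boundary_jump = max(
    abs(monthly_param[d + 1] - monthly_param[d]) for d in range(364)
)

print("largest day-to-day change, smooth series:", largest_daily_step)
print("largest day-to-day change, monthly means:", largest_boundary_jump)
```

The monthly-mean series is flat within a month and changes only at month boundaries, so its largest day-to-day change is many times that of the smooth series.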

Tom · Posted 08-29-2023 09:38 PM

Check if you have SAS/ETS licensed. It has procedures for doing TIME SERIES analysis, which is what you are proposing.

ballardw · Posted 08-28-2023 04:04 PM

@smithcl13 wrote:

This is my first time posting to the SAS community so thank you all for your help!

I have a more statistical theory question. In a linear regression model, would it be statistically sound to use the average of the dependent as an independent variable?

Specifically, I am building a model to predict the number of days until project completion. I am interested in creating a variable named “average days” which is the average amount of time to complete a project by zip code. For the aggregation, a given data point’s project time would not contribute to the average for that particular observation, but would be used for other data points in the same zip code. The new variable “average days” is not strongly correlated with other variables in the model (strongest correlation is 0.4 with one other variable that determines the likelihood that a permit will be needed for the job). I attached a sample of the code if interested.

How would that aggregate be used for other data points? Is this to impute some value that is occasionally missing for some records?

I am afraid that SQL code doesn't really help this discussion.

mkeintz · Posted 08-29-2023 07:26 PM

For each observation i, you propose to use as a regressor the mean of the dependent variable y, excluding the i'th value of y.

But I believe this means y_i is really being regressed on itself. Consider the algebra below, starting with my understanding of your proposed regression model:

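$$y_i = \alpha + \beta_1 (\text{mean of } y \text{ excluding } y_i) + \beta_2 x_i + \dots + e_i$$

which, after a little algebra (remember $n\bar{y}$ is the sum of all $y_i$), is

$$y_i = \alpha + \beta_1 \frac{n\bar{y} - y_i}{n-1} + \beta_2 x_i + \dots + e_i$$

which becomes

$$y_i = \alpha + \frac{\beta_1 n\bar{y}}{n-1} - \frac{\beta_1 y_i}{n-1} + \beta_2 x_i + \dots + e_i$$

The first two terms above, $\alpha$ and $\beta_1 n\bar{y}/(n-1)$, are just constants. So just define a new constant term,

$$\alpha_2 \equiv \alpha + \frac{\beta_1 n\bar{y}}{n-1}$$

which means your regression is really

$$y_i = \alpha_2 - \frac{\beta_1 y_i}{n-1} + \beta_2 x_i + \dots + e_i$$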

This looks like a regression of y_i on itself. In fact, I guess it would yield an r-squared of one, which is what happens when I simulate my understanding of your proposal below:

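/* Read sashelp.class twice: pass 1 accumulates total weight, pass 2 builds the regressor */
data have;
  set sashelp.class (in=firstpass) sashelp.class (in=second_pass);
  if firstpass then total_wgt+weight;   /* sum statement: running total, retained across rows */

  if second_pass;                       /* keep only second-pass rows, where total_wgt is complete */
  /* leave-one-out mean: sashelp.class has 19 rows, so divide by n-1 = 18 */
  mean_excluding_current_wgt=(total_wgt-weight)/18;
run;

/* Regress weight on age and the leave-one-out mean of weight */
proc reg data=have;
  model weight=age mean_excluding_current_wgt;
  run;
quit;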
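The same point can be checked without SAS. The sketch below uses made-up weights (not the sashelp.class values) and hypothetical helper names; it computes the leave-one-out mean over a single group and confirms that it is an exact linear function of y, with correlation -1 and a slope of -(n-1) when y is regressed on it.

```python
def centered_sums(x, y):
    """Return (Sxx, Syy, Sxy), the centered sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy


# Hypothetical response values for one group.
weights = [84.0, 98.0, 102.5, 112.0, 128.0, 150.0]
n = len(weights)
total = sum(weights)

# Leave-one-out mean: (n * ybar - y_i) / (n - 1)
loo_mean = [(total - w) / (n - 1) for w in weights]

sxx, syy, sxy = centered_sums(loo_mean, weights)
correlation = sxy / (sxx * syy) ** 0.5
slope = sxy / sxx  # slope from regressing weights on loo_mean

print("correlation:", correlation)
print("slope:", slope)
```

This covers the single-group case simulated above. If the averages are taken within zip codes, the same exact relationship holds inside each zip code rather than across the whole data set.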

