topic Re: Force a regression coefficient to be negative in Statistical Procedures

Force a regression coefficient to be negative

wutao9999 — Thu, 09 Apr 2015 14:09:31 GMT

I have a regression problem using variable X to predict Y. That is, Y = c + A*X + error.

For the regression to be meaningful, A must be negative. However, because of unknown noise or unknown factors, the regression sometimes returns a positive estimate of A. I am struggling to find a statistical way to force A to be negative. Do you know any way to do this?

One approach I am considering: if the estimate of A is positive, I remove the point that most strongly pushes A positive, then rerun the regression. Repeating this, after removing a few data points, the estimate of A can become negative. Is there any statistical method in the research literature that supports removing a few data points to force the regression coefficient A into a preferred range (such as negative)? I would appreciate your answer.

Re: Force a regression coefficient to be negative

PaigeMiller — Thu, 09 Apr 2015 15:33:06 GMT

There are mathematical techniques which can impose such a restriction on a linear regression, and you can even read about some of them under “linear least squares” at Wikipedia. You'd have to program this in PROC IML or search to see if someone else has done this already.
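For what it's worth, one way to try this without PROC IML is PROC NLIN, which fits by least squares and accepts a BOUNDS statement. A minimal sketch, not a recommendation: the data set name demo and the parameter names c and a are made up for illustration, and the ten points are the toy data lvm posts later in this thread.

```sas
/* Toy data: the same ten points as lvm's example later in this thread */
data demo;
   input x y @@;
   datalines;
0 50  1 45  2 47  3 45  4 44
5 40  6 38  7 41  8 80  9 35
;
run;

/* Least squares fit of y = c + a*x with the slope restricted to a <= 0 */
proc nlin data=demo;
   parms c=45 a=-1;      /* starting values, chosen arbitrarily */
   bounds a <= 0;        /* the sign restriction on the slope   */
   model y = c + a*x;
run;
```

When the unrestricted slope is positive, as it is for these points, the restricted estimate should end up on the bound (a = 0), so check the output for that before reading anything into it.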

Your proposed method could also be programmed either via PROC IML or via a macro loop.

I am somewhat skeptical about the validity of either approach. One of the reasons we use data to estimate things is to see if the data is consistent with the underlying theory, and if you get a positive slope, that is telling you something! If it were me, I'd see if there were substantive reasons to remove certain pieces of data (e.g. recording errors, or the process by which the data was generated was compromised, etc.). If there were no obvious reasons to remove certain pieces of data, then you've got a real problem: the data does not support the theory. In that case, I think it's pointless to force the slope to be negative, and I would not feel that there is any validity to such an estimate.

Re: Force a regression coefficient to be negative

JacobSimonsen — Thu, 09 Apr 2015 16:12:12 GMT

I agree that it may not be wise to put such a bound on the parameters. Also, if some of the estimates happen to lie on the bound, the p-values will not be valid.

Nevertheless, if you still insist on doing this, it is easy to put restrictions on the parameters in PROC NLMIXED, which can handle most regression models. I found this example in the SAS documentation where they use a bound on the parameters: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_nlmixed_examples03.htm

Jacob

Re: Force a regression coefficient to be negative

wutao9999 — Thu, 09 Apr 2015 17:35:49 GMT

Can least-squares linear regression be done in NLMIXED? Thank you.

Re: Force a regression coefficient to be negative

PaigeMiller — Thu, 09 Apr 2015 17:40:24 GMT

As far as I know, NLMIXED produces maximum likelihood estimates and not ordinary least squares estimates. There may be some situations where the OLS and ML estimates are guaranteed to be the same.

Re: Force a regression coefficient to be negative

JacobSimonsen — Thu, 09 Apr 2015 17:44:26 GMT

Hi PaigeMiller,

If I remember right, if you consider your data to be normally distributed, then the maximum likelihood estimate will also be the estimate obtained by ordinary least squares. Therefore I also think that NLMIXED can solve the problem.
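A minimal sketch of what that could look like in PROC NLMIXED, assuming normally distributed errors. The data set name demo, the parameter names c, a and s2, and the starting values are made up for illustration; the ten points are the toy data lvm posts later in this thread.

```sas
/* Toy data: the same ten points as lvm's example later in this thread */
data demo;
   input x y @@;
   datalines;
0 50  1 45  2 47  3 45  4 44
5 40  6 38  7 41  8 80  9 35
;
run;

/* Maximum likelihood fit of y = c + a*x + error, error ~ N(0, s2), */
/* with the slope bounded above by zero                              */
proc nlmixed data=demo;
   parms c=45 a=-1 s2=100;     /* arbitrary starting values        */
   bounds a <= 0, s2 > 0;      /* sign restriction; variance > 0   */
   mu = c + a*x;               /* linear predictor                 */
   model y ~ normal(mu, s2);   /* normal likelihood                */
run;
```

Without the bound on a, the estimates of c and a from this likelihood would match the ordinary least squares line; s2 is the ML variance estimate, which divides by n rather than n - 2.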

Good luck. Jacob

Re: Force a regression coefficient to be negative

lvm — Thu, 09 Apr 2015 18:53:14 GMT

I would definitely not force the coefficient to be negative in the overall estimation. That would give meaningless results. However, you can determine the influence of each observation on the estimated parameters (and on the estimated fit). In PROC REG, use:

model y = x / influence;

This will give all kinds of information for each observation, including DFBETAS (the scaled change in each coefficient when that observation is deleted). Cook's distance, an overall measure of each observation's influence, is printed when you add the R option. You can read about this in the REG chapter.

Re: Force a regression coefficient to be negative

PaigeMiller — Thu, 09 Apr 2015 19:50:56 GMT

JacobSimonsen wrote:

If I remember right, if you consider your data to be normally distributed, then the maximum likelihood estimate will also be the estimate obtained by ordinary least squares.

It's the errors, not the data itself, that have to be normally distributed.
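To spell out why normal errors are what matters, here is the standard derivation, written for the model Y = c + A*X + error from the original question, with independent errors assumed to be N(0, sigma^2):

```latex
\[
\ell(c, A, \sigma^2)
  = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - c - A x_i\right)^2 .
\]
% For any fixed sigma^2 > 0, maximizing \ell over (c, A) is the same as
% minimizing the residual sum of squares, so the ML and OLS estimates of
% c and A coincide. Only the variance estimate differs:
\[
\hat\sigma^2_{\mathrm{ML}} = \frac{\mathrm{SSE}}{n},
\qquad
\hat\sigma^2_{\mathrm{OLS}} = \frac{\mathrm{SSE}}{n-2}.
\]
```

Nothing in this argument requires X or Y themselves to look normal.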

Re: Force a regression coefficient to be negative

stat_sas — Fri, 10 Apr 2015 02:32:15 GMT

If not concerned about the interpretation, just multiply X by -1 and run regression. This will give you negative A.

Re: Force a regression coefficient to be negative

PaigeMiller — Fri, 10 Apr 2015 13:40:27 GMT

stat@sas wrote:

If not concerned about the interpretation, just multiply X by -1 and run regression. This will give you negative A.

well, um ...

How would you ever justify this method to a reviewer or colleague?

Re: Force a regression coefficient to be negative

PaigeMiller — Fri, 10 Apr 2015 13:47:37 GMT

JacobSimonsen wrote:

I have been thinking about this. If you have data that without restrictions, has a positive slope, and you restrict the estimation method to produce a negative slope, wouldn't you get something negative but extremely close to zero? (I admit I haven't tried this on actual data.) And if that is what happens, then again I ask what is the validity of such an estimate, when the data contradicts or does not support the established theory that the slope must be negative?

A similar situation: isn't this what happens in variance components estimation, when the Type 1 estimate would be negative but REML or ML, with the variance component restricted to be non-negative, usually winds up with a zero estimate?
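For the simple model Y = c + A*X + error this can be checked on paper. A short derivation, using the usual centered sums of squares S_xx, S_xy and S_yy (notation introduced here, not used elsewhere in the thread):

```latex
% Residual sum of squares, and its profile after minimizing over c
\[
S(c, A) = \sum_{i=1}^{n} (y_i - c - A x_i)^2,
\qquad
\hat c(A) = \bar y - A\,\bar x .
\]
\[
S\bigl(\hat c(A), A\bigr)
  = S_{yy} - 2A\,S_{xy} + A^2 S_{xx}
  = S_{xx}\,(A - \hat A)^2 + \Bigl(S_{yy} - \tfrac{S_{xy}^2}{S_{xx}}\Bigr),
\qquad
\hat A = \frac{S_{xy}}{S_{xx}} .
\]
% The profile is a parabola in A with its minimum at the OLS slope.
% If \hat A > 0, the minimum over A <= 0 is on the boundary:
\[
\hat A_{\text{restricted}} = 0,
\qquad
\hat c_{\text{restricted}} = \bar y .
\]
```

So with the bound A <= 0 the restricted fit is the horizontal line through the mean of Y, that is, a slope of exactly zero rather than a meaningfully negative one.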

Re: Force a regression coefficient to be negative

PaigeMiller — Fri, 10 Apr 2015 14:35:32 GMT

wutao9999 wrote:

One way I am thinking is that: If the results of A is positive, I remove one point that is most influential to cause A being positive. After removing the point, then do regression again.

I have been thinking about this. Once you fit the regression and get a positive slope, the regression diagnostics that SAS produces might show that the most influential point is actually pulling the slope in the negative direction, so that without that data point the slope would be more positive (i.e. deleting this point would move the slope in the wrong direction). I think you'd have to specify that you would delete only the most influential point among those that make the slope more positive, so that without this point the slope would be negative, or closer to negative (i.e. moving in the desired direction).

It will eventually produce a result that has a negative slope, but I don't see how this method can be justified.
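The direction of each point's pull can be read from the sign of its DFBETAS for the slope. For reference, the usual definition, specialized to the one-predictor model (the subscript (i) means "fitted with observation i deleted"):

```latex
\[
\mathrm{DFBETAS}_{i}(A)
  = \frac{\hat A - \hat A_{(i)}}
         {s_{(i)}\,\sqrt{\bigl[(X^{\top}X)^{-1}\bigr]_{AA}}}
  = \frac{\hat A - \hat A_{(i)}}{s_{(i)} / \sqrt{S_{xx}}},
\qquad
S_{xx} = \sum_{j=1}^{n} (x_j - \bar x)^2 .
\]
% DFBETAS_i(A) > 0 : deleting observation i lowers the slope
%                    (the point is pulling the slope upward).
% DFBETAS_i(A) < 0 : deleting observation i raises the slope.
```

So the rule described above amounts to deleting the observation with the largest positive slope DFBETAS, not the one with the largest absolute value.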

Re: Force a regression coefficient to be negative

wutao9999 — Fri, 10 Apr 2015 14:53:30 GMT

Thanks a lot for all the useful comments.

I do agree that it is dangerous to remove points without scientific grounds. I would like to propose a new approach and see whether we can agree on it...

First, I want to emphasize that, for the problem we have, the coefficient A must be negative; otherwise it doesn't make business sense at all. As we all know, the reason the data still yield a positive coefficient A is that some unknown factor is affecting the data.

Second, points like {(low X value, low Y value), (high X value, high Y value)} are usually the most influential ones in making A positive. For our problem, these points do not follow business common sense, because a high X value should go with a low Y value. They exist in the data because some other unknown factor has a major impact. I am wondering whether, instead of removing these points, we should add a new event (indicator) variable for each of them to account for the unknown factor. By adding event variables to the model one at a time, starting with the most influential points, we can eventually get a negative coefficient A. Does anyone know any existing methods for doing this kind of work? Thanks a lot.
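One thing worth writing down before comparing the two ideas: for least squares, giving a single observation its own event variable is algebraically the same as deleting that observation. A short sketch, where d_j is a hypothetical indicator equal to 1 for the flagged observation i and 0 otherwise:

```latex
\[
y_j = c + A x_j + \gamma\, d_j + e_j,
\qquad
d_j = \begin{cases} 1, & j = i \\ 0, & j \neq i \end{cases}
\]
% For any (c, A), the best gamma sets the i-th residual to zero:
\[
\hat\gamma(c, A) = y_i - c - A x_i
\quad\Longrightarrow\quad
\min_{\gamma} \sum_{j=1}^{n} (y_j - c - A x_j - \gamma d_j)^2
  = \sum_{j \neq i} (y_j - c - A x_j)^2 .
\]
% The right-hand side is the sum of squares with observation i removed,
% so the estimates of c and A equal those from the deleted-point fit.
```

The estimates of c and A therefore come out the same either way; the event variables only change how the removal is described.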

Re: Force a regression coefficient to be negative

PaigeMiller — Fri, 10 Apr 2015 15:10:51 GMT

I'm sure people could think of dozens of algorithms to take this data and obtain a negative slope.

I propose a very simple algorithm, for which the slope will always be negative.

data find_slope;

slope= -2.3;

run;

Doesn't get any simpler than that!

My point is that I don't know how you can justify any of these algorithms that are designed to take data with a positive slope and turn it into a negative slope estimate. I doubt you could get such a thing published, and I think your colleagues would scoff as well. Just because you can do a certain calculation doesn't mean you should do that calculation.

Now, if there is an unknown factor affecting the data, then perhaps you ought to spend your efforts trying to identify what that factor is and then mathematically or theoretically remove it from the data. In fact, if there is such a factor affecting the slope, you would be wise to search for it and account for it, as it is overwhelming the signal you think you are going to find. Or, as I said way back in my first post on this thread "If it were me, I'd see if there were substantive reasons to remove certain pieces of data (e.g. recording errors, or the process by which the data was generated was compromised, etc.)"

But of all of the algorithms discussed so far, none of which I think can be justified, I like mine the best. Simple and effective!

Re: Force a regression coefficient to be negative

wutao9999 — Fri, 10 Apr 2015 15:24:37 GMT

PaigeMiller: There is a paradox here. Apparently we know there are abnormalities in the data causing the positive coefficient A, and we all know we should find out why they are happening. But for God's sake, no one has ever been able to find out.

If we do nothing about the data, then we get a positive A. Yes, I agree with you, we should just present a positive A to the business. Then the response will be: THE ANALYSIS IS NONSENSE. NO ONE WOULD EVER BELIEVE IT CAN BE POSITIVE.

I guess our conclusion is that: We had NOTHING.

Re: Force a regression coefficient to be negative

PaigeMiller — Fri, 10 Apr 2015 15:28:36 GMT

Yes, I agree with that conclusion, the data cannot be used to estimate the slope.

Re: Force a regression coefficient to be negative

lvm — Fri, 10 Apr 2015 16:11:37 GMT

wutao9999,

You are in very dangerous territory here, and most of your ideas on parameter estimation cannot be supported. They would never get past peer review. There are some formal things you can do, but the specifics will depend on your system. Your surprising parameter estimate (slope) could be due to one or more highly influential observations (outliers or high leverage values), or to the presence of other unmeasured variables that are (highly) correlated with x or the error (residual) of the model. For the latter, there are various things you can do if you have other measured variables. Use of instrumental variables is one possibility. I don't think you have other variables.

For the former, there are things you can do to formally look at influential observations. I told you about this in my previous post. You look at, for instance, DFBETAS for each observation to see the influence of that observation on the parameter estimate. I like to look at Cook's distance also. You might be able to justify dropping an excessively influential observation if you can identify something wrong with it. Dropping it because you don't like the result is not a valid reason.

You can also switch to robust regression methods to discount the influence of extreme observations (extreme in terms of x and/or y). This is often most beneficial with small data sets, but can be used for any. I demonstrate this with an example below (just run the SAS code). I made up a very small data set that would have a negative slope except for one (really extreme) observation. The REG output clearly shows the influence of this value on the slope and intercept estimates (you don't need fancy methods for this extreme example, but it is a demonstration of DFBETAS, Cook's distance, and so on). The slope estimate is positive because of this observation. If you use ROBUSTREG (even with default estimation), that extreme observation, although still in the data set, does not have much influence on the results; the estimated slope is now negative.

As an alternative to the above approaches, you can take a Bayesian approach. If you are certain that the slope must be negative based on past work, then you could place a highly informative prior on the slope. In my example below, I used GENMOD and chose a normal prior for the slope and intercept. For the informative normal prior on the slope, 95% of the distribution mass is (roughly) between -1.5 and -0.5 (in my arbitrary example). As you can see if you run my code, the mean of the posterior for the slope is about -1 with this informative prior, but the posterior mean slope is a positive 0.3 with a noninformative prior (also in my example below). But you should realize, the Bayesian estimate of the slope (with informative prior) is not the slope that best fits the observations. Rather, the posterior distribution combines prior knowledge and the new information from the current data set. With a highly informative prior, the mean of the posterior may not give a good fit to the current data. Interpretation is different.
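To make the "roughly between -1.5 and -0.5" statement concrete, and to show how a prior and the data combine, here is a small worked sketch. The first line uses only the prior mean (-1) and variance (0.0625) from the example. The second is the textbook normal result under the simplifying assumptions that the error variance sigma^2 is known and the intercept prior is flat, which is not exactly what GENMOD does; the symbols m, v, b and S_xx are notation introduced here.

```latex
% Prior on the slope: mean m = -1, variance v = 0.0625, so sd = 0.25
\[
m \pm 1.96\sqrt{v} = -1 \pm 1.96 \times 0.25 = (-1.49,\; -0.51).
\]
% Posterior mean of the slope: precision-weighted average of the
% least squares slope b and the prior mean m
\[
\mathrm{E}[A \mid \text{data}]
  = \frac{\dfrac{S_{xx}}{\sigma^2}\, b + \dfrac{1}{v}\, m}
         {\dfrac{S_{xx}}{\sigma^2} + \dfrac{1}{v}},
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2 .
\]
```

With a small data set and a large residual variance, the data precision S_xx / sigma^2 is small compared with 1/v = 16, which is why the posterior mean stays close to the prior mean.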

I only touched on Bayesian approaches, just to give you an idea. I come back to my original point: your ideas for this problem cannot be supported.

data a;

input x y;

datalines;

0 50

1 45

2 47

3 45

4 44

5 40

6 38

7 41

8 80

9 35

;

run;

*see the big influence of the one observation on the coefficients (dfbetas);

*this gives positive (but not significant here) slope estimate;

proc reg data=a plots=(CooksD RStudentByLeverage DFFITS DFBETAS);

model y = x / influence;

run;

*robust regression automatically diminishes the influence of an outlier;

*now the estimated slope is negative;

proc robustreg data=a;

model y = x;

run;

*do a MLE for the same linear model;

proc genmod data=a;

model y = x;

run;

*now do a Bayesian analysis (with noninformative normal priors);

*note that mean of posterior distribution for x (the slope) is positive;

*(but with credible interval for slope covering 0);

proc genmod data=a;

model y = x;

bayes coefprior=normal; *noninformative normal priors (i.e., very large variances);

run;

*Informative prior for slope (most of the prior mass is between -1.5 and -0.5, centered on -1);

*Non-informative prior for intercept (large variance);

data normprior;

input _type_ $ x intercept; *<--use x for the slope and intercept for intercept;

datalines;

Var 0.0625 100000

Mean -1.0 0

;

run;

*with highly informative prior for slope, one gets negative mean for the posterior for slope;

proc genmod data=a;

model y = x;

bayes coefprior=normal(input=normprior);

run;

Re: Force a regression coefficient to be negative

wutao9999 — Fri, 10 Apr 2015 16:28:35 GMT

Very helpful. Appreciate your comments...

Re: Force a regression coefficient to be negative

wutao9999 — Fri, 10 Apr 2015 16:29:35 GMT

I like your comments. Direct and helpful..

Re: Force a regression coefficient to be negative

lvm — Fri, 10 Apr 2015 20:41:22 GMT

Please note: my Bayesian example uses a very strong prior distribution. One would need a very strong argument to justify such a prior. I used it to demonstrate the concept.