wutao9999
Obsidian | Level 7

I have a regression problem using variable X to predict Y. That is, Y = c + A*X + error.

For this regression to be meaningful, coefficient A must be negative. However, because of unknown noise or unknown factors, the regression sometimes produces a positive estimate of A. I am struggling to find a statistical way to force coefficient A to be negative. Do you know any way to do this?

One way I am thinking of is this: if the estimate of A is positive, remove the one point that is most influential in making A positive, then run the regression again. Doing this iteratively, after removing a couple of data points, the estimate of A becomes negative. Is there any statistical method in the research literature that supports removing a few data points this way to force a regression coefficient into a preferred range (such as negative)? I appreciate your answers.

PaigeMiller
Diamond | Level 26

There are mathematical techniques that can impose such a restriction on a linear regression; you can even read about some of them under "linear least squares" at Wikipedia. You'd have to program this in PROC IML, or search to see whether someone else has already done it.
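
For instance, a least squares fit with the slope bounded above by zero could be written in PROC IML along these lines. This is only a sketch I haven't run, assuming a data set mydata with variables x and y; NLPQN is IML's quasi-Newton optimizer, and the constraint matrix is what bounds the slope:

proc iml;
   use mydata;
   read all var {x} into x;
   read all var {y} into y;
   close mydata;
   X1 = j(nrow(y), 1, 1) || x;      /* design matrix: intercept and x */
   start ssq(b) global(X1, y);
      r = y - X1*b`;                /* residuals for candidate (c, A) */
      return(r[##]);                /* sum of squared residuals */
   finish;
   b0  = {0 -1};                    /* starting values for (c, A) */
   con = {.  .,                     /* row 1: lower bounds (none) */
          .  0};                    /* row 2: upper bounds, slope <= 0 */
   opt = {0 0};                     /* minimize, suppress iteration output */
   call nlpqn(rc, best, "ssq", b0, opt, con);
   print best[colname={"Intercept" "Slope"}];
quit;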

Your proposed method could also be programmed either via PROC IML or via a macro loop.
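
For what it's worth, the mechanics of your deletion scheme could be sketched as a macro loop like the one below. It is untested and only illustrates the bookkeeping: it assumes one response and one regressor, uses Cook's D as the influence measure, and the macro name and work data sets are mine. See my reservations in the next paragraph before using anything like it.

%macro drop_influential(data=, y=, x=, maxiter=10);
   %local i slope;
   data _work;
      set &data;
   run;
   %do i = 1 %to &maxiter;
      /* fit the line; OUTEST= keeps the slope, OUTPUT keeps Cook's D */
      proc reg data=_work outest=_est noprint;
         model &y = &x;
         output out=_diag cookd=_cookd;
      run; quit;
      data _null_;
         set _est;
         call symputx('slope', &x);  /* OUTEST names the slope after the regressor */
      run;
      %if %sysevalf(&slope < 0) %then %goto done;
      /* slope still positive: drop the single most influential point */
      proc sort data=_diag;
         by descending _cookd;
      run;
      data _work;
         set _diag(firstobs=2);      /* delete the largest Cook's D */
         drop _cookd;
      run;
   %end;
%done:
   %put NOTE: slope = &slope after %eval(&i - 1) deletion(s).;
%mend drop_influential;

It would be invoked as, say, %drop_influential(data=mydata, y=y, x=x).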

I am somewhat skeptical about the validity of either approach. One of the reasons we use data to estimate things is to see whether the data are consistent with the underlying theory, and if you get a positive slope, that is telling you something! If it were me, I'd see whether there were substantive reasons to remove certain pieces of data (e.g., recording errors, or the process by which the data were generated was compromised) and, if there were no obvious reasons to remove certain pieces of data, then you've got a real problem ... the data do not support the theory. In that case, I think it's pointless to force the slope to be negative; I would not feel that such an estimate has any validity.

--
Paige Miller
JacobSimonsen
Barite | Level 11

I agree that it may not be wise to put such a bound on the parameters. Also, if an estimate happens to land on the bound, the p-values will not be valid.

Nevertheless, if you still insist on doing this, it is easy to put restrictions on the parameters in NLMIXED, which can handle most regression models. I found this example in the SAS documentation where they use a bound on the parameters: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_nlmixed_exam...
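
For the simple linear model in this thread, the restriction could be sketched like this (a minimal sketch, assuming a data set mydata with variables x and y; the BOUNDS statement is what imposes the restriction):

proc nlmixed data=mydata;
   parms c=0 A=-0.1 s2=1;       /* starting values */
   bounds A <= 0;               /* the restriction: slope must be non-positive */
   mu = c + A*x;                /* mean of the simple linear model */
   model y ~ normal(mu, s2);    /* s2 is the residual variance */
run;

As said above, if the estimate lands on the bound, the reported standard error and p-value for A should not be trusted.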

Jacob

wutao9999
Obsidian | Level 7

Can least squares linear regression be done in NLMIXED? Thank you.

PaigeMiller
Diamond | Level 26

As far as I know, NLMIXED produces maximum likelihood or restricted maximum likelihood estimates, not ordinary least squares estimates. There may be some situations where the OLS and ML or REML estimates are guaranteed to be the same.

--
Paige Miller
JacobSimonsen
Barite | Level 11

Hi PaigeMiller,

If I remember right, if you consider your data to be normally distributed, then the maximum likelihood estimate will also be the estimate obtained by ordinary least squares. Therefore I think NLMIXED can solve the problem as well.
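
To illustrate (again only a sketch, assuming a data set mydata with variables x and y), the unbounded normal-likelihood fit in NLMIXED should reproduce the PROC REG coefficients:

proc reg data=mydata;
   model y = x;
run; quit;

proc nlmixed data=mydata;
   parms c=0 A=0 s2=1;              /* starting values */
   model y ~ normal(c + A*x, s2);   /* normal likelihood, variance s2 */
run;

The estimates of c and A should agree between the two fits; only the variance estimate differs, because maximum likelihood divides the residual sum of squares by n rather than n - 2.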

Good luck. Jacob

lvm
Rhodochrosite | Level 12

I would definitely not force the coefficient to be negative in the overall estimation; that would give meaningless results. However, you can determine the influence of each observation on the estimated parameters (and on the estimated fit). In PROC REG, use:

model y = x / influence;

This will give all kinds of information for each observation, including the DFBETAS statistics (the scaled change in each coefficient when that observation is deleted). You can read about this in the REG chapter of the documentation.
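
A complete call might look like this (assuming your data set is named mydata; the ODS OUTPUT statement is an optional extra that saves the diagnostics to a data set):

proc reg data=mydata;
   model y = x / influence;            /* per-observation influence diagnostics */
   ods output OutputStatistics=infl;   /* capture the diagnostics table */
run; quit;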

PaigeMiller
Diamond | Level 26

JacobSimonsen wrote:

If I remember right, if you consider your data to be normally distributed, then the maximum likelihood estimate will also be the estimate obtained by ordinary least squares.

It's the errors, not the data itself, that have to be normally distributed.

--
Paige Miller
PaigeMiller
Diamond | Level 26

JacobSimonsen wrote:

Nevertheless, if you still insist on doing this, it is easy to put restrictions on the parameters in NLMIXED, which can handle most regression models. I found this example in the SAS documentation where they use a bound on the parameters: http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_nlmixed_exam...

I have been thinking about this. If you have data that, without restrictions, gives a positive slope, and you restrict the estimation method to produce a negative slope, wouldn't you get something negative but extremely close to zero? (I admit I haven't tried this on actual data.) And if that is what happens, then again I ask: what is the validity of such an estimate, when the data contradict or do not support the established theory that the slope must be negative?

A similar situation: isn't this what happens in variance components estimation? When the Type 1 estimate would be negative, but you use REML or ML and restrict the variance component to be non-negative, you usually wind up with a zero estimate.
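
If you want to see that analogy directly, a comparison along these lines would show it (a sketch I haven't run, assuming a data set mydata with a response y and a grouping variable batch, both placeholder names):

proc varcomp data=mydata method=type1;
   class batch;
   model y = batch;   /* method-of-moments estimate; can come out negative */
run;

proc varcomp data=mydata method=reml;
   class batch;
   model y = batch;   /* REML estimate, constrained to be non-negative */
run;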

--
Paige Miller
wutao9999
Obsidian | Level 7

Thanks a lot for all the useful comments.

I do agree that it is dangerous to remove points without a scientific basis. I would like to propose a new way to see if we agree on it...

First, I want to emphasize that, for our problem, coefficient A must be negative; otherwise the result makes no business sense at all. As we all know, the reason the data still yield a positive coefficient A is that some unknown factor is affecting the data.

Second, points like {(low X value, low Y value), (high X value, high Y value)} are generally the most influential ones in making A positive. For our problem, such points defy business common sense, because a high X value should go with a low Y value. They nevertheless exist in the data because some other unknown factor has a major impact there. I am wondering whether, instead of removing these points, we should add a new event variable for each of them to account for the unknown factor. By adding an event variable to the model for the most influential points, one at a time, we can eventually get a negative coefficient A. Does anyone know of existing methods for doing this kind of work? Thanks a lot.
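
Mechanically, I imagine the event variable working like this (a rough sketch, assuming a data set mydata with variables x and y, and supposing observation 17 has been flagged as influential; the observation number and the name event17 are just for illustration):

data augmented;
   set mydata;
   event17 = (_n_ = 17);   /* 1 for the flagged observation, 0 otherwise */
run;

proc reg data=augmented;
   model y = x event17;    /* mean-shift dummy for the flagged point */
run; quit;

One thing to note: a dummy that is 1 for a single observation forces that observation's residual to zero, so fitting it is numerically equivalent to deleting the point, and the resulting slope is the same as in the deletion scheme.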

PaigeMiller
Diamond | Level 26

I'm sure people could think of dozens of algorithms to take this data and obtain a negative slope.

I propose a very simple algorithm, for which the slope will always be negative.

data find_slope;
   slope = -2.3;
run;

Doesn't get any simpler than that!

My point is that I don't know how you can justify any of these algorithms that are designed to take data with a positive slope and turn it into a negative slope estimate. I doubt you could get such a thing published, and I think your colleagues would scoff as well. Just because you can do a certain calculation doesn't mean you should do that calculation.

Now, if there is an unknown factor affecting the data, then perhaps you ought to spend your efforts trying to identify what that factor is and then mathematically or theoretically remove it from the data. In fact, if there is such a factor affecting the slope, you would be wise to search for it and account for it, as it is overwhelming the signal you think you are going to find. Or, as I said way back in my first post on this thread: "If it were me, I'd see whether there were substantive reasons to remove certain pieces of data (e.g., recording errors, or the process by which the data were generated was compromised)."

But of all the algorithms discussed so far, none of which I think can be justified, I like mine the best. Simple and effective!

--
Paige Miller
wutao9999
Obsidian | Level 7

PaigeMiller: There is a paradox here. We apparently know there are abnormalities in the data causing the positive coefficient A, and we all know we should find out why they happen. But, for God's sake, no one has ever been able to find the reason.

If we do nothing about the data, then we get a positive A. Yes, I agree with you, we should just present the positive A to the business. Then you will get the response: THE ANALYSIS IS A NUISANCE. NO ONE WOULD EVER BELIEVE IT CAN BE POSITIVE.

I guess our conclusion is: we had NOTHING.

PaigeMiller
Diamond | Level 26

Yes, I agree with that conclusion: the data cannot be used to estimate the slope.

--
Paige Miller
wutao9999
Obsidian | Level 7

I like your comments. Direct and helpful.

lvm
Rhodochrosite | Level 12

Please note: my Bayesian example uses a very strong prior distribution. One would need a very strong argument to justify such a prior. I used it to demonstrate the concept.
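
For readers curious about the shape of such a fit, one way to encode a prior that rules out positive slopes is sketched below in PROC MCMC. This is only an illustration of the concept, not the example referred to above; it assumes a data set mydata with variables x and y, and the prior variances are arbitrary illustration values:

proc mcmc data=mydata nmc=20000 seed=1234 outpost=post;
   parms c 0 A -1 s2 1;                      /* starting values */
   prior c  ~ normal(0, var=1e6);            /* vague prior on the intercept */
   prior A  ~ normal(0, var=1e6, upper=0);   /* truncated at zero: A forced negative */
   prior s2 ~ igamma(0.01, scale=0.01);      /* vague prior on the variance */
   mu = c + A*x;
   model y ~ normal(mu, var=s2);
run;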
