lvm
Rhodochrosite | Level 12

wutao9999,

You are in very dangerous territory here, and most of your ideas on parameter estimation cannot be supported; they would never get past peer review. There are some formal things you can do, but the specifics will depend on your system. Your surprising parameter estimate (slope) could be due to one or more highly influential observations (outliers or high-leverage values), or to the presence of other unmeasured variables that are (highly) correlated with x or with the error (residual) of the model. For the latter, there are various things you can do if you have the other measured variables; the use of instrumental variables is one possibility. I don't think you have other variables, though.
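
(Purely for illustration: if you did have a valid instrument, say a variable z correlated with x but not with the model error, a two-stage least squares fit in SAS/ETS could be sketched as below; the data set mydata and the instrument z are hypothetical names, not something from your problem.)

proc syslin data=mydata 2sls;   /* two-stage least squares */
   endogenous x;                /* x is treated as endogenous */
   instruments z;               /* z is the hypothetical instrument */
   model y = x;
run;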

For the former, there are things you can do to formally look at influential observations. I told you about this in my previous post. You look at, for instance, DFBETAS for each observation to see the influence of that observation on the parameter estimate; I like to look at Cook's distance as well. You might be able to justify dropping an excessively influential observation if you can identify something wrong with it; dropping it because you don't like the result is not a valid reason. You can also switch to robust regression methods to discount the influence of extreme observations (extreme in terms of x and/or y). This is often most beneficial with small data sets, but it can be used with any. I demonstrate this with an example below (just run the SAS code). I made up a very small data set that would have a negative slope except for one (really extreme) observation. The REG output clearly shows the influence of this value on the slope and intercept estimates (you don't need fancy methods for this extreme example, but it is a demonstration of DFBETAS, Cook's distance, and so on). The slope estimate is positive because of this observation. If you use ROBUSTREG (even with the default estimation), that extreme observation (although still in the data set) does not have much influence on the results; the estimated slope is now negative.

As an alternative to the above approaches, you can take a Bayesian approach. If you are certain that the slope must be negative based on past work, then you could place a highly informative prior on the slope. In my example below, I used GENMOD and chose a normal prior for the slope and intercept. For the informative normal prior on the slope, 95% of the distribution's mass is (roughly) between -1.5 and -0.5 in my arbitrary example: with a prior mean of -1 and a prior variance of 0.0625 (standard deviation 0.25), the central 95% interval is -1 ± 1.96 × 0.25 ≈ (-1.49, -0.51). As you can see if you run my code, the mean of the posterior for the slope is about -1 with this informative prior, but the posterior mean slope is a positive 0.3 with a noninformative prior (also in my example below). But you should realize that the Bayesian estimate of the slope (with the informative prior) is not the slope that best fits the observations. Rather, the posterior distribution combines prior knowledge with the new information from the current data set. With a highly informative prior, the mean of the posterior may not give a good fit to the current data. The interpretation is different.

I only touched on Bayesian approaches, just to give you an idea. I come back to my original point: your ideas for this problem cannot be supported.

data a;
   input x y;
   datalines;
0    50
1    45
2    47
3    45
4    44
5    40
6    38
7    41
8    80
9    35
;
run;

*see the big influence of the one observation on the coefficients (DFBETAS);
*this gives a positive (but not significant here) slope estimate;
proc reg data=a plots=(CooksD RStudentByLeverage DFFITS DFBETAS);
   model y = x / influence;
run;

*robust regression automatically diminishes the influence of an outlier;
*now the estimated slope is negative;
proc robustreg data=a;
   model y = x;
run;

*fit the same linear model by maximum likelihood;
proc genmod data=a;
   model y = x;
run;

*now do a Bayesian analysis (with noninformative normal priors);
*note that the mean of the posterior distribution for x (the slope) is positive;
*(but with the credible interval for the slope covering 0);
proc genmod data=a;
   model y = x;
   bayes coefprior=normal;    *noninformative normal priors (i.e., very large variances);
run;

*informative normal prior for the slope: mean -1, variance 0.0625, so 95% of the prior mass is roughly between -1.5 and -0.5;
*noninformative prior for the intercept (large variance);
data normprior;
   input _type_ $ x intercept;    *<--use x for the slope and intercept for the intercept;
   datalines;
Var  0.0625    100000
Mean -1.0    0
;
run;

*with the highly informative prior for the slope, one gets a negative mean for the posterior of the slope;
proc genmod data=a;
   model y = x;
   bayes coefprior=normal(input=normprior);
run;

wutao9999
Obsidian | Level 7

Very helpful.  Appreciate your comments...

stat_sas
Ammonite | Level 13

If you are not concerned about the interpretation, just multiply X by -1 and run the regression. This will give you a negative A.
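
(A minimal sketch of that sign flip, using the toy data set a from lvm's example above; negating x merely changes the sign of the slope estimate, while the fit itself is unchanged.)

data a2;
   set a;
   negx = -x;   /* flip the sign of the predictor */
run;

proc reg data=a2;
   model y = negx;   /* slope estimate is the negative of the original; same fit */
run;
quit;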

PaigeMiller
Diamond | Level 26

stat_sas wrote:

If you are not concerned about the interpretation, just multiply X by -1 and run the regression. This will give you a negative A.

well, um ...

How would you ever justify this method to a reviewer or colleague?

--
Paige Miller
stat_sas
Ammonite | Level 13

As I said, "if not concerned about the interpretation," then we can do it this way. In modeling, we sometimes have to make adjustments in customer scoring to present the results to the client in a more attractive way, so we may introduce some lift into the parameter estimates that are important to the business.

PaigeMiller
Diamond | Level 26

stat_sas wrote:

Paige Miller - As I said, "if not concerned about the interpretation," then we can do it this way. In modeling, we sometimes have to make adjustments in customer scoring to present the results to the client in a more attractive way, so we may introduce some lift into the parameter estimates that are important to the business.

I can't see a possible justification for multiplying by -1, even in the situation where you are not concerned about the interpretation. In fact, I think it's a bad idea.

--
Paige Miller
lvm
Rhodochrosite | Level 12

I think the comment was meant to be a joke.

PaigeMiller
Diamond | Level 26

wutao9999 wrote:

One way I am thinking of is this: if the estimate of A is positive, I remove the one point that is most influential in making A positive. After removing that point, I run the regression again.

I have been thinking about this. Once you fit the regression and get a positive slope, the regression diagnostics that SAS produces might show that the most influential point is actually pulling the slope in the negative direction, so that without that data point the slope would be even more positive (i.e., deleting this point would move the slope in the wrong direction). I think you'd have to specify that you would delete only the most influential point that is making the slope more positive, such that without this point the slope would be negative, or closer to negative (i.e., moving in the desired direction).
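
(For concreteness only, not an endorsement: a sketch of those mechanics using the toy data set a from above, flagging the single most influential observation by Cook's distance, dropping it, and refitting. As noted, you would still need DFBETAS to confirm the point is pulling the slope in the positive direction.)

proc reg data=a noprint;
   model y = x;
   output out=diag cookd=cd;   /* Cook's distance for each observation */
run;
quit;

proc sort data=diag;
   by descending cd;
run;

data trimmed;
   set diag;
   if _n_ > 1;   /* drop the observation with the largest Cook's distance */
run;

proc reg data=trimmed;
   model y = x;
run;
quit;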

It will eventually produce a result that has a negative slope, but I don't see how this method can be justified.

--
Paige Miller
ballardw
Super User

I think it is time to show a PLOT of the data.
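
(For instance, something like this quick scatter plot with the least-squares fit overlaid; shown here with the toy data set a from above, but the point is to do it with the real data.)

proc sgplot data=a;
   scatter x=x y=y;   /* the raw data */
   reg x=x y=y;       /* overlay the least-squares fit */
run;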

SteveDenham
Jade | Level 19

A thousand huzzahs to your comment! Any time someone starts doing regression without looking at plots of the data, they are almost assuredly going to end up in a really, really bad place.

Now if I could just figure out how to visualize that thirteen-dimensional plot...

Steve Denham

djmangen
Obsidian | Level 7

Yes please, let's see a bivariate plot.

I'd also like to ask whether this is really a simple two-variable regression model. I know that the OP stated it that way, but what if that was done to simplify the presentation of the problem to the community? Specifically, I'm wondering whether we've got a multivariate case with severe multicollinearity.
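
(If there are in fact several predictors, a quick collinearity check might look like the sketch below; the data set mydata and the predictors x1-x3 are hypothetical names.)

proc reg data=mydata;
   model y = x1 x2 x3 / vif collin;   /* variance inflation factors and collinearity diagnostics */
run;
quit;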
