wutao9999,
you are in very dangerous territory here, and most of your ideas on parameter estimation cannot be supported. This would never get past peer review. There are some formal things you can do, but the specifics will depend on your system. Your surprising parameter estimate (slope) could be due to one or more highly influential observations (outliers or high-leverage values), or to the presence of other unmeasured variables that are (highly) correlated with x or with the error (residual) of the model. For the latter, there are various things you can do if you have those other variables measured. Use of instrumental variables is one possibility, but I don't think you have other variables.
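For completeness, an instrumental-variables fit could be sketched with two-stage least squares in PROC SYSLIN (SAS/ETS). This is only an illustration under an assumption the OP's data do not satisfy: it assumes a hypothetical instrument z that is correlated with x but uncorrelated with the model error. No such variable exists in the data set above, so this code is a template, not something you can run on data set a as given.

```sas
*sketch only: two-stage least squares with a hypothetical instrument z;
*z must be correlated with x but not with the model error (not available here);
proc syslin data=a 2sls;
   endogenous x;      *x is treated as endogenous (correlated with the error);
   instruments z;     *z is the (hypothetical) instrument;
   model y = x;
run;
```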
For the former, there are things you can do to formally assess influential observations. I told you about this in my previous post. You look at, for instance, DFBETAS for each observation to see the influence of that observation on the parameter estimates. I like to look at Cook's distance also. You might be able to justify dropping an excessively influential observation if you can identify something wrong with it; dropping it because you don't like the result is not a valid reason. You can also switch to robust regression methods to discount the influence of extreme observations (extreme in terms of x and/or y). This is often most beneficial with small data sets, but can be used with any. I demonstrate this with an example below (just run the SAS code). I made up a very small data set that would have a negative slope except for one (really extreme) observation. The REG output clearly shows the influence of this value on the slope and intercept estimates (you don't need fancy methods for this extreme example, but it is a demonstration of DFBETAS, Cook's distance, and so on). The slope estimate is positive because of this one observation. If you use ROBUSTREG (even with the default estimation method), that extreme observation (although still in the data set) does not have much influence on the results; the estimated slope is now negative.
As an alternative to the above approaches, you can take a Bayesian approach. If you are certain that the slope must be negative based on past work, then you could place a highly informative prior on the slope. In my example below, I used GENMOD and chose a normal prior for the slope and intercept. For the informative normal prior on the slope (mean -1, variance 0.0625, i.e., standard deviation 0.25), roughly 95% of the prior mass is between -1.5 and -0.5 (in my arbitrary example). As you can see if you run my code, the mean of the posterior for the slope is about -1 with this informative prior, but the posterior mean slope is a positive 0.3 with a noninformative prior (also in my example below). But you should realize that the Bayesian estimate of the slope (with the informative prior) is not the slope that best fits the observations. Rather, the posterior distribution combines prior knowledge with the new information from the current data set. With a highly informative prior, the mean of the posterior may not give a good fit to the current data. The interpretation is different.
I only touched on Bayesian approaches, just to give you an idea. I come back to my original point: your ideas for this problem cannot be supported.
data a;
input x y;
datalines;
0 50
1 45
2 47
3 45
4 44
5 40
6 38
7 41
8 80
9 35
;
run;
*see the big influence of the one observation on the coefficients (dfbetas);
*this gives positive (but not significant here) slope estimate;
proc reg data=a plots=(CooksD RStudentByLeverage DFFITS DFBETAS);
model y = x / influence;
run;
*robust regression automatically diminishes the influence of an outlier;
*now the estimated slope is negative;
proc robustreg data=a;
model y = x;
run;
*do a MLE for the same linear model;
proc genmod data=a;
model y = x;
run;
*now do a Bayesian analysis (with noninformative normal priors);
*note that mean of posterior distribution for x (the slope) is positive;
*(but with credible interval for slope covering 0);
proc genmod data=a;
model y = x;
bayes coeffprior=normal; *noninformative normal priors (i.e., very large variances);
run;
*Informative prior for the slope: roughly 95% of the prior mass between -1.5 and -0.5, centered on -1;
*Non-informative prior for intercept (large variance);
data normprior;
input _type_ $ x intercept; *<--use x for the slope and intercept for intercept;
datalines;
Var 0.0625 100000
Mean -1.0 0
;
run;
*with a highly informative prior for the slope, one gets a negative mean for the posterior of the slope;
proc genmod data=a;
model y = x;
bayes coeffprior=normal(input=normprior);
run;
Very helpful. Appreciate your comments...
If not concerned about the interpretation, just multiply X by -1 and run regression. This will give you negative A.
stat@sas wrote:
If not concerned about the interpretation, just multiply X by -1 and run regression. This will give you negative A.
well, um ...
How would you ever justify this method to a reviewer or colleague?
As I said, "if not concerned about the interpretation," then we can do it this way. In modeling, we sometimes have to make adjustments in customer scoring to present results to the client in a more attractive way. So we may need to introduce some lift in the parameter estimates that are important to the business.
stat@sas wrote:
Paige Miller, as I said, "if not concerned about the interpretation," then we can do it this way. In modeling, we sometimes have to make adjustments in customer scoring to present results to the client in a more attractive way. So we may need to introduce some lift in the parameter estimates that are important to the business.
I can't see a possible justification for multiplying by -1, even in the situation where you are not concerned about the interpretation. In fact, I think it's a bad idea.
I think the comment was meant to be a joke.
wutao9999 wrote:
One way I am thinking is that: If the results of A is positive, I remove one point that is most influential to cause A being positive. After removing the point, then do regression again.
I have been thinking about this. Once you fit the regression and get a positive slope, the regression diagnostics that SAS produces might show that the most influential point is actually pulling the slope toward negative, so that without that data point the slope would be even more positive (i.e., deleting this point would move the slope in the wrong direction). I think you'd have to specify that you would delete only the most influential point that is making the slope more positive, such that without this point the slope would be negative, or closer to negative (i.e., moving in the desired direction).
It will eventually produce a result that has a negative slope, but I don't see how this method can be justified.
I think it is time to show a PLOT of the data.
Yes please, let's see a bivariate plot.
I'd also like to ask whether this is really a simple two-variable regression model. I know that the OP stated it that way, but what if that was done to simplify the presentation of the problem to the community? Specifically, I'm wondering whether we've got a multivariable case with severe multicollinearity.