☑ This topic is solved.
Dennisky
Quartz | Level 8

Dear all,

We ran a multivariable logistic regression on the same dataset in both SAS and SPSS. The odds ratios for all variables were consistent between the two packages.

However, for categorical variables with more than two levels, the p-values differ between SAS and SPSS. For continuous and binary variables, the p-values agree.

 

We are very confused about this situation. Your help would be greatly appreciated.

Thanks

11 REPLIES
Rick_SAS
SAS Super FREQ

Please show us your SAS code and explain which p-value you are looking at.

 

In general, a p-value is dependent on the distribution of a statistic under a null hypothesis. Sometimes p-values are approximate because the true sampling distribution of a statistic is unknown or is known only asymptotically for large samples.

 

Different software will obtain the same value only if the null hypothesis and the distributional assumptions are the same for both software. Clearly, that is not the case for this problem. But if you show the code, we can explain what H0, statistic, and distributional assumptions SAS is using.

Dennisky
Quartz | Level 8

Thank you so much!   

Thank you very much for the attention and guidance of every expert.

For example, xgra is a three-category variable, divided into categories 1, 2, and 3. And xindl is a binary variable, divided into categories 0 and 1.

They are the two variables in our logistic analysis.

Y is a dependent variable divided into two categories, 0 and 1.

This is our SAS code for the example data:

proc logistic descending data=s1;
  class xgra(ref="1") xindl(ref="0");
  model y=xgra xindl;
run;

 

We have uploaded Figure 1 (SAS output) and Figure 2 (SPSS output).

The red boxes mark the p-values for the three-category variable xgra.

Although the odds ratios are the same in both packages, the p-values are not, and they even lead to opposite conclusions (one is significant while the other is not).

 

 

[Attached images: figure1.png (SAS output), figure2.jpeg (SPSS output)]

Rick_SAS
SAS Super FREQ

Thank you for uploading the images and posting the PROC LOGISTIC statements. Your output shows that the parameter estimates from SAS and SPSS are different. Therefore, you should not be wondering why the p-values are different but why the coefficient estimates are different.

 

As StatDave says, the most likely difference is the coding (parameterization) for the categorical independent variables. PROC LOGISTIC uses effect coding by default. From the SPSS output, it looks like they are using "dummy encoding," which SAS calls GLM encoding.

Try modifying your CLASS statement to be

class xgra(ref="1") xindl(ref="0") / param=GLM;

and let us know whether that provides parameter estimates that match your SPSS output.
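
For reference, here is a minimal sketch of the complete step with that change, reusing the data set and variable names from the question (s1, y, xgra, xindl). The ODS OUTPUT statement and the data set names pe_glm and or_glm are additions for illustration only; they save the two tables so they can be compared with the SPSS output.

```sas
/* Sketch: refit with 0/1 (GLM) coding so that each xgra parameter
   compares one level with the reference level 1 */
ods output ParameterEstimates=pe_glm OddsRatios=or_glm;  /* illustrative names */
proc logistic descending data=s1;
  class xgra(ref="1") xindl(ref="0") / param=glm;
  model y = xgra xindl;
run;

/* Print the saved tables for a side-by-side check against SPSS */
proc print data=pe_glm; run;
proc print data=or_glm; run;
```

PARAM=REF should give the same estimates for the non-reference levels; the difference is that PARAM=GLM also lists the reference level, with its parameter set to zero.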

Dennisky
Quartz | Level 8

Thanks!

We attempted to modify our CLASS statement as you and StatDave had mentioned.

We have successfully resolved the issue. It was done perfectly.

 

We still have one question.

Why does the SAS output show that the odds ratio for xgra (2 vs 1: OR 2.684, 95% CI 1.845-3.904) is significant, while the p-value for xgra=2 (p=0.2675) is not? The two SAS results seem to lead to opposite conclusions (see Figure 1).

Rick_SAS
SAS Super FREQ

The p-values are for different null hypotheses.

 

In the Parameter Estimates table, the p-value is for the null hypothesis beta=0. For example, in the image you posted, the estimate for the coefficient of (xgra=2) in the model is -0.2, which is not significantly different from 0 because the standard error is approximately 0.2.

 

In the Odds Ratio table, the null hypothesis is ratio=1. For example, in the image you posted, the estimate for the ratio of (xgra 2 vs 1) is 2.7 and a 95% CI is [1.8, 3.9]. Because this interval does not include 1, we infer that the related ratio parameter is significantly different from 1.
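
The two results can be put on the same scale with a little arithmetic. This sketch assumes the interval in the Odds Ratio table is a Wald interval (symmetric on the log scale) and uses the values quoted in this thread: OR 2.684 with 95% CI (1.845, 3.904).

```latex
\begin{align*}
\log(2.684) &\approx 0.987 \\
\widehat{\mathrm{SE}} &\approx \frac{\log(3.904) - \log(1.845)}{2 \times 1.96}
  \approx \frac{1.362 - 0.612}{3.92} \approx 0.191 \\
z &\approx \frac{0.987}{0.191} \approx 5.2
\end{align*}
```

So the 2 vs 1 log odds ratio is about five standard errors away from 0, whereas the xgra=2 coefficient in the Parameter Estimates table (about -0.2 with standard error about 0.2) is only about one standard error away from 0. They are different quantities, so the two tests need not agree.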

StatDave
SAS Super FREQ

The hypotheses differ, as Rick says, but specifically because of the coding difference that I mentioned before. Without the PARAM=GLM option, the effects coding that is used causes the parameter to estimate the difference between the XGRA=2 level and the average of all the XGRA levels. It does NOT compare XGRA=2 to the XGRA=1, which is what the odds ratio estimate does. Hence the hypothesis difference and also the fact that the parameter estimates differ. If you use the PARAM=GLM option, the parameter estimates the 2 vs 1 difference and should then agree with the conclusion based on the odds ratio. All this is described in this note and you should probably also read this related note that explains how the coding difference also causes the odds ratio to not equal the exponentiated parameter estimate.
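
A short derivation makes this concrete. It assumes xgra has three levels with level 1 as the reference, and writes \eta_k for the log odds at level k with the other predictors held fixed. The symbols \alpha, \beta_k and \gamma_k are notation introduced here, not names from the SAS output.

```latex
% Effects coding (PROC LOGISTIC default), reference level 1
\begin{align*}
\eta_2 &= \alpha + \beta_2, \qquad
\eta_3 = \alpha + \beta_3, \qquad
\eta_1 = \alpha - \beta_2 - \beta_3 \\
\tfrac{1}{3}(\eta_1 + \eta_2 + \eta_3) &= \alpha
  \quad\Longrightarrow\quad
  \beta_2 = \eta_2 - \tfrac{1}{3}(\eta_1 + \eta_2 + \eta_3) \\
\log \mathrm{OR}_{2\,\mathrm{vs}\,1} &= \eta_2 - \eta_1 = 2\beta_2 + \beta_3
\end{align*}

% Reference (0/1) coding, reference level 1
\begin{align*}
\eta_1 &= \alpha', \qquad \eta_2 = \alpha' + \gamma_2, \qquad \eta_3 = \alpha' + \gamma_3 \\
\log \mathrm{OR}_{2\,\mathrm{vs}\,1} &= \eta_2 - \eta_1 = \gamma_2
\end{align*}
```

Under effects coding, the test of \beta_2 = 0 asks whether level 2 differs from the average of the three levels, and the odds ratio is \exp(2\beta_2 + \beta_3) rather than \exp(\beta_2). Under 0/1 coding, the test of \gamma_2 = 0 is the same hypothesis as OR = 1.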

Dennisky
Quartz | Level 8
Thank you for your suggestion, which enables us to have a deeper understanding of this principle.
StatDave
SAS Super FREQ

This is likely due to a difference in the parameterization (coding) of the design ("dummy") variables that represent a categorical variable when fitting the model. If you are using PROC LOGISTIC, and the variables that show the difference are specified in the CLASS statement, and if you did not specify the PARAM= option in the CLASS statement, then PROC LOGISTIC uses effects coding to create the design variables. If the other software uses the typical 0,1 coding for the design variables that it creates, then you should specify the PARAM=GLM option in the CLASS statement in order to have PROC LOGISTIC use the same coding. For example:  class x1 x2 / param=glm;
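
If the default effects coding is kept, the 2 vs 1 comparison can still be tested directly with a CONTRAST statement. The sketch below reuses the names from the question and assumes the design columns for xgra are ordered level 2, then level 3, with level 1 as the reference. Under effects coding the log odds for level 2 minus the log odds for level 1 equals 2*beta2 + beta3, hence the coefficients 2 and 1. The E option prints the contrast coefficients so that the ordering assumption can be checked.

```sas
proc logistic descending data=s1;
  /* param=effect is the default; it is written out here for clarity */
  class xgra(ref="1") xindl(ref="0") / param=effect;
  model y = xgra xindl;
  /* log odds(xgra=2) - log odds(xgra=1) = 2*beta2 + beta3 under effects coding */
  contrast 'xgra 2 vs 1' xgra 2 1 / estimate=exp e;
run;
```

The exponentiated contrast estimate should match the xgra 2 vs 1 row of the Odds Ratio table, and its p-value tests the same hypothesis as that odds ratio.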

Dennisky
Quartz | Level 8

Thank you for your suggestions, which perfectly solved my problem. Each of your answers sparkles with the light of wisdom and brings me a lot of inspiration.

ballardw
Super User

The question I would ask is how often an observed difference has a practical impact on analysis.

Even running the same version of the same release of software on different computers can result in different results due to differences in versions of math co-processors (when they were different chips) or just the main processor.

 

Between different software the algorithms chosen to implement a specific calculation can result in different results because of limits of precision in internal storage. Plus you have the whole "decimal values often cannot be stored exactly with binary" issue. So you get some amount of rounding differences that can accumulate.
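
As a small illustration of the binary storage point, the DATA step below (a sketch, assuming a platform that stores numerics as IEEE double precision) adds 0.1 and 0.2 and compares the sum with 0.3. The variable names are arbitrary.

```sas
data _null_;
  x = 0.1 + 0.2;
  diff = x - 0.3;   /* nonzero when the sum is not stored as exactly 0.3 */
  if x = 0.3 then put 'x equals 0.3 exactly';
  else put 'x differs from 0.3 by ' diff e12.;
run;
```

Differences of this size are far too small to turn a non-significant p-value into a significant one, which is why discrepancies that large point to a modeling difference rather than to rounding.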

 

I had a reason to compare SUDAAN, another software package used for statistics with complex weighting of data, with SAS. I could detect differences in the confidence limits between SUDAAN and SAS output, but the differences usually were detectable at the 0.001 position in percentages. In the data that I was using, that meant that at most the limits, when projected onto the population of interest, might vary by almost 0.3 persons (yes, three-tenths of a person), which we deemed as not having a practical impact on the decisions that would be made using the results.

 

Or consider something like house pricing that typically runs with values recently well over $100,000. Would a difference in analysis of pricing that varied by $0.57 (57 cents) make much difference in a practical sense on the analysis?

pink_poodle
Barite | Level 11
@Dennisky, how different are your p-values for SAS and SPSS? Could you please give an example?
