I have noticed that the Cochran-Armitage Test for Trend gives different results when an interval variable is grouped by proc format compared to recoding. For example:
/* Define a format that groups EngineSize into three ranges ("-<" excludes the upper bound) */
proc format;
    value EngineSizeFmt
        low -< 3 = "0-3"
        3 -< 5   = "3-4"
        5 -< 9   = "5-8"
    ;
run;

/* Version 1: group EngineSize through the format */
proc freq data=sashelp.cars;
    where origin in ("Europe" "USA");
    tables EngineSize*origin / trend chisq nopercent norow;
    format EngineSize EngineSizeFmt.;
run;

/* Version 2: group EngineSize by recoding it into a new numeric variable */
data work.cars;
    set sashelp.cars;
    if EngineSize lt 3 then esize = 1;
    if EngineSize ge 3 and EngineSize lt 5 then esize = 2;
    if EngineSize ge 5 then esize = 3;
run;

proc freq data=work.cars;
    where origin in ("Europe" "USA");
    tables ESize*origin / trend chisq nopercent norow;
run;
The frequencies and chi-square test are the same, but the trend test is different. Any thoughts? Which should be used?
Building off the comments of @ballardw
Here is the output after the table that I get from your two PROC FREQ calls:
You'll notice that the output for the Mantel-Haenszel Chi-Square is different too. Both the MH and CA test for trend use scores in their calculation.
According to the documentation for the SCORES= option on the TABLES statement, you can specify the SCOROUT option on the TABLES statement to display the scores used:
SAS Help Center: TABLES Statement
I used the SCOROUT option and get these scores for your two PROC FREQ calls. That the scores are different is why the MH and CA tests are providing different output across the two calls:
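For anyone wanting to reproduce this, here is a minimal sketch of the two calls from the question with SCOROUT added. The format and the recoding step are copied from the question so the block runs on its own; nothing else is assumed. As I read the PROC FREQ documentation, a formatted numeric variable is scored with the smallest internal value that falls into each formatted level, which would explain why the formatted table does not get scores 1, 2, 3, but please check that against the score listing you get.

```sas
/* Format and recoding step copied from the question */
proc format;
    value EngineSizeFmt
        low -< 3 = "0-3"
        3 -< 5   = "3-4"
        5 -< 9   = "5-8"
    ;
run;

data work.cars;
    set sashelp.cars;
    if EngineSize lt 3 then esize = 1;
    if EngineSize ge 3 and EngineSize lt 5 then esize = 2;
    if EngineSize ge 5 then esize = 3;
run;

/* Formatted version: SCOROUT lists the row and column scores used */
proc freq data=sashelp.cars;
    where origin in ("Europe" "USA");
    tables EngineSize*origin / trend chisq scorout;
    format EngineSize EngineSizeFmt.;
run;

/* Recoded version: the row scores here should be 1, 2, 3 */
proc freq data=work.cars;
    where origin in ("Europe" "USA");
    tables esize*origin / trend chisq scorout;
run;
```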
That is because the order of "Europe" and "USA" is reversed.
For the first PROC FREQ:
For the second PROC FREQ:
Therefore, you could get the reverse result.
Thanks, Ksharp, for answering. But both frequency tables look the same: the first column is Europe and the second is USA. I am probably missing something.
I suspect it may have something to do with this detail from the Cochran-Armitage details:
For character variables, the table scores for the row variable are the row numbers
(for example, 1 for the first row, 2 for the second row, and so on). For numeric variables,
the table score for each row is the numeric value of the row level.
The "formatted" value is character.
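One way to probe that idea, as a sketch rather than a definitive check: store the formatted label in a real character variable with PUT and rerun the trend test. The names esize_c and work.cars_chr are made up for illustration. By the passage quoted above, a character row variable is scored by row number, so this table would be expected to get scores 1, 2, 3, the same as the recoded esize. If the formatted EngineSize table still differs from it, that suggests the formatted numeric variable is not being scored as character.

```sas
/* Format copied from the question */
proc format;
    value EngineSizeFmt
        low -< 3 = "0-3"
        3 -< 5   = "3-4"
        5 -< 9   = "5-8"
    ;
run;

/* esize_c and work.cars_chr are hypothetical names for this sketch */
data work.cars_chr;
    set sashelp.cars;
    /* PUT returns the formatted label as a true character value */
    esize_c = put(EngineSize, EngineSizeFmt.);
run;

/* Character row variable: SCOROUT shows which row scores are used */
proc freq data=work.cars_chr;
    where origin in ("Europe" "USA");
    tables esize_c*origin / trend scorout;
run;
```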
Since some of the cells are a bit small (the largest size group has 29 of the 270 observations), using rank scores instead of the default table scores may be appropriate, and it does yield the same result for both tables. That is why I think there may be some oddity in how the scores are calculated for the formatted values.
proc freq data=work.cars;
    where origin in ("Europe" "USA");
    tables ESize*origin EngineSize*origin / trend scores=rank;
    format EngineSize EngineSizeFmt.;
run;
which without the frequency tables generates:
Statistics for Table of esize by Origin

| Cochran-Armitage Trend Test (Rank Scores) | |
|---|---|
| Statistic (Z) | -2.0200 |
| One-sided Pr < Z | 0.0217 |
| Two-sided Pr > \|Z\| | 0.0434 |
and
Statistics for Table of EngineSize by Origin

| Cochran-Armitage Trend Test (Rank Scores) | |
|---|---|
| Statistic (Z) | -2.0200 |
| One-sided Pr < Z | 0.0217 |
| Two-sided Pr > \|Z\| | 0.0434 |
Why group a continuous variable at all? Especially if you are concerned about trend ...
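One way to follow that suggestion, as a sketch: run the trend test on EngineSize with no format, so each distinct engine size is its own row and, per the documentation quoted above, its numeric value is the table score. NOPRINT on the TABLES statement is my addition; it suppresses the long crosstab while still displaying the requested statistics.

```sas
/* Ungrouped trend test: no FORMAT statement, so the actual
   EngineSize values serve as the row scores */
proc freq data=sashelp.cars;
    where origin in ("Europe" "USA");
    tables EngineSize*origin / trend noprint;
run;
```

With no grouping step there is no choice between a format and a recode, so the discrepancy discussed in this thread does not arise.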
Thanks to @JackieJ_SAS and @ballardw for explaining the difference. Good to know that the CA and MH tests can differ depending on how a continuous variable is grouped.