Hello,
I'm trying to figure out what test would be appropriate to compare two independent samples for count data:
data test;
input year city1 city2;
datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;
I don't have any denominators for the cities by year (for instance, the frequencies in city1 and city2 represent the number of traffic accidents in each year). If I want to test whether the number of traffic accidents differs between the cities, what test can I use?
Thanks a lot in advance!
Count data are typically modeled using the Poisson or negative binomial distribution. Such models are easily fit in procedures like GENMOD, GLIMMIX, and HPGENSELECT. For example, the following fits a model using the negative binomial distribution, which accommodates overdispersion in the data.
data test;
input year city1 city2;
y=city1; city=1; output;
y=city2; city=2; output;
datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;
proc genmod;
class city;
model y=city / dist=negbin;
run;
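For comparison, here is a minimal sketch of the Poisson alternative mentioned above. It assumes the long-format TEST data set created by the preceding DATA step (one row per city and year, with variables y and city). The TYPE3 option is an addition for illustration; it requests a likelihood ratio test for the city effect. If the data are overdispersed, the Poisson standard errors will tend to be too small, which is the reason for preferring the negative binomial fit.

```sas
/* Poisson fit to the same long-format data; assumes data set TEST from the step above */
proc genmod data=test;
class city;
/* TYPE3 adds a likelihood ratio test for the city effect */
model y=city / dist=poisson type3;
run;
```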
1. First, standardize for the population, the number of drivers, or the number of cars in each city.
2. Then look at PROC FREQ with either a chi-square test or a Cochran-Armitage test.
If you don't want to account for year, sum the counts and use the chi-square test.
If you want to account for year, use the Cochran-Armitage test.
https://documentation.sas.com/doc/en/statug/15.2/statug_freq_details76.htm
Standardization is important here. If I compare a city of 1 million to a city of 5 million, the accident counts should not be expected to be the same.
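As a rough sketch of the two PROC FREQ options above, assuming the raw counts are used as they stand because no denominators are available (so this compares totals, not rates). The data set name LONG and the variable COUNT are hypothetical names chosen for illustration.

```sas
/* Reshape to one row per city and year; LONG and COUNT are hypothetical names */
data long;
input year city1 city2;
city=1; count=city1; output;
city=2; count=city2; output;
keep year city count;
datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;

/* Ignoring year: one-way chi-square test that the two city totals are equal */
proc freq data=long;
weight count;
tables city / chisq;
run;

/* Accounting for year: Cochran-Armitage trend test on the city-by-year table */
proc freq data=long;
weight count;
tables city*year / trend;
run;
```

The WEIGHT statement tells PROC FREQ that each row carries a count instead of representing a single observation.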
Then go find the spatial area of each city. That is likely constant over time, so there are just two values to look up. Otherwise, you're comparing apples and oranges.
@Recep wrote:
Hi Reeza,
Thanks a lot for your response, but as I mentioned in my question, I do not have any denominator information. The example I provided was fictitious. You can assume that, instead of the number of accidents, those are the number of meteorites that fell from the sky onto each city, and I want to know whether more meteorites fell on one city than on the other.
Cheers....
You could try the Kolmogorov-Smirnov (K-S) test.
data test;
input year city1 city2;
datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;
data have;
set test;
city='city1'; count=city1; output;
city='city2'; count=city2; output;
keep city count;
run;
proc npar1way data=have plots=edfplot edf;
class city;
var count;
run;
But your case is special because it has a YEAR variable.
You could try a paired t test. The procedure includes graphical output to help you assess whether the data might satisfy the assumptions of the test:
ods graphics on;
proc ttest data=test;
paired city1*city2;
run;
@Ksharp : Thanks for your criticism. I am aware of the assumptions of the three procedures that were suggested. In many cases, count data are well-approximated by a normal distribution, but you are certainly entitled to your opinion. If there were more data, we could debate the issue, but a debate seems pointless when the OP's data contains 5 observations. For the posted data, I doubt it matters which method is used.
Thanks a lot Dave! Then I'm assuming that the p-value (0.5548 in this example) tells whether the two cities are statistically significantly different from each other (or, more technically, in this example we have no reason to reject the null hypothesis that there is no difference between the two cities).
Yes, that's correct.