Solved: Re: Comparing two independent samples for count data

Recep · Posted 09-14-2021 07:57 PM

Hello,

I'm trying to figure out what test would be appropriate to compare two independent samples for count data:

data test;
input year city1 city2;
datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;

I don't have any denominators for the cities by year (for instance, the frequencies in city1 and city2 represent the number of traffic accidents in each year). If I want to test whether the number of traffic accidents differs between the cities, what test can I use?

Thanks a lot in advance!

Reeza · Posted 09-14-2021 08:12 PM

1. First, standardize by the population, number of drivers, or number of cars in each city.

2. Then look at PROC FREQ with either a chi-square test or a Cochran-Armitage test.

If you don't want to account for year, sum the counts and use the chi-square test.

If you want to account for year, use the Cochran-Armitage test.

https://documentation.sas.com/doc/en/statug/15.2/statug_freq_details76.htm
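As a rough sketch of the "sum them up" option, the step below reshapes the posted data to one row per city and year, then lets PROC FREQ total the counts through the WEIGHT statement. For a one-way table, the CHISQ option tests the null hypothesis of equal proportions. The data set name `long` and the variable `count` are illustrative names, and the test is only meaningful if the two cities have comparable exposure, since no denominators are used.

```sas
/* Reshape the posted data: one row per city per year */
data long;
   input year city1 city2;
   city='city1'; count=city1; output;
   city='city2'; count=city2; output;
   keep year city count;
   datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;

/* WEIGHT sums the yearly counts within each city;          */
/* CHISQ on a one-way table tests for equal proportions.    */
proc freq data=long;
   weight count;
   tables city / chisq;
run;
```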

Standardization is important here. If I compare a city of 1 million to a city of 5 million, the accident counts should not be expected to be the same.
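One way to standardize inside a count model, sketched here under loud assumptions, is to pass the log of the exposure as an offset so that the model compares rates rather than raw counts. The exposures of 1 million and 5 million are HYPOTHETICAL placeholders borrowed from the example above, not real figures for these cities, and `rates` and `log_exposure` are illustrative names.

```sas
data rates;
   input year city1 city2;
   /* HYPOTHETICAL exposures: replace with real population,   */
   /* driver, vehicle, or area figures for each city.         */
   y=city1; city=1; log_exposure=log(1000000); output;
   y=city2; city=2; log_exposure=log(5000000); output;
   datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;

/* OFFSET= puts log(exposure) into the linear predictor with  */
/* a fixed coefficient of 1, so CITY compares accident rates. */
proc genmod data=rates;
   class city;
   model y = city / dist=negbin offset=log_exposure;
run;
```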

Recep · Posted 09-14-2021 09:35 PM

Hi Reeza,
Thanks a lot for your response, but as I mentioned in my question, I do not have any denominator information. The example I provided was fictitious. You can assume that, instead of accident counts, these are the numbers of meteorites that fell on each city, and I want to know whether more meteorites fell on one city than on the other.
Cheers....

Reeza · Posted 09-14-2021 10:51 PM

Then look up the spatial area of each city. That’s likely constant over time, so there are just two values to find. Otherwise, you’re comparing apples and oranges.

Ksharp · Posted 09-15-2021 08:58 AM

You could try the Kolmogorov-Smirnov (K-S) test.

data test;
input year city1 city2;
datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;
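/* Reshape to one row per city and year */
data have;
   set test;
   city='city1'; count=city1; output;
   city='city2'; count=city2; output;
   keep city count;
run;

/* EDF requests the empirical distribution function tests, including K-S */
proc npar1way data=have plots=edfplot edf;
   class city;
   var count;
run;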

But your case is special because it has a YEAR variable.

Maybe @Rick_SAS or @StatDave has a good idea.

Rick_SAS · Posted 09-15-2021 09:17 AM

You could try a paired t test. The procedure includes graphical output to help you assess whether the data might satisfy the assumptions of the test:

ods graphics on;
proc ttest data=test;
   paired city1*city2;
run;

Ksharp · Posted 09-15-2021 10:07 AM

Rick,
I like your idea. But the t test is a parametric method, NOT a nonparametric method like the K-S test.
PROC TTEST is usually suited to NORMAL data, not count data, I think!

Rick_SAS · Posted 09-15-2021 10:35 AM

@Ksharp : Thanks for your criticism. I am aware of the assumptions of the three procedures that were suggested. In many cases, count data are well-approximated by a normal distribution, but you are certainly entitled to your opinion. If there were more data, we could debate the issue, but a debate seems pointless when the OP's data contains 5 observations. For the posted data, I doubt it matters which method is used.

StatDave · Posted 09-15-2021 09:38 AM

Count data are typically modeled using the Poisson or negative binomial distribution. Such models are easily fit in procedures like GENMOD, GLIMMIX, and HPGENSELECT. For example, the following fits a model using the negative binomial distribution which accommodates overdispersion in the data.

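/* Reshape to one row per city and year while reading the data */
data test;
   input year city1 city2;
   y=city1; city=1; output;
   y=city2; city=2; output;
   datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;

/* Negative binomial model comparing the mean count between cities */
proc genmod data=test;
   class city;
   model y=city / dist=negbin;
run;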
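To put the comparison on an interpretable scale, one possible extension (a sketch, not part of the original answer) is to add an LSMEANS statement. With the default log link for the negative binomial distribution, the exponentiated difference between the city means can be read as a ratio of expected counts, reported with confidence limits.

```sas
data test;
   input year city1 city2;
   y=city1; city=1; output;
   y=city2; city=2; output;
   datalines;
2016 220 130
2017 140 180
2018 120 202
2019 140 134
2020 135 166
;

proc genmod data=test;
   class city;
   model y = city / dist=negbin;
   /* DIFF: difference between cities on the log scale          */
   /* EXP:  exponentiate it, giving a ratio of expected counts  */
   /* CL:   confidence limits for the estimates                 */
   lsmeans city / diff exp cl;
run;
```

A ratio whose confidence limits include 1 is consistent with no difference between the cities.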

Recep · Posted 09-15-2021 01:10 PM

Thanks a lot, Dave! So I assume the p-value (0.5548 in this example) tells us whether the two cities are statistically significantly different from each other (or, more technically, that in this example we have no reason to reject the null hypothesis of no difference between the two cities).

Reeza · Posted 09-15-2021 01:43 PM

If this is for homework, go with that. If this is for decision making, then what I said earlier still applies: you cannot compare the raw numbers.

StatDave · Posted 09-15-2021 02:04 PM

Yes, that's correct.