If I have a dataset like below
School  Group  activity_1_rate  activity_2_rate
1       1      50%              60%
2       1      67%              23%
3       1      64%              60%
4       2      50%              30%
5       2      50%              60%
6       2      60%              60%
I want to compare the means of the two groups on activity_1_rate and activity_2_rate, separately. What kind of test should I use? I thought I might be able to use a t-test, because I don't think the chi-square test would work in this case, but I have some doubts. First, the percentages might not be normally distributed. Second, the percentages might have different denominators: for example, 50% might result from 5/10, while 40% might result from 40/100. In this case, can I still use the t-test? Or are there other tests I can use to compare the two groups?
Could anyone help me with it? Thank you so much!
Well, if you know both the numerator and denominator for each estimate, and since the numerators are all > 40, then you could use:
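proc genmod data=yourdata;
  class group;
  /* events/trials syntax: each school contributes its own counts */
  model numerator1/denominator1 = group / dist=binomial type3;
  /* ilink reports the group estimates on the proportion scale */
  lsmeans group / diff ilink;
run;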
Where numerator1 and denominator1 are the values for activity_1, and would be replaced by numerator2 and denominator2 for activity_2. Since it seems from your communications that the denominator would be the same for both activities because it is the enrollment at the school, you could simplify a little bit.
SteveDenham
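If only the rates and the enrollments are on file, the counts for the events/trials model above could be rebuilt along these lines. This is a minimal sketch, not a tested program: `enrollment` is a hypothetical variable holding each school's total number of students, and the rates are assumed to be stored as numeric proportions (0.50 rather than 50%).

```sas
/* Sketch: rebuild counts from proportions.                         */
/* Assumptions: one row per school; enrollment is a hypothetical    */
/* variable; activity_1_rate and activity_2_rate are numeric 0-1.   */
data school_counts;
  set yourdata;
  denominator1 = enrollment;
  numerator1   = round(activity_1_rate * enrollment);
  denominator2 = enrollment;
  numerator2   = round(activity_2_rate * enrollment);
run;
```

Rounding recovers the original count only if the stored rate was not heavily rounded itself, so the true counts are preferable whenever they are available.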
Do you know the number of data points in each group and each school?
There are two groups in total, and each group has 10 schools, so 20 schools in total.
And I think this is how they compute activity_1_rate and activity_2_rate:
activity_1_rate=(ac_1_a_rate+ac_1_b_rate+ac_1_c_rate+ac_1_d_rate)/4, and I think they have a sample size for each of ac_1_a_rate, ac_1_b_rate, ac_1_c_rate, and ac_1_d_rate. (Some sample sizes might be missing.)
The activity_2_rate is computed the same way.
If you mean the schools, each group has 10 schools, so 20 schools in total. I understand that power might be a problem, but it would be nice if you could tell me a method to compare the means that takes both the percentages and the sample sizes (those used to compute the activity rates) into consideration. Do you have any ideas?
The number of schools is irrelevant; you need to know the numerator and denominator of those rates.
If you have an N of 10 versus an N of 6000, the answers differ. In general, if your sample sizes are similar and large, you're fine with the t-test.
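To see why N matters, the standard error of a single proportion p estimated from n observations is sqrt(p*(1-p)/n). The short data step below is only an illustration using the two sample sizes mentioned above and an assumed p of 0.5; it writes the standard errors to the log (roughly 0.158 for n = 10 and 0.0065 for n = 6000).

```sas
/* Illustration: the same 50% is far less precise with n=10 than with n=6000 */
data _null_;
  p = 0.5;                        /* assumed proportion, for illustration only */
  do n = 10, 6000;
    se = sqrt(p * (1 - p) / n);   /* standard error of a proportion */
    put n= se=;
  end;
run;
```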
I think I know the denominators of those rates, although one denominator is missing. But the number of observations differs by school, varying from 80 to 400, I think. If I can use the t-test, how could I reflect the denominators of these rates?
OK, if I use the raw data, do you mean I should only use the N instead of the percentage and compare the N for two groups using the T-test?
I think you'd need to explain your experimental design and hypothesis for us to recommend a methodology.
OK, so I have two groups of schools, and each school hosts two activities. The activities have been running for a year. The activity_1_rate is the number of students who chose to participate in activity 1 divided by the total number of students. Now, I want to test whether the mean activity_1_rate differs between the two groups of schools. That's pretty much the whole study; I am not sure whether I have described it clearly. So it's either a chi-square test or a t-test, since I only want to compare the means of the two groups. But the problem is that each school has a different number of students, ranging from 80 to 400. So if I only compare the percentages, the result might not be valid. Is there a way to take both the percentage and the number of students into consideration?
@SAS-questioner wrote:
OK, so I have two groups of schools, and each school hosts two activities. The activities have been running for a year. The activity_1_rate is the number of students who chose to participate in activity 1 divided by the total number of students. Now, I want to test whether the mean activity_1_rate differs between the two groups of schools. That's pretty much the whole study; I am not sure whether I have described it clearly. So it's either a chi-square test or a t-test, since I only want to compare the means of the two groups. But the problem is that each school has a different number of students, ranging from 80 to 400. So if I only compare the percentages, the result might not be valid. Is there a way to take both the percentage and the number of students into consideration?
Given your experimental design, I don't think that a t-test or chi-square test is appropriate here. You may want to look into @SteveDenham's suggestion. If you were comparing Activity 1 to Activity 2 within each school (adjusting for multiple testing), then a t-test would be appropriate, but that isn't what you have here.
@SAS-questioner wrote:
I think I know the denominators of those rates, although one denominator is missing. But the number of observations differs by school, varying from 80 to 400, I think. If I can use the t-test, how could I reflect the denominators of these rates?
We need the number of observations! (Probably that is the number of students)
Not the number of schools. Although perhaps a superior analysis would be to take into account the effect of each school ...
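One way to take the effect of each school into account, offered only as a sketch and not as the method anyone in this thread settled on, is a binomial model with a random intercept per school in PROC GLIMMIX. The data set `school_counts` and the variables `school`, `numerator1`, and `denominator1` are assumed names for one row per school holding the participant count and the enrollment.

```sas
/* Sketch: logistic model on school-level counts with a random school effect. */
/* school_counts, school, numerator1 and denominator1 are assumed names.      */
proc glimmix data=school_counts method=laplace;
  class group school;
  model numerator1/denominator1 = group / dist=binomial link=logit;
  random intercept / subject=school;   /* school-to-school variation */
  lsmeans group / diff ilink;          /* ilink reports proportions  */
run;
```

With only a handful of schools per group the variance of the school effect is estimated imprecisely, so the results should be read with caution.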
OK, if I use the raw data, do you mean I should only use the N instead of the percentage and compare the N for two groups using the T-test?
You need both N and the percent.
If the raw data is binary for each student, you can use that as well.
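If student-level binary data are available, both N and the count of participants per school can be derived from them. A minimal sketch, assuming a hypothetical data set `students` with one row per student and a 0/1 variable `participated1` for activity 1:

```sas
/* Sketch: collapse student-level 0/1 data to one row per school.  */
/* students and participated1 are hypothetical names.              */
proc means data=students noprint nway;
  class school group;
  var participated1;
  output out=school_counts (drop=_type_ _freq_)
         sum=numerator1       /* number of participants            */
         n=denominator1;      /* number of students with a value   */
run;
```

The resulting numerator1 and denominator1 can then be used with the events/trials syntax in a binomial model.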
If the interest here is to compare the two groups, AND you have reason to believe that the percentages are valid estimates for some population (like repeatedly measuring the given schools), you might want to consider using PROC GENMOD to analyze your data. After converting percentages to proportions by dividing by 100, this could give you a first shot at an analysis:
proc genmod data=yourdata;
  class group;
  /* response is the proportion (percentage divided by 100) */
  model activity_1_rate = group / dist=binomial type3;
  lsmeans group / diff ilink;
run;
A separate analysis would be done for activity_2_rate.
@StatDave would likely also recommend following this up with the %NLmeans macro to correctly compare the means on the original scale.
Pay attention to @Reeza's comment about power: an N of 3 schools per group is only going to detect large differences in the dependent variable.
SteveDenham
Thank you for your suggestion, but will this method take the sample size into consideration? As I mentioned above, the activity_1_rate is computed as the number of students who participated in activity 1 divided by the total number of students. The number of students is different for each school, ranging from 80 to 500. If I only compare the percentages without taking the sample size into consideration, will the result be invalid? If I use this method, should I use the numerator (the number of students who participated in activity 1) as a weight? Or should I replace the percentage with the numerator and use a t-test?