
Comparison of two empirical distributions via KS

alexgonzalez — Mon, 05 Oct 2020 18:22:38 GMT

Hello,

I’m testing whether two empirical distributions are identical. I have a group of people with two observations each for the variable ‘EKC’: one ‘before’ and one ‘after’ an intervention. I’m using the K-S test to compare the two distributions. The observations come from a national survey, so each individual has a survey weight (the variable weight) used to produce national estimates. See the code below:

ods graphics on;

proc npar1way data = dat edf;
  freq weight;
  class time;
  var ekc;
  ods output KS2Stats=ks;
run;

Notice in the table below that both distributions are very similar (almost identical).

Percentiles	Before	After	Differences	% change
100% Max	3289.1	3279.5	-9.6	0.3
99%	2443.7	2436.6	-7.1	0.3
95%	2180.5	2173.5	-6.9	0.3
90%	2047.3	2040.8	-6.4	0.3
75% Q3	1838.8	1832.6	-6.2	0.3
50% Median	1623.3	1617.6	-5.6	0.3
25% Q1	1427.8	1422.7	-5.2	0.4
10%	1271.8	1266.9	-4.9	0.4
5%	1187.6	1182.7	-4.9	0.4
1%	1029.4	1024.9	-4.6	0.4
0% Min	642.3	638.4	-3.9	0.6

Nevertheless, the K-S test for the comparison of the two samples suggests that the distributions are different (Pr > KSa < .0001).

I’m not sure how the fact that the two empirical distributions are not independent (they come from the same group of individuals before and after an intervention) affects the test. If it does, can you please suggest a valid alternative test?

Thanks a lot,

A.G.

Re: Comparison of two empirical distributions via KS

Reeza — Mon, 05 Oct 2020 19:27:50 GMT

P-values measure statistical significance, not practical significance. The difference there is measured and is statistically significant, but perhaps a 0.3% decrease is not what you were looking for? In this case the distribution has shifted, so it is different.
And remember, if you have a large N, small differences are easier to pick up and more likely to be statistically significant even if they're not practically significant.

Re: Comparison of two empirical distributions via KS

PGStats — Mon, 05 Oct 2020 21:55:05 GMT

The apparent extreme sensitivity of the KS test here is due to the use of the FREQ statement. FREQ specifies a frequency, not a weight. When you say "x=10, freq=100", the procedure treats that as 100 independent measurements at 10, not as a single measurement with a sampling weight of 100. SAS does not provide a weighted KS test (if such a thing exists).

Properly weighted statistics are provided by the SURVEYxxxx procs.
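
To illustrate the FREQ point, here is a minimal sketch that reuses the dataset and variable names from the original post (dat, weight, time, ekc). The output dataset name replicated and the counter _copy are made up for the example. It physically copies each row once per unit of weight and then runs the same test with no FREQ statement. As far as I know, FREQ uses only the integer part of a non-integer value, hence the FLOOR call; treat that as an assumption to check against the documentation.

```sas
/* Sketch: expand each row into floor(weight) identical rows.        */
/* REPLICATED and _COPY are hypothetical names for this example.     */
data replicated;
  set dat;
  do _copy = 1 to floor(weight);
    output;            /* one physical copy per unit of weight */
  end;
  drop _copy;
run;

/* Same K-S request as the original code, but without FREQ.          */
proc npar1way data=replicated edf;
  class time;
  var ekc;
run;
```

If the two runs report the same K-S statistic and p-value, that shows the test is treating the sum of the weights as the sample size, which is why a tiny shift looks overwhelmingly significant.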

Re: Comparison of two empirical distributions via KS

alexgonzalez — Tue, 06 Oct 2020 11:02:06 GMT

You're right, Reeza, statistical and practical significance are not the same. Having said that, in this case I'm shocked that the p-value for the KS test is <0.0001 even though the two distributions almost perfectly overlap when plotted together. I wouldn't expect such a small p-value. I should probably stick with my approach of comparing distributions using % changes; it's far more meaningful to me in this case.
Thank you!

Re: Comparison of two empirical distributions via KS

alexgonzalez — Tue, 06 Oct 2020 11:10:28 GMT

When a statistical procedure is not available for weighted observations (i.e., there is no SURVEYxxxx proc for it), an alternative way to deal with that is to replicate observations in the dataset based on the survey weights.
I suspect there might be two issues here. As Reeza pointed out, the large sample size might be flagging statistical significance where there is none. Additionally, observations from the two groups are not independent. I'm not sure how sensitive KS is to this.
Thank you!

Re: Comparison of two empirical distributions via KS

Reeza — Tue, 06 Oct 2020 15:45:08 GMT

You cannot see it well visually, but the curve has shifted. If you graph the densities, you may see it more easily.
If it's pre-post measures, though, you usually analyze the difference in the scores and see if that's centered on 0.
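
A rough sketch of that difference-score approach with a SURVEY procedure follows. It assumes things not shown in the thread: a hypothetical respondent identifier id, time values that are literally before and after, and the same weight on both rows of a person. The dataset names dat_sorted and wide are also made up. Without STRATA and CLUSTER statements the variance estimate ignores the rest of the survey design, so add them if those design variables are available.

```sas
/* Sketch only: ID, DAT_SORTED and WIDE are hypothetical names.      */
proc sort data=dat out=dat_sorted;
  by id weight;
run;

/* Long to wide: one row per person with columns BEFORE and AFTER,   */
/* named after the values of TIME.                                   */
proc transpose data=dat_sorted out=wide(drop=_name_);
  by id weight;
  id time;
  var ekc;
run;

data wide;
  set wide;
  diff = after - before;   /* per-person change */
run;

/* Weighted mean of the change, its standard error and confidence    */
/* limits, and a t test of H0: population mean of DIFF = 0.          */
proc surveymeans data=wide mean stderr clm t;
  weight weight;
  var diff;
run;
```

Because each person contributes one difference, this also sidesteps the dependence between the before and after samples.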

Re: Comparison of two empirical distributions via KS

alexgonzalez — Tue, 06 Oct 2020 17:16:53 GMT

This is clearly one of those cases where there might be statistical significance but not practical significance. I ran an alternative analysis to compare the two means (paired comparison), and the difference turned out to be 'statistically' significant.

Re: Comparison of two empirical distributions via KS

Reeza — Tue, 06 Oct 2020 18:05:16 GMT

Then I'd argue your hypothesis is not well defined. Is it diff>0, or is it diff>x%, or diff>45 units?
Right now you're using the 'default' hypothesis of 0 but that doesn't have to be true...

Re: Comparison of two empirical distributions via KS

alexgonzalez — Tue, 06 Oct 2020 18:49:24 GMT

I'm not sure why you think my hypothesis is not well defined. I'm interested in the 'zero' difference. Maybe I was not clear enough in my previous message. The K-S test and the paired test for the difference in means are consistent, and both yield 'statistical significance'. Can you please clarify why you think so?
Thanks.

Re: Comparison of two empirical distributions via KS

Reeza — Tue, 06 Oct 2020 19:16:34 GMT

Your test is for a difference of 0, but you seem to want a difference of X% or Y raw value as a minimum, which is a different hypothesis. You can change your hypothesis to account for practical significance....
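
For the paired comparison of means, one way to sketch this is with the H0= and SIDES= options of PROC TTEST. The example assumes a hypothetical dataset wide with one row per person and variables before and after, and it borrows the 45-unit figure purely as a placeholder threshold. Note that this version is unweighted, so it ignores the survey design.

```sas
/* Sketch only: WIDE, BEFORE, AFTER and the 45-unit threshold are    */
/* assumptions for illustration.                                     */
/* H0: mean(after - before) = -45                                    */
/* H1: mean(after - before) < -45  (a decrease larger than 45 units) */
proc ttest data=wide h0=-45 sides=L;
  paired after*before;
run;
```

With a threshold like this, a small but precisely estimated shift no longer counts as evidence of a meaningful change.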

Re: Comparison of two empirical distributions via KS

alexgonzalez — Wed, 07 Oct 2020 11:33:10 GMT

Now I get what you mean, Reeza. That's exactly what I should do. Any advice on how to specify a difference other than zero for the K-S test to compare two distributions in the NPAR1WAY procedure?