topic Re: NPAR1WAY, SURVEYSELECT and Kolmogorov-Smirnov two-sample tests on weighted data in Statistical Procedures

NPAR1WAY, SURVEYSELECT and Kolmogorov-Smirnov two-sample tests on weighted data

ewolin — Sat, 23 Aug 2014 20:06:21 GMT

I need to compare two distributions using NPAR1WAY and two-sample K-S tests, but one of them is weighted. If I set FREQ to the weight I get the correct cumulative distribution for the weighted data, but NPAR1WAY calculates the p-value incorrectly. It treats the sum of the weights as the number of entries in the cumulative distribution, whereas the actual number of observations is much lower (thus the p-values are too low). Given the D-statistic, which I think SAS calculates correctly from the two cumulative distributions, I believe I can recalculate the p-value from the correct numbers of entries in the two distributions.
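The recalculation described above can be sketched as follows. This is a minimal Python sketch, not SAS code, and it assumes the standard asymptotic Kolmogorov approximation is acceptable and that the raw observation counts are the appropriate sample sizes. The function name `ks_asymptotic_pvalue` and its parameters are hypothetical.

```python
import math


def ks_asymptotic_pvalue(d, n1, n2, terms=100):
    """Asymptotic two-sample K-S p-value from D and the actual sample sizes.

    d      : the K-S D-statistic (max distance between the two CDFs)
    n1, n2 : actual numbers of observations, NOT the sums of the weights
    terms  : number of terms kept in the alternating series (assumption)
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    # Scaled statistic: D times the square root of the effective sample size.
    z = d * math.sqrt(n1 * n2 / (n1 + n2))
    # The series is unreliable near zero, where the p-value is essentially 1.
    if z < 0.2:
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    # Clamp to [0, 1] to guard against rounding error.
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    # Same D, but the second call uses an inflated count for one sample,
    # which drives the p-value down.
    print(ks_asymptotic_pvalue(0.136, 200, 200))
    print(ks_asymptotic_pvalue(0.136, 200, 20000))
```

Applying this to many pairs only requires the D value and the two observation counts for each pair.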

Is there a way to get NPAR1WAY to calculate the p-value correctly? The problem is that I have to do this for 400 different pairs of distributions!

Can I somehow use SURVEYSELECT to resample the weighted distribution to get an unweighted distribution having the original number of observations? E.g. if the unweighted data set has 10K entries, and the sum of the weights is 200M, can SURVEYSELECT produce a data set with 10K entries that reproduces the weighted sample cumulative distribution?
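The resampling idea in this question amounts to drawing observations with probability proportional to their weights, with replacement, until the original number of observations is reached. The following Python sketch illustrates only that idea. It is not SURVEYSELECT syntax, and the function name `resample_by_weight` is hypothetical. Note that a resampled data set carries extra sampling noise and repeated values, so a K-S p-value computed from it is at best approximate.

```python
import random


def resample_by_weight(values, weights, size=None, seed=None):
    """Draw `size` values with replacement, with probability proportional to weight.

    If `size` is omitted, the result has as many entries as the input,
    e.g. 10K resampled entries from 10K weighted observations.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    rng = random.Random(seed)  # local generator so the seed is reproducible
    k = len(values) if size is None else size
    return rng.choices(values, weights=weights, k=k)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    wts = [100, 300, 500, 100]  # illustrative weights, not real survey data
    print(resample_by_weight(data, wts, seed=42))
```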

Re: NPAR1WAY, SURVEYSELECT and Kolmogorov-Smirnov two-sample tests on weighted data

PGStats — Sun, 24 Aug 2014 18:00:52 GMT

Not sure a weighted K-S test exists. Is there a reference describing such a test? - PG

Re: NPAR1WAY, SURVEYSELECT and Kolmogorov-Smirnov two-sample tests on weighted data

ewolin — Sun, 24 Aug 2014 18:07:05 GMT

The weights are essentially predicted frequencies, and I use them as such. They are based on a full survey design, so using the weights as frequencies when plotting a variable should give a distribution close to what one would get by sampling the entire US civilian population with every observation having weight equal to one.

The K-S test should work fine. I just want to find a way to get SAS to calculate the p-value correctly. It gets the D-statistic correct, I believe.
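To illustrate why the D-statistic can be right even when the p-value is not: D depends only on the two cumulative distributions, and a weighted cumulative distribution depends only on the relative weights, not on their sum. The Python sketch below (hypothetical function name `weighted_ks_statistic`, weights treated as frequencies) computes D directly from weighted data.

```python
def weighted_ks_statistic(x1, w1, x2, w2):
    """Maximum absolute difference between two weighted empirical CDFs."""
    total1, total2 = float(sum(w1)), float(sum(w2))
    if total1 <= 0 or total2 <= 0:
        raise ValueError("each sample needs a positive total weight")
    # Pool both samples, tagging each point with the sample it came from.
    points = [(x, w, 0) for x, w in zip(x1, w1)]
    points += [(x, w, 1) for x, w in zip(x2, w2)]
    points.sort(key=lambda p: p[0])

    cum = [0.0, 0.0]  # running weight totals for sample 0 and sample 1
    d = 0.0
    i, n = 0, len(points)
    while i < n:
        x = points[i][0]
        # Consume all tied values before comparing the two CDFs.
        while i < n and points[i][0] == x:
            cum[points[i][2]] += points[i][1]
            i += 1
        d = max(d, abs(cum[0] / total1 - cum[1] / total2))
    return d


if __name__ == "__main__":
    # Scaling all weights in one sample by a constant leaves D unchanged.
    print(weighted_ks_statistic([1, 2], [3, 1], [1, 2], [1, 1]))
    print(weighted_ks_statistic([1, 2], [3000, 1000], [1, 2], [1, 1]))
```

Because rescaling the weights leaves D unchanged, the sample sizes enter only when D is converted to a p-value, which is where the sum of the weights gets substituted for the observation count.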