Hi, I spent some time searching on Google for how to compute the two-sample KS statistic manually. I then computed it in Excel and compared the result with PROC NPAR1WAY.
Below are the results for the same data:
manual KS : 0.059424551
npar1way : 0.066667 (D = 0.133333)
I can't work out why the two results differ.
The attached Excel file contains the raw data and the manual calculation (the formulas are in the cells).
In addition, the SAS tab contains the data as I use them in SAS to run PROC NPAR1WAY:
ods graphics on;
proc npar1way edf plots=edfplot data=final;
  class source;
  var application_score;
  output out=stat edf;
  /*exact ks;*/
run;
ods graphics off;
Accepted Solutions
Hi @Toni2,
@Toni2 wrote:
manual KS : 0.059424551
npar1way : 0.066667 (D = 0.133333)
I can't work out why the two results differ.
To get back to your original question: Your "manual KS" uses cumulative sums of scores (relative to overall totals of scores), whereas the values of the empirical distribution function (EDF) are proportions of cumulative numbers of scores, relative to the total number of scores. So, in the calculation you need to count scores (to find out how many of them are less than or equal to some value), not to add scores. The cumulative percentages that you can obtain with PROC FREQ are basically values of the EDF in percent.
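To make the distinction concrete, here is a minimal sketch in Python (not part of the original post; the five scores are made-up illustration values and the function names are my own). It contrasts the EDF, which counts scores, with a cumulative share of the score total, which is what adding scores produces.

```python
# Hypothetical scores, for illustration only (not the data from this thread).
scores = [10, 20, 20, 40, 110]

def edf(values, x):
    """Empirical distribution function: proportion of values <= x (a count)."""
    return sum(1 for v in values if v <= x) / len(values)

def cumulative_score_share(values, x):
    """Share of the score total held by values <= x (a sum, not an EDF)."""
    return sum(v for v in values if v <= x) / sum(values)

# Print both quantities at every distinct score value.
for x in sorted(set(scores)):
    print(x, edf(scores, x), cumulative_score_share(scores, x))
```

The two columns agree only at the largest score, where both reach 1. Everywhere else they differ, so a KS value built from cumulative sums cannot be expected to match one built from counts.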
Do you trust the rightmost column "Cumulative Percent" of a basic PROC FREQ output? If so, you can use PROC FREQ to compute the EDFs "manually" and then apply the formula of the KS statistic from the PROC NPAR1WAY documentation.
Here's how (using your dataset FINAL with variable application_score renamed to ascore for brevity):
/* Store number of scores from 'pop' and 'smp' in macro variables n1, n2 */
proc sql noprint;
  select count(ascore) into :n1 trimmed
    from final(where=(source='pop'));
  select count(ascore) into :n2 trimmed
    from final(where=(source='smp'));
quit;

/* Compute EDF (in %) for SOURCE='pop' */
proc freq data=final noprint;
  where source='pop';
  tables ascore / outcum out=edf1(keep=ascore cum_pct rename=(cum_pct=f1));
run;

/* Compute EDF (in %) for SOURCE='smp' */
proc freq data=final noprint;
  where source='smp';
  tables ascore / outcum out=edf2(keep=ascore cum_pct rename=(cum_pct=f2));
run;

/* Compute pooled EDF (in %) */
proc freq data=final noprint;
  tables ascore / outcum out=edf0(keep=ascore cum_pct rename=(cum_pct=f));
run;

/* Combine all three EDFs; _ is a constant dummy BY variable for the next step */
data edf_all;
  merge edf0-edf2;
  by ascore;
  _=.;
run;

/* Fill missing values: UPDATE with a constant BY variable
   carries the last nonmissing f1 and f2 forward */
data edf(drop=_);
  update edf_all(obs=0) edf_all;
  by _;
  output;
run;

/* Prepare computation of Kolmogorov-Smirnov statistic.
   SUM() ignores a still-missing f1 or f2 (scores below that group's
   minimum), which amounts to an EDF value of 0 there. */
data edfks;
  set edf;
  s=sqrt((&n1*(sum(f1,-f))**2+&n2*(sum(f2,-f))**2)/(&n1+&n2));
run;

/* Compute Kolmogorov-Smirnov statistic (divide by 100: EDFs are in percent) */
proc sql;
  select max(s)/100 as KS format=best16.
    from edfks;
quit;
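For an independent cross-check outside SAS, the same steps can be sketched in Python. This is only an illustration under my own assumptions: the function names and the two small samples are hypothetical, and the KS formula is my reading of the pooled-EDF definition that the DATA step above implements. The function also returns D (the largest absolute difference between the two EDFs), since PROC NPAR1WAY reports both.

```python
from math import sqrt

def edf(values, x):
    """Empirical distribution function: proportion of values <= x."""
    return sum(1 for v in values if v <= x) / len(values)

def ks_statistics(sample1, sample2):
    """Return (KS, D) for two samples.

    KS: max over observed x of sqrt(sum_i n_i * (F_i(x) - F(x))**2 / n),
        where F is the pooled EDF (the quantity computed in the SAS steps).
    D:  max over observed x of |F_1(x) - F_2(x)|.
    """
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    pooled = list(sample1) + list(sample2)
    ks = d = 0.0
    for x in sorted(set(pooled)):
        f1, f2, f = edf(sample1, x), edf(sample2, x), edf(pooled, x)
        ks = max(ks, sqrt((n1 * (f1 - f) ** 2 + n2 * (f2 - f) ** 2) / n))
        d = max(d, abs(f1 - f2))
    return ks, d

# Hypothetical data, for illustration only (not the data from this thread).
pop = [1, 2, 3, 4, 5, 6]
smp = [2, 4, 4, 7]
ks, d = ks_statistics(pop, smp)
print(ks, d)
```

Feeding the real `ascore` values of the two groups into `ks_statistics` should give numbers comparable to the KS and D rows of the PROC NPAR1WAY output, provided the definitions assumed here match the documentation.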
I don't know what to think of any calculations you did in Excel.
I guess the real question here is why don't you trust SAS in this case? The people at SAS spend a lot of time verifying that their calculations are correct. What is your concern, why do you need to verify "manually"?
Did you use the code from @Rick_SAS to do this calculation in PROC IML? Did you plot the data to see if the SAS results make sense from the plot?
Paige Miller
I have seen the formulas in the SAS documentation, but they look complex and very time-consuming to work through. I have also seen posts from @Rick_SAS, but I could not translate them to my problem.
I have a population of data from which I take a sample. I use KS to find out whether the sample differs from the population. This is the question I am trying to answer here.
I have a population of data from which I take a sample. I use KS to find out whether the sample differs from the population. This is the question I am trying to answer here.
Do you mean a random sample, or some other type of sample? If random, I'm sure the sample differs in some small way from the population. What I am struggling with is why this is even a question at all. From a statistical point of view, I can't recall seeing an example (or a reason) to test if a random sample differs from the entire population.
I am not sure that I understand how NPAR1WAY computes KS. I mean, I don't know whether any assumptions are made in the background of NPAR1WAY that affect the results.
Explain this. Why don't you understand? What about it don't you understand? Is it that you don't understand KS (which is a different question than what SAS is doing)? I don't think SAS builds in any assumptions that are not in the definition of a KS test.
Again, did you plot the data and take a look at the graphics? Do the KS calculations make sense from what you can see in the plot? If they don't make sense, show us and explain.
Paige Miller
Do you mean a random sample, or some other type of sample? If random, I'm sure the sample differs in some small way from the population. What I am struggling with is why this is even a question at all. From a statistical point of view, I can't recall seeing an example (or a reason) to test if a random sample differs from the entire population.
Yes, this is random sampling using Proc Survey. We expect the sample to be somewhat different from the population, but not completely, because it will lose some of the characteristics of the population.
Explain this. Why don't you understand? What about it don't you understand? Is it that you don't understand KS (which is a different question than what SAS is doing)? I don't think SAS builds in any assumptions that are not in the definition of a KS test.
I mean: what are the calculations behind NPAR1WAY? For many functions there are alternative ways to do the same thing. For example, I have seen some functions that accept only continuous variables.
Again, did you plot the data and take a look at the graphics? Do the KS calculations make sense from what you can see in the plot? If they don't make sense, show us and explain.
@Toni2 wrote:
Do you mean a random sample, or some other type of sample? If random, I'm sure the sample differs in some small way from the population. What I am struggling with is why this is even a question at all. From a statistical point of view, I can't recall seeing an example (or a reason) to test if a random sample differs from the entire population.
Yes, this is random sampling using Proc Survey. We expect the sample to be somewhat different from the population, but not completely, because it will lose some of the characteristics of the population.
Why? What do you hope to learn?
I mean: what are the calculations behind NPAR1WAY? For many functions there are alternative ways to do the same thing. For example, I have seen some functions that accept only continuous variables.
Is your question really "what are the calculations of the KS test"? I don't think SAS is doing anything other than the standard calculations for the KS test here, which you can learn more about on Wikipedia.
Paige Miller
PROC UNIVARIATE performs a one-sample KS test. This is the test that I have blogged about.
PROC NPAR1WAY performs a two-sample test. The Wikipedia article for a two-sample test uses the max distance between the two ECDFs to define the KS statistic. However, the doc for NPAR1WAY gives a formula (for k-samples) that depends on the differences between the ECDFs and the pooled ECDF.
Thanks for your response. I have three questions about NPAR1WAY, if you can advise, please:
1) Do the two approaches above lead to different KS values?
2) Can we use NPAR1WAY with any numeric variable (continuous, discrete, etc.)?
3) Can we use NPAR1WAY for character variables?
Thanks again
1) Possibly yes, but I haven't worked through the math.
2) The KS (and all ECDF tests) are for numeric continuous variables. The CLASS variable can (and should) be discrete.
3) No, see (2).
Bonus: Here is a link to the NPAR1WAY doc. The Overview and Getting Started sections describe the capabilities of the procedure.
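On question 1, the two-sample case can be worked through by hand. The sketch below uses my own notation and relies on my reading of the k-sample KS formula in the PROC NPAR1WAY documentation, so treat it as a derivation to verify rather than an official statement. Here $F_1$ and $F_2$ are the two ECDFs, $F$ is the pooled ECDF, $n_1$ and $n_2$ are the class sizes, $n = n_1 + n_2$, and $x_j$ runs over the observed values.

```latex
\begin{aligned}
D &= \max_j \bigl\lvert F_1(x_j) - F_2(x_j) \bigr\rvert, \qquad
KS = \max_j \sqrt{\frac{1}{n}\sum_{i=1}^{2} n_i \bigl(F_i(x_j) - F(x_j)\bigr)^2}, \\[4pt]
F &= \frac{n_1 F_1 + n_2 F_2}{n}
\;\Longrightarrow\;
F_1 - F = \frac{n_2}{n}\,(F_1 - F_2), \qquad
F_2 - F = -\frac{n_1}{n}\,(F_1 - F_2), \\[4pt]
\frac{1}{n}\sum_{i=1}^{2} n_i\,(F_i - F)^2
  &= \frac{n_1 n_2^2 + n_2 n_1^2}{n^3}\,(F_1 - F_2)^2
   = \frac{n_1 n_2}{n^2}\,(F_1 - F_2)^2, \\[4pt]
KS &= \frac{\sqrt{n_1 n_2}}{n}\; D .
\end{aligned}
```

Under these assumptions the two statistics are different numbers, but they differ only by a constant factor that depends on the class sizes, so they carry the same information for a test. With equal class sizes the factor is 1/2, which is consistent with the values reported at the top of the thread (0.066667 = 0.133333 / 2).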
thanks for your support 🙂
Your points are very helpful since it seems sometimes we do things without thinking!
Great, thanks! A quick question, since I have read a lot on Google.
I want to compare two samples to find out whether there is a significant difference between them. Should I use the KS or the D statistic in SAS? I am not sure I understand the difference.
@Toni2 wrote:
I want to compare two samples to find out whether there is a significant difference between them. Should I use the KS or the D statistic in SAS? I am not sure I understand the difference.
So you want to perform a statistical test. The p-value of the asymptotic Kolmogorov-Smirnov two-sample test in the default output of PROC NPAR1WAY is denoted as "Pr > KSa", so it apparently refers to the asymptotic Kolmogorov-Smirnov statistic KSa (= KS*sqrt(n)). When you specify the D option of the PROC NPAR1WAY statement, the output contains a p-value "Pr > D" instead (referring to the D statistic), which was identical to "Pr > KSa" (in the case of two samples) in all test cases that I've run. The documentation describes the p-value in terms of the D statistic.
Since you have only two samples ("class levels"), D is applicable, and its definition is simpler, so I would use the D statistic (also in view of the equality of p-values mentioned above).
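The equality of the two p-values is plausible if, for two samples, KSa = KS*sqrt(n) works out to D*sqrt(n1*n2/n), so that a p-value computed from KSa is simply a function of D. The Python sketch below illustrates that idea; it is not SAS's implementation. The function name is hypothetical, and the series used is the standard asymptotic Kolmogorov distribution, which I assume (without having checked every detail) is what the documentation uses for "Pr > KSa". It does not cover the exact test requested with the EXACT statement.

```python
from math import exp, sqrt

def asymptotic_ks_pvalue(d, n1, n2):
    """Asymptotic two-sample KS p-value computed from D and the class sizes.

    Assumption: KSa = D * sqrt(n1 * n2 / (n1 + n2)), and
    p = 2 * sum_{i>=1} (-1)**(i-1) * exp(-2 * i**2 * KSa**2).
    """
    ksa = d * sqrt(n1 * n2 / (n1 + n2))
    if ksa == 0:
        return 1.0  # identical EDFs: no evidence of a difference
    total = 0.0
    for i in range(1, 100001):
        term = exp(-2.0 * i * i * ksa * ksa)
        total += term if i % 2 == 1 else -term
        if term < 1e-16:  # remaining terms are negligible
            break
    # Clamp to [0, 1] to guard against rounding at the extremes.
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical inputs, for illustration only (class sizes are made up).
print(asymptotic_ks_pvalue(0.133333, 100, 100))
```

Because the p-value depends on the data only through D and the class sizes, reporting D together with "Pr > D" loses nothing relative to KS and KSa in the two-sample case.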