Toni2
Lapis Lazuli | Level 10

Hi, I spent some time searching on Google for how to compute the two-sample KS statistic manually. I then computed it in Excel and compared the result with PROC NPAR1WAY.

 

Below are the results for the same data :

 

manual KS : 0.059424551 

npar1way : 0.066667 (D =0.133333)

 

I can't find why the outcome is not the same ? 

 

The attached Excel file contains the raw data and the manual calculation (the formulas are in the cells).

 

In addition, the SAS tab contains the data as I use them in SAS to run PROC NPAR1WAY:

 

ods graphics on;
proc npar1way edf plots=edfplot data=final;
  class source;
  var application_score;
  output out=stat edf;
  /* exact ks; */
run;
ods graphics off;

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @Toni2,


@Toni2 wrote:

manual KS : 0.059424551 

npar1way : 0.066667 (D =0.133333)

 

I can't find why the outcome is not the same ? 


To get back to your original question: Your "manual KS" uses cumulative sums of scores (relative to overall totals of scores), whereas the values of the empirical distribution function (EDF) are proportions of cumulative numbers of scores, relative to the total number of scores. So, in the calculation you need to count scores (to find out how many of them are less than or equal to some value), not to add scores. The cumulative percentages that you can obtain with PROC FREQ are basically values of the EDF in percent.
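
To make the difference between counting and adding concrete, here is a minimal sketch on made-up data. The dataset TOY, its five values, and the names TOY_EDF and TOY_SUMSHARE are hypothetical and are not taken from the attached workbook. PROC FREQ gives the count-based EDF, and the DATA step reproduces a sum-based share of the kind described above.

```sas
/* Toy data: five scores from one sample (hypothetical values) */
data toy;
  input score @@;
  datalines;
10 20 20 30 50
;
run;

/* Total of all scores, needed only for the sum-based (incorrect) version */
proc sql noprint;
  select sum(score) into :total trimmed from toy;
quit;

/* Count-based EDF in percent: share of observations with a value <= score */
proc freq data=toy noprint;
  tables score / outcum out=toy_edf(keep=score cum_pct);
run;

/* Sum-based share in percent: running total of score values over the grand total */
data toy_sumshare;
  set toy;
  cum_sum + score;                      /* sum statement: retained running total */
  sum_share = 100 * cum_sum / &total;
run;
```

At score 20 the EDF is 3/5 = 60%, whereas the share of summed scores is 50/130 ≈ 38.5%, so the two curves (and any maximum distance derived from them) generally differ.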

 

Do you trust the rightmost column "Cumulative Percent" of a basic PROC FREQ output? If so, you can use PROC FREQ to compute the EDFs "manually" and then apply the formula of the KS statistic from the PROC NPAR1WAY documentation.

 

Here's how (using your dataset FINAL with variable application_score renamed to ascore for brevity):

/* Store number of scores from 'pop' and 'smp' in macro variables n1, n2 */

proc sql noprint;
select count(ascore) into :n1 trimmed
from final(where=(source='pop'));
select count(ascore) into :n2 trimmed
from final(where=(source='smp'));
quit;

/* Compute EDF (in %) for SOURCE='pop' */

proc freq data=final noprint;
where source='pop';
tables ascore / outcum out=edf1(keep=ascore cum_pct rename=(cum_pct=f1));
run;

/* Compute EDF (in %) for SOURCE='smp' */

proc freq data=final noprint;
where source='smp';
tables ascore / outcum out=edf2(keep=ascore cum_pct rename=(cum_pct=f2));
run;

/* Compute pooled EDF (in %) */

proc freq data=final noprint;
tables ascore / outcum out=edf0(keep=ascore cum_pct rename=(cum_pct=f));
run;

/* Combine all three EDFs */

data edf_all;
merge edf0-edf2;
by ascore;
_=.;
run;

/* Fill missing values */

data edf(drop=_);
update edf_all(obs=0) edf_all;
by _;
output;
run;

/* Prepare computation of Kolmogorov-Smirnov statistic */

data edfks;
set edf;
s=sqrt((&n1*(sum(f1,-f))**2+&n2*(sum(f2,-f))**2)/(&n1+&n2));
run;

/* Compute Kolmogorov-Smirnov statistic */

proc sql;
select max(s)/100 as KS format=best16.
from edfks;
quit;
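
As a cross-check, the two-sample D statistic (the maximum absolute gap between the two EDFs) can be read off the same filled table. This is only a sketch: it assumes the dataset EDF with variables f1 and f2 exactly as created in the steps above. SUM() is used instead of a plain subtraction so that a missing EDF value below a sample's smallest score counts as 0, as in the KS step.

```sas
/* D = max |F1 - F2|; f1 and f2 are in percent, hence the division by 100 */
proc sql;
  select max(abs(sum(f1,-f2)))/100 as D format=best16.
  from edf;
quit;
```

If the EDFs are computed correctly, this value should agree with the D that PROC NPAR1WAY reports for the same data.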


14 REPLIES 14
PaigeMiller
Diamond | Level 26

I don't know what to think of any calculations you did in Excel. 

 

I guess the real question here is why don't you trust SAS in this case? The people at SAS spend a lot of time verifying that their calculations are correct. What is your concern, why do you need to verify "manually"?

 

Did you use the code from @Rick_SAS to do this calculation in PROC IML? Did you plot the data to see if the SAS results make sense from the plot?

--
Paige Miller
Toni2
Lapis Lazuli | Level 10
Valid point! i am not sure that i understand how npar1way estimates KS. I mean i don't know if there are any assumptions which are taken in background of the npar1way which affect the results.

I have seen the formulas in the SAS documentation, but they look complex and very time-consuming to work through. I have also seen posts from @Rick_SAS, but I could not apply them to my problem.

I have a population of data from which i take a sample from this population. I use KS to understand if the sample from this population is different. This is the question i am trying to answer here
PaigeMiller
Diamond | Level 26

I have a population of data from which i take a sample from this population. I use KS to understand if the sample from this population is different. This is the question i am trying to answer here

Do you mean a random sample, or some other type of sample? If random, I'm sure the sample differs in some small way from the population. What I am struggling with is why this is even a question at all. From a statistical point of view, I can't recall seeing an example (or a reason) to test if a random sample differs from the entire population.

 

 

i am not sure that i understand how npar1way estimates KS. I mean i don't know if there are any assumptions which are taken in background of the npar1way which affect the results.

 

Explain this. Why don't you understand? What about it don't you understand? Is it that you don't understand KS (which is a different question than what SAS is doing)? I don't think SAS builds in any assumptions that are not in the definition of a KS test.

 

Again, did you plot the data and take a look at the graphics? Do the KS calculations make sense from what you can see in the plot? If they don't make sense, show us and explain.

--
Paige Miller
Toni2
Lapis Lazuli | Level 10

Do you mean a random sample, or some other type of sample? If random, I'm sure the sample differs in some small way from the population. What I am struggling with is why this is even a question at all. From a statistical point of view, I can't recall seeing an example (or a reason) to test if a random sample differs from the entire population.


Yes, this is a random sample drawn using Proc Survey. We expect the sample to be somewhat different from the population, but not completely, because it will lose some of the characteristics of the population

 


Explain this. Why don't you understand? What about it don't you understand? Is it that you don't understand KS (which is a different question than what SAS is doing)? I don't think SAS builds in any assumptions that are not in the definition of a KS test.

I mean what are the calculations behind the npar1way since for many functions there are alternative ways to do the same thing. For example, in some functions i have seen they use only continuous variables

 

Again, did you plot the data and take a look at the graphics? Do the KS calculations make sense from what you can see in the plot? If they don't make sense, show us and explain.

 
Toni2
Lapis Lazuli | Level 10
I tried twice to edit my comment and add the plots, but I could not. The plot from SAS makes sense as a picture of the KS statistic; the Excel graph looks a bit different, but the trend is the same. I am not sure whether this is because of the scale or because I selected the wrong chart type.
PaigeMiller
Diamond | Level 26

@Toni2 wrote:

Do you mean a random sample, or some other type of sample? If random, I'm sure the sample differs in some small way from the population. What I am struggling with is why this is even a question at all. From a statistical point of view, I can't recall seeing an example (or a reason) to test if a random sample differs from the entire population.


Yes, this is a random sample drawn using Proc Survey. We expect the sample to be somewhat different from the population, but not completely, because it will lose some of the characteristics of the population


 

Why? What do you hope to learn?

 

I mean what are the calculations behind the npar1way since for many functions there are alternative ways to do the same thing. For example, in some functions i have seen they use only continuous variables

 

Is your question really "what are the calculations of the KS test"? I don't think SAS is doing anything other than the standard calculations for the KS test here, which you can learn more about on Wikipedia.

 

--
Paige Miller
Rick_SAS
SAS Super FREQ

PROC UNIVARIATE performs a one-sample KS test. This is the test that I have blogged about.

 

PROC NPAR1WAY performs a two-sample test. The Wikipedia article for a two-sample test uses the max distance between the two ECDFs to define the KS statistic. However, the doc for NPAR1WAY gives a formula (for k-samples) that depends on the differences between the ECDFs and the pooled ECDF. 
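
For two samples, the two definitions can be related algebraically. The following is a sketch of that algebra, writing F_1 and F_2 for the two ECDFs, F for the pooled ECDF, n_1 and n_2 for the sample sizes, and n = n_1 + n_2 (notation chosen here for illustration; check it against the NPAR1WAY doc).

```latex
D = \max_j \left| F_1(x_j) - F_2(x_j) \right|,
\qquad
KS = \max_j \sqrt{\frac{1}{n}\sum_{i=1}^{2} n_i \bigl(F_i(x_j) - F(x_j)\bigr)^2}

F = \frac{n_1 F_1 + n_2 F_2}{n}
\;\Rightarrow\;
F_1 - F = \frac{n_2}{n}\,(F_1 - F_2), \quad
F_2 - F = -\frac{n_1}{n}\,(F_1 - F_2)

\sum_{i=1}^{2} n_i (F_i - F)^2 = \frac{n_1 n_2}{n}\,(F_1 - F_2)^2
\;\Rightarrow\;
KS = \frac{\sqrt{n_1 n_2}}{n}\; D
```

So the two statistics differ only by a factor that depends on the sample sizes. The ratio in the original post, 0.066667 / 0.133333 = 1/2, matches the case n_1 = n_2.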

Toni2
Lapis Lazuli | Level 10

Thanks for your response here. I have three questions for the npar1way if you can advise please

 

1) Do the two approaches above lead to different KS values?

 

2) Could we use npar1way with any numeric variable (continuous, discrete etc.)?

 

3) Could we use npar1way for character variables? 

 

Thanks again

 

 

Rick_SAS
SAS Super FREQ

1) Possibly yes, but I haven't worked through the math. 

2) The KS (and all ECDF tests) are for numeric continuous variables. The CLASS variable can (and should) be discrete.

3) No, see (2).

 

Bonus: Here is a link to the NPAR1WAY doc.  The Overview and Getting Started sections describe the capabilities of the procedure.

Toni2
Lapis Lazuli | Level 10
thank you again, this is really useful!
Toni2
Lapis Lazuli | Level 10

thanks for your support 🙂

 

Your points are very helpful since it seems sometimes we do things without thinking! 

Toni2
Lapis Lazuli | Level 10

great! thanks. Quick questions since i have read a lot on Google.

 

I want to compare two samples to understand if there is significant difference between them. Do i need to use the KS or the D statistic in SAS since i am not sure that i can understand the differences?  

FreelanceReinh
Jade | Level 19

@Toni2 wrote:

I want to compare two samples to understand if there is significant difference between them. Do i need to use the KS or the D statistic in SAS since i am not sure that i can understand the differences?  


So you want to perform a statistical test. The p-value of the asymptotic Kolmogorov-Smirnov two-sample test in the default output of PROC NPAR1WAY is denoted as "Pr > KSa", so apparently refers to the asymptotic Kolmogorov-Smirnov statistic KSa (=KS*sqrt(n)). When you specify the D option of the PROC NPAR1WAY statement, the output contains a p-value "Pr > D" instead (referring to the D statistic), which was identical with "Pr > KSa" (in the case of two samples) in all test cases that I've run. The documentation describes the p-value in terms of the D statistic.

 

Since you have only two samples ("class levels"), so D is applicable, and the definition of D is simpler, I would use the D statistic (also in view of the equality of p-values mentioned above).
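
One plausible explanation for the equal p-values, to be checked against the PROC NPAR1WAY documentation: with the pooled EDF written as the size-weighted average of the two sample EDFs, KSa and D reduce to the same quantity z, and the asymptotic p-value depends on z alone. In the sketch below, n_1 and n_2 are the two sample sizes and n = n_1 + n_2 (notation chosen here for illustration).

```latex
KS_a = KS\,\sqrt{n} = D\,\sqrt{\frac{n_1 n_2}{n}} =: z,
\qquad
\Pr(D > d) \approx 2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 z^2}
```

Under that reading, "Pr > KSa" and "Pr > D" are the same tail probability expressed in terms of two differently scaled statistics.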
