I have two datasets, A and B.
I use the KS statistic to test whether A has come from B.
Now, I have a variable c that is categorical. It takes values from 400 to 744. Each number denotes a type of sub-product. For example, 400 means product x, 552 means product y, and so on.
I wonder if I can use KS with this variable?
The two-sample KS test is a way to determine whether the distribution of some continuous variable is the same for two groups. For example, the following SAS statements analyze whether the distribution of height is the same for boys and for girls:
proc npar1way data=sashelp.class;
class sex;
var height;
run;
Now let's examine your question. You cannot use the KS test on a categorical variable, but you can use a categorical variable to determine subgroups of the data that you want to test. For example, suppose that you want to analyze the distribution of PRICE at two kinds of stores: convenience stores and grocery stores. You might have several kinds of snacks that you want to analyze, such as potato chips, pretzels, tortilla chips, and so forth. If so, you can use a BY statement to run the analysis for each product.
The following SAS statements analyze the distribution of prices (PRICE) between convenience stores and grocery stores (STORETYPE) for each kind of product (PRODUCT):
proc sort data=Prices;
by Product;
run;
proc npar1way data=Prices ks D;
by Product; /* repeat analysis for each type of product */
class StoreType; /* 'Convenience' or 'Grocery' */
var Price; /* analyze the distribution of prices */
run;
In your original question, A and B would be the prices for convenience stores and grocery stores, respectively. C would be the product type.
Trying to use C (product type) in any other way probably does not give you a correct analysis.
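To try the BY-group analysis above, you need a Prices data set. Here is a minimal sketch that simulates one. The product names, sample sizes, and price distributions are made up for illustration and are not from any real data:

```sas
/* Hypothetical data: 50 prices per product and store type */
data Prices;
   call streaminit(123);
   length Product $14 StoreType $11;
   do Product = 'Potato chips', 'Pretzels', 'Tortilla chips';
      do StoreType = 'Convenience', 'Grocery';
         do i = 1 to 50;
            /* assumed: convenience-store prices are shifted up by 0.50 */
            Price = round(3 + 0.5*(StoreType='Convenience')
                          + rand("Normal", 0, 0.4), 0.01);
            output;
         end;
      end;
   end;
   drop i;
run;
```

After this DATA step, the PROC SORT and PROC NPAR1WAY statements shown earlier should run as written and produce one KS analysis per product.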
I wonder if I can use KS with this variable?
Probably not. What is the question you want to answer?
@Toni2 wrote:
Thanks. The question is: if we use KS on variable c, can we draw reliable conclusions about whether A has come from B?
I don't really know what this means. Much more explanation needed, and we probably need an example in detail, as well. Don't write one brief sentence, that's not what I am looking for when I ask for "much more explanation". Don't talk about KS test, talk about what question you want the data to help you answer.
Well, below is a small extract from dataset A for variable c. Dataset A has approx. 350k observations.
c
737
701
702
702
742
735
710
731
702
710
This is a small extract from dataset B for variable c. B has approx. 10m observations.
c
407
737
724
701
702
702
724
742
701
710
As I wrote in my initial post, c is a categorical variable, so each value corresponds to a characteristic. When I use KS to test whether A above has come from B, the test passes for c, but I suspect the result is not meaningful because the values are labels rather than actual numbers.
On the other hand, if the test takes into consideration how often each value occurs (for example, the value 702 appears 3 times), then KS can answer the question of whether A has come from B.
Yes, it is wrong to use PROC NPAR1WAY to analyze a discrete variable. The analyses and tests assume that the data are continuous.
You can use PROC FREQ and a chi-square test to test whether the frequency distribution of the products differs between the A group and the B group. For example, the following statements simulate two groups and 10 products. The frequency distribution of the A group is slightly different from the B group. The chi-square test (or a related test) can detect this difference for very large samples, but not for small samples:
data FakeData;
call streaminit(1);
array ProbA[10] (.1 .1 .1 .1 .1 .1 .1 .1 .1 .1);
array ProbB[10] (.08 .1 .1 .1 .11 .1 .09 .1 .12 .1);
Group = 'A';
do i = 1 to 1000;
Product = rand("Table", of ProbA[*]); /* simulate a product (1-10) for Group='A' */
output;
end;
Group = 'B';
do i = 1 to 600;
Product = rand("Table", of ProbB[*]); /* simulate a product (1-10) for Group='B' */
output;
end;
run;
proc freq data=FakeData;
tables Group*Product / chisq expected nopercent nocol;
run;
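For the data in this thread, the same idea might look like the following sketch. It assumes the two data sets are named A and B in the WORK library and that each contains the variable c. The names Both and Group are hypothetical and only used for illustration:

```sas
/* Stack the two data sets and record which one each row came from */
data Both;
   length Group $1;
   set A(in=inA keep=c) B(keep=c);
   if inA then Group = 'A';
   else Group = 'B';
run;

/* Compare the frequency distribution of c between the two groups */
proc freq data=Both;
   tables Group*c / chisq nopercent nocol;
run;
```

Because c is treated purely as a label here, the numeric codes (400 to 744) play no role other than identifying the sub-product.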
But I can save you the trouble of running the analysis. If one group has 350k observations and the other has 10M observations, then the tests are likely to reject the null hypothesis that the distributions are the same. To understand why I say this, read "Goodness-of-fit tests: A cautionary tale for large and small samples."
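To illustrate the large-sample point, here is a rough sketch that reuses the simulation idea above with group sizes similar to the ones in this thread. The data set name BigFake and the probabilities are assumptions chosen for illustration. With samples this large, the chi-square test is likely to reject the null hypothesis even though the two distributions differ only slightly. Cramer's V, which PROC FREQ prints with the CHISQ option, gives a sense of how small the difference is:

```sas
data BigFake;
   call streaminit(1);
   array ProbA[10] (.1 .1 .1 .1 .1 .1 .1 .1 .1 .1);
   /* assumed: Group B differs from Group A by at most 0.003 per product */
   array ProbB[10] (.097 .1 .1 .1 .103 .1 .098 .1 .102 .1);
   Group = 'A';
   do i = 1 to 350000;
      Product = rand("Table", of ProbA[*]);
      output;
   end;
   Group = 'B';
   do i = 1 to 10000000;
      Product = rand("Table", of ProbB[*]);
      output;
   end;
   keep Group Product;
run;

proc freq data=BigFake;
   tables Group*Product / chisq nopercent nocol;
run;
```

If the run time is a concern, you can reduce the loop counts. The qualitative behavior (small p-value, small effect size) should remain as long as the samples stay large.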