One can perform the Kolmogorov–Smirnov, Cramér–von Mises, and Anderson–Darling tests in PROC UNIVARIATE as distributional hypothesis tests. For example, the following generates normal observations with μ=σ=10.
data _;
   call streaminit(1);
   do i=1 to 5000;
      x=rand("normal",10,10);
      output;
   end;
run;
For these observations, one can use UNIVARIATE to check whether they follow a normal distribution with those parameters.
proc univariate;
   var x;
   histogram / normal(mu=10, sigma=10);
run;
SAS provides some other distributions as well, such as the beta, exponential, and gamma distributions (http://support.sas.com/documentation/cdl/en/procstat/66703/HTML/default/procstat_univariate_syntax09...), but is there another way to do this for other distributions? For example, one cannot use UNIVARIATE if the hypothesized distribution is the F distribution with d1=d2=10, as follows.
data _;
   call streaminit(1);
   do i=1 to 5000;
      x=rand("f",10,10);
      output;
   end;
run;
I also found PROC NPAR1WAY with the EDF option, but it provides a two-sample test that compares two different samples, whereas I need the usual one-sample goodness-of-fit test. Thanks in advance.
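To make the question concrete, here is a rough sketch of the kind of one-sample test I have in mind, computed by hand in a DATA step with the CDF function; the sorted data set name f_sorted and the by-hand calculation are only illustrative, not a built-in UNIVARIATE feature.

proc sort data=_ out=f_sorted;
   by x;
run;

data _null_;
   set f_sorted end=last nobs=n;
   retain d 0;
   F0 = cdf("F", x, 10, 10);                           /* hypothesized F(10,10) CDF */
   d = max(d, abs(_n_/n - F0), abs((_n_-1)/n - F0));    /* running KS statistic      */
   if last then put "One-sample KS statistic D = " d;
run;

The p-value would still have to be computed or simulated separately, so this is only a starting point rather than a replacement for a procedure.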
I wrote about this issue in the article "Why doesn't PROC UNIVARIATE support certain common distributions?"
Basically, the answer is that PROC UNIVARIATE supports common DATA distributions. The F, t, and chi-square distributions are distributions that arise in studying the sampling distributions of statistics for certain hypothesis tests.
Rick, I understand the difference between the sampling distribution and the data distribution, but sometimes one simulates a sampling distribution and then has data that represent that simulated sampling distribution. For example, one can check whether the sampling distribution of the sample variance from normal data is the chi-squared distribution (https://newonlinecourses.science.psu.edu/stat414/node/174/). According to that document, the properly scaled sample variance, (n-1)*S^2/sigma^2, follows the chi-squared distribution with n-1 degrees of freedom. In this case, we can simulate the sample variance 5,000 times, for example.
data sample;
   do s=1 to 5000;      /* 5,000 simulated samples     */
      do i=1 to 100;    /* 100 observations per sample */
         x=rannor(1);
         output;
      end;
   end;
run;
proc means noprint;
   var x;
   by s;
   output out=stat var=s2;
run;
data stat;
   set stat;
   s2_=99*s2/1;   /* (n-1)*S^2/sigma^2 with n=100 and sigma^2=1 */
run;
In the code, x is the normal variable (100 observations for each of 5,000 simulations), s2 is the sample variance, and s2_ is the scaled version. To see whether s2_ follows the chi-squared distribution, one can use UNIVARIATE, which instead offers the more general gamma distribution, as you mentioned in your post. The chi-squared distribution with 99 degrees of freedom is the gamma distribution with theta=0, sigma=2, and alpha=99/2.
proc univariate noprint;
   var s2_;
   histogram / gamma(theta=0, sigma=2, alpha=49.5);
run;
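As a quick, purely illustrative sanity check of that parameterization (the evaluation point 90 is arbitrary and my own choice), the chi-squared(99) density should match the gamma density with shape 99/2 and scale 2:

data _null_;
   x = 90;
   chisq = pdf("chisquare", x, 99);
   gam   = pdf("gamma", x, 49.5, 2);   /* PDF('GAMMA', x, shape, scale) */
   put chisq= gam=;                    /* the two values should agree   */
run;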
In this working code, as expected, none of the Kolmogorov–Smirnov, Cramér–von Mises, and Anderson–Darling tests rejects the null. If a suitable option such as GAMMA were unavailable, what I found instead is an indirect two-sample test using NPAR1WAY: generate random numbers from the target distribution, and test whether the simulated distribution above and the freshly generated sample are reasonably close.
data simulate;
   _type_=1;
   do s=1 to 5000;
      s2_=rand("chisq",99);
      output;
   end;
run;
proc append base=stat data=simulate;
run;
Then, the stat data set contains 5,000 simulated sample variances as _type_=0 and 5,000 random numbers from the chi-squared distribution as _type_=1.
proc npar1way edf;
   var s2_;
   class _type_;
run;
As expected again, neither the Kolmogorov–Smirnov test nor the Kuiper test rejects the null hypothesis. As mentioned above, I hope that someday UNIVARIATE embraces other distributions as well, since data often come from distributions such as the chi-squared or F distribution.
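For completeness, the same two-sample workaround can be sketched for the F(10,10) example from the original question; the data set name f_both, the seed, and the use of a second simulated sample as the reference are all my own illustrative choices.

data f_both;
   call streaminit(2);
   do i=1 to 5000;
      grp=0; x=rand("f",10,10); output;   /* sample whose distribution is in question */
      grp=1; x=rand("f",10,10); output;   /* reference draws from F(10,10)            */
   end;
run;

proc npar1way data=f_both edf;
   var x;
   class grp;
run;

In a real application, the grp=0 observations would be the actual data to be tested rather than simulated values.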