Hello, guys
I am reading a textbook of mathematical statistics, and it is fun and precise. I wonder:
Is there any procedure to get f_hat(x) in SAS? Thanks
Hello @TomHsiung
For this issue, you might be best served by contacting Technical Support and requesting SAS Content Assessment Tool support.
Here is a link for your convenience: https://support.sas.com/en/technical-support.html#contact
Thank you for using SAS!
Thank you for the post; we are responding in your TS case. Thanks, and have a great day!
What you refer to is in fact a kernel estimate of the probability density function (PDF). Kernel estimators of the PDF can readily be obtained in basic SAS procedures like PROC UNIVARIATE. For instance, if I wish to obtain such an estimator of a variable named y, I can use the following code to let SAS do the job:
proc univariate data=a;
  var y;
  histogram / kernel;
run;
Or, in PROC SGPLOT:
proc sgplot data=a;
  histogram y;
  density y / type=kernel(weight=normal) legendlabel='normal kernel';
  density y / type=kernel(weight=quadratic) legendlabel='quadratic';
  density y / type=kernel(weight=triangular) legendlabel='triangular';
run;
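Both snippets assume an input dataset named a with a numeric variable y. If you want something to try them on, a minimal simulated dataset could look like this (the dataset name, variable name, distribution and parameters are just placeholders, not anything from your book):
data a;
  call streaminit(2024);
  do i = 1 to 200;
    y = rand("NORMAL", 10, 5);  /* arbitrary normal sample for illustration */
    output;
  end;
  drop i;
run;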
Only the normal, quadratic and triangular kernels are supported in the two procedures. The rectangular kernel does not seem to be supported. However, my experience with kernel density estimation tells me that the choice of kernel has little effect on the PDF estimate: estimates under different kernel functions look very much alike. Of more importance is the choice of bandwidth, which is expressed as 2h in your book.
@TomHsiung wrote:
Hi, Season. Thank you for your answer. Yep, the bandwidth determines the degree of the approximation of unbiased estimation. The smaller, the better, I think. Can we set it manually in PROC UNIVARIATE or PROC SGPLOT? Thx
Controlling the bandwidth is easy in SAS. It can be set with the c= option, one of the kernel options in both PROC UNIVARIATE and PROC SGPLOT. Like this:
proc univariate data=a;
  var y;
  histogram / kernel(c=1);
run;
And this:
proc sgplot data=a;
  histogram y;
  density y / type=kernel(weight=normal c=1) legendlabel='normal kernel';
  density y / type=kernel(weight=quadratic c=1) legendlabel='quadratic';
  density y / type=kernel(weight=triangular c=1) legendlabel='triangular';
run;
@TomHsiung wrote:
Hi, Season. Thank you for your answer. Yep, the bandwidth determines the degree of the approximation of unbiased estimation. The smaller, the better, I think.
Not really. This is the issue that I would like to underscore. In fact, the choice of bandwidth is a trade-off between the specifics and the trends of the data. If the bandwidth is too large, the kernel gives only a general summary of the data and cannot capture the ups and downs of the PDF as it fluctuates. On the other hand, if the bandwidth is too small, minute oscillations away from the true PDF are also captured by the kernel density estimate, making it a mixture of the trend of the PDF and stochastic noise. The objective of using the kernel function to estimate the PDF is to capture the trend and ignore the noise. Balancing the two is not easy on most occasions. That is why the choice of bandwidth is important. In practice, the selection of bandwidth is frequently "tainted" by the user's subjectivity.
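To see this trade-off in practice, one could overlay several bandwidths in PROC SGPLOT (reusing the hypothetical dataset a and variable y from the earlier snippets; the c= values below are arbitrary choices for illustration, not recommendations):
proc sgplot data=a;
  histogram y;
  density y / type=kernel(c=0.25) legendlabel='c=0.25';
  density y / type=kernel(c=1)    legendlabel='c=1';
  density y / type=kernel(c=3)    legendlabel='c=3';
run;
The small-c curve hugs the data and shows every bump; the large-c curve smooths the noise away but can also flatten genuine features of the PDF.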
There are lots of papers on nonparametric statistics that deal with bandwidth selection. I am not sure if someone has invented a data-driven or objective method of dealing with this problem. You can search on the web to find out.
But we need ξ to be as close to x as possible to get an unbiased estimate.
Good question. Actually, I had not thought about this question before. I agree that in theory, choosing a very small bandwidth will lead to PDF estimators closer to the true function. I guess the difference between theory and practice is caused by the limited sample size that a researcher usually has.
For instance, suppose I set the bandwidth to a minuscule positive value, say 1/1,000,000, and I have only a handful of observations, say 50. Then suppose that for a particular interval whose center is named x0, exactly one observation happens to fall into it. According to formula (4.1.13) in your first screenshot, the estimated density f_hat(x0) would be 1/((1/1,000,000)*50) = 20,000, a very large number. However, because the number of observations is small, the next interval, namely (x0 + 0.5/1,000,000, x0 + 1.5/1,000,000), might contain no observations at all, so f_hat(x0 + 1/1,000,000) = 0. See? The estimated PDF swings wildly from tens of thousands to zero, simply because of the presence or absence of a single observation. This is why choosing too small an interval brings in noise.
In addition, consider the situation in which we have two observations instead of one in the first interval, with all other conditions unchanged. In this scenario, f_hat(x0) simply doubles, and the dive to zero at f_hat(x0 + 1/1,000,000) becomes even more dramatic. In short, the estimated PDF is not robust against tiny alterations in the data.
However, if you have a very large sample, the numbers of observations that fall into the two intervals are both large and stable, and you can get a PDF estimate close to the true function despite a small bandwidth.
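If it helps, here is a small sketch of that arithmetic in a DATA step (the simulated sample, the bandwidth of 1/1,000,000 and the two evaluation points are assumptions chosen only to mirror the example above):
data tiny_bandwidth_demo;
  call streaminit(1);
  n  = 50;
  bw = 1e-6;                               /* the full bandwidth 2h = 1/1,000,000 */
  array xi[50] _temporary_;
  do i = 1 to n;
    xi[i] = rand("NORMAL");                /* a small sample of 50 observations */
  end;
  x0 = xi[1];                              /* centre the first window on an actual observation */
  do x = x0, x0 + bw;                      /* evaluate f_hat at x0 and at the adjacent window */
    count = 0;
    do i = 1 to n;
      count + (x - bw/2 <= xi[i] <= x + bw/2);
    end;
    f_hat = count / (bw * n);              /* 1/(1e-6*50) = 20,000 when one obs is inside, 0 when none */
    output;
  end;
  keep x count f_hat;
run;
The first row shows the huge spike driven by a single observation; the second row shows the estimate collapsing to zero one window over.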
@Season Thanks for the clarification. Yep, a small sample size results in a very large f_hat(x) if we choose a very small interval. The estimated PDF would have very high spikes. But the area must not exceed 1.00.
You are right. But look at the formula itself. By the formula of kernel density estimation, only the count in a given interval, the bandwidth, and the sample size play roles in determining its value. How could you prove that if you integrate the estimated PDF over the entirety of its support, the resulting value would be 1?
By the way, do not think that "PDFs" with an area under the curve larger than 1 are an absurd idea. If you take a look at Bayesian statistics, you will notice a concept termed an "improper prior". If you integrate those priors over their entire supports, you will get results larger than 1. Strictly speaking, these improper priors are not PDFs at all, but they are still used in Bayesian statistics.
By digressing to Bayesian statistics, I by no means endorse the notion that the area under a PDF curve can be larger than 1. I just want to say that nonparametric estimators like kernel density estimators are flexible and less dependent on the quality or nature of the data at hand. On many occasions, such flexibility also implies simple calculation formulae. However, it is also because of this that it may not be an all-round method that can meet every standard while maintaining its flexibility.
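For what it's worth, one way to check the area numerically is to approximate the integral of the rectangular-kernel estimate by a Riemann sum on a simulated sample (everything below, from the sample to the grid, is an assumption for illustration); the sum comes out very close to 1:
data area_check;
  call streaminit(123);
  n = 50;
  h = 2.5;                                  /* half-width, so the window is [x-h, x+h] */
  array xi[50] _temporary_;
  do i = 1 to n;
    xi[i] = rand("NORMAL", 10, 5);
  end;
  dx = 0.01;
  area = 0;
  do x = -20 to 40 by dx;                   /* grid wide enough to cover the whole support */
    count = 0;
    do i = 1 to n;
      count + (x - h <= xi[i] <= x + h);
    end;
    area + (count / (2*h*n)) * dx;          /* Riemann-sum approximation of the integral */
  end;
  put "Approximate area under f_hat: " area;
run;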
Hello @TomHsiung,
I am fairly sure that it is possible to implement that rectangular kernel function ("f_hat") in PROC FCMP: The READ_ARRAY function, for example, could read the dataset containing the values Xi into an array. Based on this array and the bandwidth (passed to the function as an argument) the rectangular kernel could be computed. As this is a piecewise constant function, implementing it amounts to storing a lookup table, which could be done by means of an ordered hash object. Now f_hat(x) for an arbitrary argument x could be calculated by sequentially searching the hash object.
Sadly, I don't have the time right now to work on this and to test the performance of such a PROC FCMP function. Given the availability of other kernel density estimation methods in PROC UNIVARIATE and PROC SGPLOT, as @Season has greatly explained already, and also PROC KDE and several other procedures (see keyword "Kernel Density Estimation" in Usage Note 30333: FASTats: Frequently Asked-For Statistics), implementing your own function might not be necessary.
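For what it's worth, here is a rough, untested sketch of the simpler half of that idea: using READ_ARRAY inside PROC FCMP and counting the observations in the window directly, without the hash-object lookup table described above. The dataset name xdata, the variable X_i, and the function name are all hypothetical:
proc fcmp outlib=work.funcs.kde;
  function f_hat_rect(x, h);
    /* read every X_i value from the (hypothetical) dataset xdata into a dynamic array */
    array xi[1] / nosymbols;
    rc = read_array('xdata', xi, 'X_i');
    n = dim(xi);
    /* count the observations falling inside [x-h, x+h] */
    count = 0;
    do i = 1 to n;
      if x - h <= xi[i] and xi[i] <= x + h then count = count + 1;
    end;
    return (count / (2 * h * n));
  endsub;
run;

options cmplib=work.funcs;

/* evaluate f_hat on a grid; the sketch rereads the dataset on every call,
   which is wasteful but keeps the example short */
data f_hat_grid;
  do x = -10 to 30 by 0.1;
    f_hat = f_hat_rect(x, 2.5);
    output;
  end;
run;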
Thanks, guys, I learned a lot from your kind suggestions. In the end, I chose PROC SGPLOT because it lets me plot two or more curves in the same graph based on a grouping variable. Much appreciated.
Also, I will keep reading the book I inserted as screenshots, in order to improve my understanding of mathematical statistics.
Have a great day!
Tom
I am glad that my efforts paid off. As a Chinese saying goes, "teaching others teaches yourself" (教学相长): I also reinforced my own understanding of kernel density estimation, which I learned about two years ago. Please keep raising interesting questions, as this is the key to scientific innovation.
One more piece of information: the BY and CLASS statements in PROC UNIVARIATE also support separate analyses for each of the groups defined by the levels of the variable in the statement.
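For instance, a minimal sketch with a CLASS statement (dataset a and the variables y and group are hypothetical names):
proc univariate data=a;
  class group;                  /* one kernel density estimate per level of group */
  var y;
  histogram / kernel;
run;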
Since you show some interest in kernel density estimation, I searched a few monographs on density estimation for you to reference, if you wish:
Good luck on your study and exploration of statistics!
Assuming this is your dataset with realisations of the random variable X_i:
data data_set_X_i;
  call streaminit(42);
  do _N_ = 1 to 123;
    X_i = RAND("NORMAL", 10, 5);
    output;
  end;
run;
you could calculate f_hat like this (using hash tables; not robust):
data want;
  /* dummy assignment so X_i exists as a numeric variable for the hash key */
  X_i = .;
  declare hash XX;
  h = 2.5;
  n = 123;
  do x = -10 to 30 by 0.1;
    /* load only the observations falling inside [x-h, x+h] into the hash */
    XX = _new_ hash(dataset: cats("data_set_X_i(where=(", x-h, "<=X_i<=", x+h, "))"), multidata: "Y");
    XX.defineKey("X_i");
    XX.defineDone();
    /* rectangular-kernel estimate at x: count of observations in the window over 2*h*n */
    f_hat = XX.num_items / (2*h*n);
    XX.delete();
    output;
  end;
run;
/*
proc print;
run;
*/
proc sgplot data=want;
  series x=x y=f_hat;
run;
or with arrays (faster):
[EDIT: no sorting needed]
data want2;
  array XX[123] _temporary_;
  /* load all X_i values into a temporary array */
  do until(end);
    set data_set_X_i curobs=curobs end=end;
    XX[curobs] = X_i;
  end;
  h = 2.5;
  n = 123;
  do x = -10 to 30 by 0.1;
    /* count the observations inside [x-h, x+h] */
    num_items = 0;
    do _N_ = 1 to n;
      num_items + (x-h <= XX[_N_] <= x+h);
    end;
    f_hat = num_items / (2*h*n);
    output;
  end;
  stop;
run;
proc sgplot data=want2;
  series x=x y=f_hat;
run;
For CALL STREAMINIT(42) and RAND("NORMAL", 10, 5), with h=2.5 and n=123, both approaches produce the following plot:
Bart
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug
"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings
SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation