Overlay chi-squared distribution on histogram (proc univariate)

Posted 11-18-2015 09:12 AM

Suppose I have data that I believe are chi-squared distributed. I want to make a histogram in PROC UNIVARIATE or another procedure, and then overlay the chi-square density.

It seems easy enough, though I am wondering why the chi-square is not an option in Univariate?

Sam_SAS · Posted 11-18-2015 09:32 AM

@AnnaBrown I believe this should be on the SAS Procedures community?

FreelanceReinh · Posted 11-18-2015 12:07 PM

If you mean the (common) chi-squared distribution and not the non-central chi-squared distribution, I think you can make use of the facts that

1. The chi-squared distribution with n degrees of freedom is equal to the gamma distribution with shape parameter n/2 and scale parameter 2. In terms of SAS syntax: pdf('CHISQ', x, df) = pdf('GAMMA', x, df/2, 2).
2. Unlike the chi-squared distribution, the gamma distribution is among the continuous distributions that PROC UNIVARIATE can fit.
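
The identity in the first point can be checked numerically. The following is only a quick sketch: the choice df=2 and the grid of x values are arbitrary assumptions for illustration.

```sas
/* Compare the chi-square density with the equivalent gamma density. */
/* df=2 and the x grid are illustrative choices, not from the thread. */
data _null_;
   df = 2;
   do x = 0.5 to 5 by 0.5;
      chi  = pdf('CHISQ', x, df);        /* chi-square density, df degrees of freedom */
      gam  = pdf('GAMMA', x, df/2, 2);   /* gamma density, shape df/2, scale 2 */
      diff = chi - gam;                  /* should be zero up to rounding error */
      put x= chi= gam= diff=;
   end;
run;
```

The log should show the two density columns agreeing at every x.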

For other density curves, even user-defined ones, there is the article "How to overlay a custom density curve on a histogram in SAS."

Posted 11-18-2015 02:25 PM

Yes, I saw this in a couple of places, including Wicklin's simulation book, but I needed more confirmation to feel totally comfortable. I used the code below, since I had 2 degrees of freedom on the chi^2 distribution. Please let me know if this seems incorrect.

PROC UNIVARIATE DATA=plotdata;
   VAR x;
   /* Gamma with shape alpha=1 and scale sigma=2 is the chi-square density with 2 df */
   HISTOGRAM x / gamma(alpha=1 sigma=2);
RUN;

FreelanceReinh · Posted 11-18-2015 03:55 PM

For a basic plot this should be fine.

You can create a large sample of rand('CHISQ',2) values to see how good the fit is. With these simulated data you can also change the parameter values for alpha and sigma to EST so that SAS estimates the parameters. You will see how close the estimates are to 1 and 2, respectively.
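
A minimal sketch of that check might look like the following. The data set name Sim, the seed, and the sample size are assumptions chosen for illustration.

```sas
/* Simulate a large chi-square(2) sample (name, seed, and size are illustrative) */
data Sim;
   call streaminit(12345);
   do i = 1 to 10000;
      x = rand('CHISQ', 2);
      output;
   end;
   drop i;
run;

proc univariate data=Sim;
   var x;
   /* Fixed parameters: the chi-square(2) curve written as a gamma density */
   histogram x / gamma(alpha=1 sigma=2);
   /* Estimated parameters: EST asks SAS to fit alpha and sigma from the data */
   histogram x / gamma(alpha=est sigma=est);
run;
```

The parameter table for the second histogram should report estimates near alpha=1 and sigma=2.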

Rick_SAS · Posted 11-19-2015 09:07 AM

1. A chi-square distribution with d degrees of freedom is equivalent to a Gamma(d/2, 2) distribution, so, yes, you can use the gamma distribution to overlay a chi-square curve.

2. In general, the way to overlay a known probability density on a sampling distribution (presumably created through Monte Carlo simulation or bootstrapping) is to use the GTL. Since you refer to my book, see pp. 40-41. Also see the article "How to overlay a custom density curve on a histogram in SAS."

3. You asked "why the chi-square is not an option in Univariate." The answer is that UNIVARIATE models data distributions, and real-world data are rarely generated by a process that gives rise to a t, F, or chi-square distribution. Those distributions are used to describe the sampling distribution of statistics. That is, they arise from a theoretical investigation of how a statistic varies across many random samples of data. Consequently, we don't usually fit the parameters in the t, F, and chi-square families. Instead, the parameters (usually called degrees of freedom) are determined by the sample size of the data and are used for inference, such as testing hypotheses, forming confidence intervals, and computing p-values.
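
For the GTL approach in the second point, a minimal sketch might look like this. It is not the code from the book or the article. The names ChiPdf, PlotBoth, ChiSqOverlay, t, and density, as well as the grid range, are assumptions; plotdata and x are taken from the earlier post.

```sas
/* Evaluate the chi-square(2) density on a grid (grid range is an assumption) */
data ChiPdf;
   do t = 0.05 to 15 by 0.05;
      density = pdf('CHISQ', t, 2);
      output;
   end;
run;

/* One-to-one merge (no BY statement): data and curve sit side by side */
data PlotBoth;
   merge plotdata(keep=x) ChiPdf;
run;

proc template;
   define statgraph ChiSqOverlay;
      begingraph;
         layout overlay / xaxisopts=(label="x") yaxisopts=(label="Density");
            histogram x / scale=density;       /* histogram on the density scale */
            seriesplot x=t y=density /         /* theoretical chi-square curve */
                       lineattrs=(thickness=2);
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=PlotBoth template=ChiSqOverlay;
run;
```

Putting the histogram on the density scale is what makes the bars and the curve comparable. Any other density can be substituted by changing the PDF call in the first step.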

Posted 11-19-2015 01:49 PM

Thank you Freelance and Rick!

Yes, my first approach was to simulate these data when I did not see the chi^2 option. I then educated myself further and realized that the chi^2 is a special case of the gamma distribution. This seemed familiar, but I was a little wary of going forward without confirmation.

Funnily enough, Rick, I had the simulation book open to page 39 while writing this reply. I had seen the code you mentioned, but was discouraged when I saw how long it was. I know, I want the flexibility of writing code, but also point-and-click options at times. To put this in perspective, I am using this as a comparison for generated squared Mahalanobis distances, with the aim of examining the data for outliers. I had seen your Do Loop piece on this, which helped my understanding, and I was hoping to use it as a complement to the chi^2 quantile plots.

Question: are the robust Mahalanobis distances more appropriate for data that may be questionably multivariate normal?

Rick_SAS · Posted 11-19-2015 04:11 PM

More appropriate for what? Outlier detection? Since the MV mean and covariance are influenced by outliers, I would say that if your data are MV normal plus contamination, then yes, the robust MD would be a better choice for outlier detection.
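
As a rough illustration of the classical (non-robust) version of this check, the sketch below flags observations whose squared Mahalanobis distance exceeds a chi-square quantile. It assumes a SAS/IML license. The simulated data, the seed, and the 0.975 cutoff are illustrative assumptions, not taken from the thread. A robust variant would replace mean(x) and cov(x) with robust estimates of location and scatter.

```sas
proc iml;
   call randseed(2015);                      /* illustrative seed */
   p     = 3;                                /* number of variables */
   mu    = {0 0 0};
   Sigma = {1   0.5 0.2,
            0.5 1   0.3,
            0.2 0.3 1};
   x = randnormal(500, mu, Sigma);           /* simulated MV normal data */

   md  = mahalanobis(x, mean(x), cov(x));    /* classical Mahalanobis distances */
   md2 = md##2;                              /* squared distances, approx. chi-square(p) */

   cutoff   = quantile('CHISQ', 0.975, p);   /* 97.5th percentile as an assumed cutoff */
   idx      = loc(md2 > cutoff);             /* rows flagged as potential outliers */
   nFlagged = ncol(idx);
   print cutoff nFlagged;
quit;
```

With uncontaminated normal data, roughly 2.5% of the rows should exceed the cutoff.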

Some possibly relevant references:

1. "Detecting outliers in SAS"

2. pp. 7-9 of Wicklin (2010) "Rediscovering SAS/IML Software: Modern Data Analysis for the Practicing Statistician"