Why Distributions are Important
Why are the distributions of your data and your errors so important? Understanding different distributions is key to understanding your data and helping you know how to treat the data and what models to use. SAS Studio lets you graph different probability distribution functions (pdfs). SAS also enables you to quickly create large data sets drawn from different distributions. This post shows you how.
Many models rely on assumptions. For example, ordinary least squares regression makes assumptions, such as:
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
If you violate the assumptions, your results will be incorrect. For example, your confidence limits will be wrong. In real life, data and errors usually stray from the normal distribution. Sometimes a large number of observations (high n) will help you out, and you will be close enough to normal that your results will be adequate and your models will still provide reliable results. However, sometimes your data and errors follow a very different distribution. For example, count data commonly follow a Poisson distribution. Data that have been drawn from two separate populations may follow a bi-modal distribution.
So depending on the distribution you discover, you may select a different model. Or you may wish to transform the data prior to modeling. Understanding the distribution is key to understand your data and selecting your best modeling options.
Common Distributions
Let’s see how to create pdfs and simulate data for each of these four distributions.
Normal (Gaussian)
For a normal distribution, the probability density function forms the bell-shaped curve. The mean, median, and mode are all the same for a standard normal distribution. Skewness and kurtosis would both equal zero for a standard normal distribution. Data (and errors) are rarely perfectly normally distributed, but for a large number of observations, they may approximate a normal distribution.
We can easily plot the probability density function of a normal distribution using the SAS PDF function for the normal distribution. Notice that the mean may be represented, by μ, θ, etc. depending on the source. Likewise, the variance may be represented by σ2 or λ2 etc.
Using the default location parameter mean equal to 0 and the default scale parameter standard deviation equal to 1, we get the graph shown below with the following code. Notice that if your standard deviation equals 1 then your variance also equals 1, because the variance is the square of the standard deviation.
/************************
* NORMAL (GAUSSIAN) PDF *
*************************/
data normalpdf;
do x = -3 to 3 by 0.1;
y = pdf("Normal", x);
output;
end;
run;
proc sgplot data=normalpdf noautolegend;
title "Normal Probability Density Function";
series x=x y=y;
xaxis grid label="x";
yaxis display=(nolabel) values = (0 to .5 by .1);
refline 0 / axis=y;
run;
There are different ways to simulate data from specific distributions using SAS. One way is to use SAS/IML with the RANDGEN subroutine to generate random values. The second, and easier way, is to use a simple DATA step with the RAND function. This is the method we will use here.
The DATA step is excellent for simulating univariate data, but SAS/IML is more powerful for simulating multivariate data.
/****************************************************
* Simulate data from NORMAL (GAUSSIAN) distribution *
*****************************************************/
data normal;
do i = 1 to 100000;
x = rand("Normal");
output;
end;
run;
title "Normal Distribution";
proc print data=normal (obs=10);
run;
proc univariate data=normal;
var x;
histogram;
run;
In most cases , however, we do not have normality. We may have other distributions such as Poisson, negative binomial, Gamma, etc. Let’s look at these three distributions.
Poisson
The Poisson distribution models discrete non-negative numbers, usually count data. Examples include the number of fish caught by a waterman in a day, the number births per year in a city, or the number of Barbie dolls purchased per day in a store. The Poisson distribution has one parameter, the mean number of events, commonly represented as λ. Remember that the Poisson distribution is different from the exponential distribution. The exponential distribution describes waiting time between events, whereas the Poisson distribution describes the number of times the event occurs within a period.
Let’s plot the probability density function of a normal distribution using the SAS PDF function for the normal distribution.
We’ll use this code for the following graph.
/**************
* POISSON PDF *
***************/
%let lambda=3;
data poissonpdf;
do k = 0 to 10;
y = pdf("Poisson", k, &lambda);
output;
end;
run;
proc sgplot data=poissonpdf noautolegend;
title "Poisson Probability Density Function for lambda = &lambda";
series x=k y=y;
xaxis grid label="k";
yaxis display=(nolabel) values = (0 to .4 by .1);
refline 0 / axis=y;
run;
We can use the RAND function to create data from the Poisson distribution.
/******************************************
* Simulate data from POISSON distribution *
******************************************/
%let lambda=2;
data Poisson;
do i = 1 to 100000;
x = rand ('POISSON', &lambda);
output;
end;
run;
proc print data=poisson (obs=10);
title "Poisson with lambda = &lambda";
run;
proc univariate data=poisson;
var x;
histogram;
run;
By simply changing our %let macro, we can change lambda to 4 or 16 or whatever we would like. For example,
%let lambda=4;
Binomial and Negative Binomial
The binomial and negative binomial distributions are for sequential independent Bernoulli trials with a binomial outcome, e.g., success or failure. Both distributions assume that the probability of success p is the same for every trial. The binomial distribution models the number of successes in n trials. The negative binomial distribution models the number of failures before k successes occur.
The classic example is repeated coin tosses. With a fair coin, the probability of heads versus tails is 50/50. That means that p = 0.5. Each toss is independent, so each toss has a 50% chance of heads and a 50% chance of tails.
In our code, we use k instead of n to create the pdf.
/************************
* NEGATIVE BINOMIAL PDF *
*************************/
%let k=5;
%let p=0.5;
data negbinomialpdf;
do m = 0 to 20;
y = pdf("NEGBINOMIAL", m, &p, &k);
output;
end;
run;
proc sgplot data=negbinomialpdf noautolegend;
title "Negative Binomial Probability Density Function (m=&m, p=&p)";
series x=m y=y;
xaxis grid label="m"values = (0 to 20 by 5);
yaxis display=(nolabel) values = (0 to .2 by .05);
refline 0 / axis=y;
run;
And we can again use the RAND function to simulate data, this time from the negative binomial distribution.
/****************************************************
* Simulate data from NEGATIVE BINOMIAL distribution *
*****************************************************/
%let p=0.5;
%let k=10;
data negbinomial;
do i = 1 to 100000;
x = rand ('NEGBINOMIAL', &p, &k);
output;
end;
run;
proc print data=negbinomial (obs=10);
title "Negative Binomial Data with p = &p, k = &k";
run;
proc univariate data=negbinomial;
var x;
histogram;
run;
Gamma
The gamma distribution is a non-negative, continuous, two-parameter probability distribution. The pdf of the gamma distribution using shape and scale parameters is:
The gamma function is:
/************************
* GAMMA PDF *
*************************/
%let a=3;
%let lambda=1;
data gammapdf;
do x = 0 to 15 by 0.01;
y = pdf("GAMMA", x, &a, &lambda);
output;
end;
run;
proc sgplot data=gammapdf noautolegend;
title "Gamma Probability Density Function (a=&a, lambda=&lambda)";
series x=x y=y;
xaxis grid label="k"values = (0 to 15 by 5);
yaxis display=(nolabel) values = (0 to .4 by .1);
refline 0 / axis=y;
run;
/******************************************
* Simulate data from GAMMA distribution *
******************************************/
%let a=2;
%let lambda=2;
data Gamma;
do i = 1 to 100000;
x = rand ('GAMMA', &a, &lambda);
output;
end;
run;
proc print data=poisson (obs=10);
title "Gamma Data with a = &a, lambda = &lambda";
run;
proc univariate data=gamma;
var x;
histogram;
run;
For More Information
Find more articles from SAS Global Enablement and Learning here.
Thank you to Rick Wicklin for the correction to the negative binomial pdf code.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.