BookmarkSubscribeRSS Feed

Use SAS to Quickly Simulate and Graph Data from Different Distributions

Started ‎10-23-2023 by
Modified ‎10-27-2023 by
Views 989

Why Distributions are Important

 

Why are the distributions of your data and your errors so important? Understanding different distributions is key to understanding your data and helping you know how to treat the data and what models to use. SAS Studio lets you graph different probability distribution functions (pdfs). SAS also enables you to quickly create large data sets drawn from different distributions. This post shows you how.

 

Many models rely on assumptions. For example, ordinary least squares regression makes assumptions, such as:

 

  • Normality: Error terms follow a normal distribution, with mean 0
  • IID errors: Error terms are independent and identically distributed
  • Homoscedasticity: Equal variance
  • Linearity (in the parameters)

 

01_BE_image001.png

 Select any image to see a larger version.

Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

If you violate the assumptions, your results will be incorrect. For example, your confidence limits will be wrong. In real life, data and errors usually stray from the normal distribution. Sometimes a large number of observations (high n) will help you out, and you will be close enough to normal that your results will be adequate and your models will still provide reliable results. However, sometimes your data and errors follow a very different distribution. For example, count data commonly follow a Poisson distribution. Data that have been drawn from two separate populations may follow a bi-modal distribution.

 

So depending on the distribution you discover, you may select a different model. Or you may wish to transform the data prior to modeling. Understanding the distribution is key to understand your data and selecting your best modeling options.

 

Common Distributions

 

  • Normal (Gaussian)
  • Poisson
  • Negative Binomial
  • Gamma

 

Let’s see how to create pdfs and simulate data for each of these four distributions.

 

Normal (Gaussian)

 

For a normal distribution, the probability density function forms the bell-shaped curve. The mean, median, and mode are all the same for a standard normal distribution. Skewness and kurtosis would both equal zero for a standard normal distribution. Data (and errors) are rarely perfectly normally distributed, but for a large number of observations, they may approximate a normal distribution.

 

We can easily plot the probability density function of a normal distribution using the SAS PDF function for the normal distribution. Notice that the mean may be represented, by μ, θ, etc. depending on the source. Likewise, the variance may be represented by σ2 or λ2 etc.

 

02_BE_image003.png

 

Using the default location parameter mean equal to 0 and the default scale parameter standard deviation equal to 1, we get the graph shown below with the following code. Notice that if your standard deviation equals 1 then your variance also equals 1, because the variance is the square of the standard deviation.

 

/************************
* NORMAL (GAUSSIAN) PDF *
*************************/

data normalpdf;
do x = -3 to 3 by 0.1;
y = pdf("Normal", x);
output;
end;
run;

proc sgplot data=normalpdf noautolegend;
title "Normal Probability Density Function";
series x=x y=y;
xaxis grid label="x";
yaxis display=(nolabel) values = (0 to .5 by .1);
refline 0 / axis=y;
run;

 

03_BE_image005.png

 

There are different ways to simulate data from specific distributions using SAS. One way is to use SAS/IML with the RANDGEN subroutine to generate random values. The second, and easier way, is to use a simple DATA step with the RAND function. This is the method we will use here.

  • DATA step to simulated data from univariate and uncorrelated multivariate distributions
  • RAND function to generate random values from standard univariate distributions including:

 

The DATA step is excellent for simulating univariate data, but SAS/IML is more powerful for simulating multivariate data.

 


/****************************************************
* Simulate data from NORMAL (GAUSSIAN) distribution *
*****************************************************/

data normal;
do i = 1 to 100000;
x = rand("Normal");
output;
end;
run;

title "Normal Distribution";
proc print data=normal (obs=10);
run;

proc univariate data=normal;
var x;
histogram;
run;

 

04_BE_image007.png

 

In most cases , however, we do not have normality. We may have other distributions such as Poisson, negative binomial, Gamma, etc. Let’s look at these three distributions.

 

04a_BE_image009-300x200.jpg

Poisson

 

The Poisson distribution models discrete non-negative numbers, usually count data. Examples include the number of fish caught by a waterman in a day, the number births per year in a city, or the number of Barbie dolls purchased per day in a store. The Poisson distribution has one parameter, the mean number of events, commonly represented as λ. Remember that the Poisson distribution is different from the exponential distribution. The exponential distribution describes waiting time between events, whereas the Poisson distribution describes the number of times the event occurs within a period.

 

Let’s plot the probability density function of a normal distribution using the SAS PDF function for the normal distribution.

 

05_BE_image011.png

 

We’ll use this code for the following graph.

 

/**************
* POISSON PDF *
***************/

%let lambda=3;

data poissonpdf;
do k = 0 to 10;
y = pdf("Poisson", k, &lambda);
output;
end;
run;

proc sgplot data=poissonpdf noautolegend;
title "Poisson Probability Density Function for lambda = &lambda";
series x=k y=y;
xaxis grid label="k";
yaxis display=(nolabel) values = (0 to .4 by .1);
refline 0 / axis=y;
run;

 

06_BE_image013.png

 

We can use the RAND function to create data from the Poisson distribution.

 

/******************************************
* Simulate data from POISSON distribution *
******************************************/

%let lambda=2;

data Poisson;
do i = 1 to 100000;
x = rand ('POISSON', &lambda);
output;
end;
run;

proc print data=poisson (obs=10);
title "Poisson with lambda = &lambda";
run;

proc univariate data=poisson;
var x;
histogram;
run;

 

07_BE_image015.png

 

By simply changing our %let macro, we can change lambda to 4 or 16 or whatever we would like. For example,

 

%let lambda=4;

 

08_BE_image017.png

 

09_BE_image019.png

 

10_BE_image021.jpg

Binomial and Negative Binomial

 

The binomial and negative binomial distributions are for sequential independent Bernoulli trials with a binomial outcome, e.g., success or failure. Both distributions assume that the probability of success p is the same for every trial. The binomial distribution models the number of successes in n trials. The negative binomial distribution models the number of failures before k successes occur.

 

The classic example is repeated coin tosses. With a fair coin, the probability of heads versus tails is 50/50. That means that p = 0.5. Each toss is independent, so each toss has a 50% chance of heads and a 50% chance of tails.

 

In our code, we use k instead of n to create the pdf.

 

11_BE_image023.png

 

/************************
* NEGATIVE BINOMIAL PDF *
*************************/

%let k=5;
%let p=0.5;

data negbinomialpdf;
do m = 0 to 20;
y = pdf("NEGBINOMIAL", m, &p, &k);
output;
end;
run;

proc sgplot data=negbinomialpdf noautolegend;
title "Negative Binomial Probability Density Function (m=&m, p=&p)";
series x=m y=y;
xaxis grid label="m"values = (0 to 20 by 5);
yaxis display=(nolabel) values = (0 to .2 by .05);
refline 0 / axis=y;
run;

 

imageNegBinomialwithk=5.png

 

 

And we can again use the RAND function to simulate data, this time from the negative binomial distribution.

 

/****************************************************
* Simulate data from NEGATIVE BINOMIAL distribution *
*****************************************************/

%let p=0.5;
%let k=10;

data negbinomial;
do i = 1 to 100000;
x = rand ('NEGBINOMIAL', &p, &k);
output;
end;
run;

proc print data=negbinomial (obs=10);
title "Negative Binomial Data with p = &p, k = &k";
run;

proc univariate data=negbinomial;
var x;
histogram;
run;

 

13_BE_image027.png

 

Gamma

 

The gamma distribution is a non-negative, continuous, two-parameter probability distribution. The pdf of the gamma distribution using shape and scale parameters is:

 

14_BE_image029.png

 

The gamma function is:

 

15_BE_image031.png

 

/************************
* GAMMA PDF *
*************************/

%let a=3;
%let lambda=1;

data gammapdf;
do x = 0 to 15 by 0.01;
y = pdf("GAMMA", x, &a, &lambda);
output;
end;
run;

proc sgplot data=gammapdf noautolegend;
title "Gamma Probability Density Function (a=&a, lambda=&lambda)";
series x=x y=y;
xaxis grid label="k"values = (0 to 15 by 5);
yaxis display=(nolabel) values = (0 to .4 by .1);
refline 0 / axis=y;
run;

 

16_BE_image033.png

 

/******************************************
* Simulate data from GAMMA distribution *
******************************************/

%let a=2;
%let lambda=2;

data Gamma;
do i = 1 to 100000;
x = rand ('GAMMA', &a, &lambda);
output;
end;
run;

proc print data=poisson (obs=10);
title "Gamma Data with a = &a, lambda = &lambda";
run;

proc univariate data=gamma;
var x;
histogram;
run;

 

17_BE_image035.png

 

18_BE_image037.png

 

For More Information

 

Find more articles from SAS Global Enablement and Learning here.

Comments

Thank you to Rick Wicklin for the correction to the negative binomial pdf code.

Version history
Last update:
‎10-27-2023 02:34 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags