Lapis Lazuli | Level 10

## determine the distribution for a sample

I would like to find the distribution that best fits a sample of a variable. The distribution could be normal, gamma, exponential, log-normal, etc. Is there a way to tell SAS to find the best-fitting distribution and report its parameters?

Opal | Level 21

## Re: determine the distribution for a sample

To my knowledge there is no automatic procedure. But you can pit the distributions against each other by fitting them to your data as a mixture with PROC FMM:

``````
proc fmm data=sashelp.heart plots=none componentinfo gconv=0;
   model cholesterol = / dist=normal      label="Normal";
   model cholesterol = / dist=lognormal   label="Lognormal";
   model cholesterol = / dist=gamma       label="Gamma";
   model cholesterol = / dist=exponential label="Exponential";
run;
``````
```
                  Mixing                    Standard
Component    Probability    GLogit(Prob)       Error    z Value    Pr > |z|

        1              0        -5.97E13           0          .           .
        2         0.9897          6.0324      0.4371      13.80      <.0001
        3         0.0079          1.2062      0.4639       2.60      0.0093
        4         0.0024               0
```
PG
SAS Super FREQ

## Re: determine the distribution for a sample

PROC SEVERITY in SAS/ETS fits many distributions and uses statistical criteria (AIC, BIC, etc.) to identify the best-fitting distribution. See the Getting Started example in the SEVERITY documentation.

Lapis Lazuli | Level 10

## Re: determine the distribution for a sample

Thanks, but I get a memory error when I use my data. Also, it does not work for a variable that has negative values.

SAS Super FREQ

## Re: determine the distribution for a sample

Please post the portion of the SAS log that shows the error.

Lapis Lazuli | Level 10

## Re: determine the distribution for a sample

Error 1:

``````
355  proc severity data=sample2 crit=aicc;
NOTE: Writing HTML Body file: sashtml.htm
356     loss indicativefee_mean;
357     dist _predefined_;
358  run;

ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: GC overhead limit
exceeded.
ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: GC overhead limit
exceeded.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1972257 observations read from the data set
WORK._DOCTMP000000000000000000001.
NOTE: PROCEDURE SEVERITY used (Total process time):
real time           1:26.26
cpu time            1:02.89
``````

Error 2:

``````
360  proc severity data=sample2 crit=aicc;
361     loss lnindicativefee;
362     dist _predefined_;
363  run;

WARNING: For at least one observation, variable lnindicativefee has a negative value.
Ignoring such observations.
WARNING: No valid observations found.
NOTE: PROCEDURE SEVERITY used (Total process time):
real time           0.58 seconds
cpu time            0.51 seconds
``````

SAS Super FREQ

## Re: determine the distribution for a sample

I do not know what is causing the Java error, but try using PLOTS=NONE to suppress plots.

Regarding the WARNINGS,

``````
WARNING: For at least one observation, variable lnindicativefee has a negative value.
Ignoring such observations.
WARNING: No valid observations found.
``````

The warning says that all of the observations are invalid for one of the distributions that you are fitting. Instead of using the _PREDEFINED_ keyword, specify the distributions individually (for example, DIST EXP for the exponential distribution). That restricts the procedure to only the distributions of interest. You can also use PRINT=ALL to get more information about each fit.

Remember that several of these distributions have restrictions on the value of the observations. For example, negative values are invalid for the exponential distribution. Similar restrictions apply for the lognormal and gamma distributions.
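
A minimal sketch of those suggestions, reusing the data set and variable names from the log posted above. The keywords EXP, GAMMA, LOGN, and WEIBULL are the predefined distribution names as I understand them from the SAS/ETS documentation, so check them against your release. Whether PLOTS=NONE actually avoids the Java error is not confirmed here.

``````
/* Fit a short list of named distributions instead of _PREDEFINED_. */
/* PLOTS=NONE suppresses graphics; PRINT=ALL shows details for each fit. */
proc severity data=sample2 crit=aicc plots=none print=all;
   loss indicativefee_mean;        /* values must be positive for these distributions */
   dist exp gamma logn weibull;    /* assumed keyword spellings; see the DIST statement documentation */
run;
``````

With CRIT=AICC, the model selection table should flag the distribution that has the smallest corrected AIC among those listed. A variable such as lnindicativefee, which takes negative values, is outside the support of all four of these distributions, so it would still be rejected.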

SAS Super FREQ

## Re: determine the distribution for a sample

The simpler, nonmodeling approach is using PROC UNIVARIATE. See this note on distribution testing and parameter estimation.
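
As a rough sketch of what that could look like, here is the HISTOGRAM statement applied to the SASHELP.HEART example used earlier in this thread (an assumption for illustration; substitute your own data set and variable):

``````
/* Fit several candidate distributions to one variable. */
/* Parameters are estimated by the procedure; none are supplied by hand. */
proc univariate data=sashelp.heart;
   var cholesterol;
   histogram cholesterol / normal lognormal gamma exponential;
   /* Keep only the estimates and the goodness-of-fit tests in the output */
   ods select ParameterEstimates GoodnessOfFit;
run;
``````

Each fitted distribution gets its own table of parameter estimates and its own goodness-of-fit tests, so the comparison is between fits rather than between hand-picked parameter values. By default the lognormal, gamma, and exponential fits use a threshold of zero, which means a variable with negative values would need either the normal fit alone or an estimated threshold, for example LOGNORMAL(THETA=EST).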

Lapis Lazuli | Level 10

## Re: determine the distribution for a sample

Does this mean I have to try the parameters to see which one fits best?
