somebody
Lapis Lazuli | Level 10

I would like to find the distribution that best fits a sample of a variable. The distribution could be normal, gamma, exponential, lognormal, etc. Is there a way to tell SAS to find the best-fitting distribution and provide its parameters?

 

8 REPLIES
PGStats
Opal | Level 21

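To my knowledge, there is no automatic procedure. But you can pit the distributions against each other by fitting them to your data as a mixture with PROC FMM. The component with the largest estimated mixing probability is the one the data favor; in the output below, that is component 2 (lognormal), with a mixing probability of 0.9897.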

 

/* Each MODEL statement adds one mixture component with its own distribution */
proc fmm data=sashelp.heart plots=none componentinfo gconv=0;
   model cholesterol = / dist=normal      label="Normal";
   model cholesterol = / dist=lognormal   label="Lognormal";
   model cholesterol = / dist=gamma       label="Gamma";
   model cholesterol = / dist=exponential label="Exponential";
run;

                      Mixing                    Standard
    Component    Probability    GLogit(Prob)       Error    z Value    Pr > |z|

            1              0        -5.97E13           0        .         .
            2         0.9897          6.0324      0.4371      13.80      <.0001
            3         0.0079          1.2062      0.4639       2.60      0.0093
            4         0.0024               0
PG
Rick_SAS
SAS Super FREQ

PROC SEVERITY in SAS/ETS fits many distributions and uses statistical criteria (AIC, BIC, etc.) to identify the best-fitting distribution. See the Getting Started example in the SEVERITY documentation.

somebody
Lapis Lazuli | Level 10

Thanks, but I got a memory error when I used my data. Also, it does not work with a variable that has negative values.

Rick_SAS
SAS Super FREQ

Please post the portion of the SAS log that shows the error. 

somebody
Lapis Lazuli | Level 10

Error 1:

355      proc severity data=sample2 crit=aicc;
NOTE: Writing HTML Body file: sashtml.htm
356     loss indicativefee_mean;
357     dist _predefined_;
358  run;

ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: GC overhead limit
       exceeded.
ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: GC overhead limit
       exceeded.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1972257 observations read from the data set
      WORK._DOCTMP000000000000000000001.
NOTE: PROCEDURE SEVERITY used (Total process time):
      real time           1:26.26
      cpu time            1:02.89

Error 2: 

360 proc severity data=sample2 crit=aicc;
361 loss lnindicativefee;
362 dist _predefined_;
363 run;

WARNING: For at least one observation, variable lnindicativefee has a negative value.
Ignoring such observations.
WARNING: No valid observations found.
NOTE: PROCEDURE SEVERITY used (Total process time):
real time 0.58 seconds
cpu time 0.51 seconds

 

Rick_SAS
SAS Super FREQ

I do not know what is causing the Java error, but try using PLOTS=NONE to suppress plots.

 

Regarding the warnings:

WARNING: For at least one observation, variable lnindicativefee has a negative value.
Ignoring such observations.
WARNING: No valid observations found.

The warning says that all of the observations are invalid for one of the distributions that you are fitting. Instead of using the _PREDEFINED_ keyword, specify the distributions individually (for example, DIST EXP for the exponential distribution). That will restrict the procedure to only the distributions of interest. You can also use PRINT=ALL to find out more information about each fit.

 

Remember that several of these distributions have restrictions on the value of the observations. For example, negative values are invalid for the exponential distribution. Similar restrictions apply for the lognormal and gamma distributions.
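
As a rough sketch of the suggestions above, the step below reuses the data set and variable names from the posted log (sample2, indicativefee_mean), suppresses plots, and lists the candidate distributions explicitly. It assumes EXP, GAMMA, and LOGN are the predefined PROC SEVERITY names for the exponential, gamma, and lognormal distributions; check the names against the documentation for your release.

```sas
/* Sketch: fit only the distributions of interest, with plots turned off.   */
/* EXP, GAMMA, and LOGN are assumed to be the predefined distribution names. */
proc severity data=sample2 crit=aicc plots=none print=all;
   loss indicativefee_mean;   /* variable from the posted log; values must be positive */
   dist exp gamma logn;       /* exponential, gamma, lognormal */
run;
```

This does not help with a variable such as lnindicativefee, which takes negative values: the distributions listed here are defined only for positive values, so a distribution such as the normal would have to be fit with a different procedure.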

StatDave
SAS Super FREQ

A simpler, nonmodeling approach is to use PROC UNIVARIATE. See this note on distribution testing and parameter estimation.
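
A minimal sketch of that approach, using the sashelp.heart data from earlier in the thread as a stand-in for your own data: each distribution named on the HISTOGRAM statement is fit to the variable, and the procedure estimates the parameters itself, so you do not need to supply trial values. The ODS SELECT line is an optional addition that keeps only the parameter estimates and goodness-of-fit tables.

```sas
/* Sketch: fit four candidate distributions and compare the fit statistics */
ods select ParameterEstimates GoodnessOfFit;
proc univariate data=sashelp.heart;
   var cholesterol;
   /* Each keyword requests a fitted density; parameters are estimated from the data */
   histogram cholesterol / normal lognormal gamma exponential;
run;
```

By default the lognormal, gamma, and exponential fits assume a threshold of zero, so they apply only to positive data. Compare the goodness-of-fit tables across the distributions to judge which one describes the sample best.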

somebody
Lapis Lazuli | Level 10

Does this mean I have to try different parameters to see which one fits best?
