Solved: How to fit my data to Gamma, Weibull and Lognormal distributions?

t75wez1 · Posted 11-02-2021 05:22 PM

Hello,

I want to fit Gamma, Weibull and Lognormal distributions to my SAS data set named "customers" below.
The data set includes two variables: day, and the number of customers who made a purchase on that day.
The attached image shows the kind of result I'm seeking.

Could you enlighten me?

I have Base SAS 9.4M7 with SAS/ETS, so I can run PROC MODEL or PROC FMM if needed.

Please let me know if I need to provide additional information.

Thanks very much for your help and insights in advance.

Regards,
Ethan

data customers;
input day count;
label day = 'Day'
count = '# of customers'
;
datalines;
1 3
2 2
3 1
4 19
5 138
6 116
7 62
8 30
9 49
10 32
11 30
12 30
13 12
14 10
15 7
16 12
17 9
18 15
19 5
20 12
21 4
22 9
23 10
24 2
25 5
26 8
27 4
28 6
29 3
30 7
31 4
;
run;

StatDave · Posted 11-21-2021 05:17 PM

Yeah, I guess that'll work for adding one other set of data.

ballardw · Posted 11-02-2021 06:40 PM

Like this?

proc univariate data=customers;
   var day;
   freq count;
   histogram / gamma weibull lognormal;
run;

There is a very large economy sized hint in that output you attached that says "Proc Univariate". The Histogram statement of Proc Univariate is one of the basic tools for examining distributions.

The outcome you show is demonstrating distribution, not "fitting". Fit usually means to transform to match some distribution.

t75wez1 · Posted 11-02-2021 08:15 PM

Thanks for your reply along with the example.

You're correct.

I'm seeking

1) To identify which of the Gamma, Weibull and Lognormal distributions best matches the distribution of the variable "count", with day as the X variable.

2) To use "proc sgplot" to overlay the "count" bars with the results from 1).

Thanks for your help.

Ethan

StatDave · Posted 11-02-2021 08:55 PM

Your response data in the COUNT variable is, as named, discrete count data and so is not continuous. So, it would not strictly be correct to use continuous distributions like those you mentioned. Distributions that are appropriate for discrete, count responses are the Poisson (if mean and variance are equal), negative binomial, generalized Poisson, or Conway-Maxwell Poisson. Nevertheless, you can assess the fit of various continuous distributions using PROC SEVERITY. See the example in the Getting Started section in the SEVERITY documentation. Also see this note about distribution fitting. These statements fit the continuous distributions you mentioned and others:

proc severity data=customers crit=aicc;
   loss count;
   dist _predefined_;
run;

These statements fit the discrete Poisson, negative binomial, generalized Poisson, and Conway-Maxwell Poisson distributions appropriate for count data:

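/* Poisson */
proc genmod data=customers;
   model count= / dist=p;
run;
/* negative binomial */
proc genmod data=customers;
   model count= / dist=negbin;
run;
/* generalized Poisson */
proc fmm data=customers;
   model count= / dist=genpoisson;
run;
/* Conway-Maxwell Poisson */
proc countreg data=customers;
   model count= / dist=compoisson;
run;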
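
As a quick check of the Poisson condition mentioned above (mean equal to variance), a minimal sketch along these lines compares the two sample moments of COUNT. It assumes the customers data set from the original post. A variance far above the mean would suggest overdispersion, in which case the negative binomial or generalized Poisson fits are the more plausible candidates.

```sas
/* Compare the sample mean and variance of the counts */
proc means data=customers n mean var;
   var count;
run;
```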

t75wez1 · Posted 11-03-2021 05:22 PM

Thank you! Very helpful.

Instead of fitting a discrete distribution to the COUNT variable, I can transform it to "percent_count", a continuous variable, and fit a continuous distribution.

Below is the code I use to define the "percent_count" variable, which I can then use in PROC SEVERITY.

Ethan

proc sql noprint;
   select sum(count) into :total from customers;
quit;
%put &total.;
data customers;
   set customers;
   percent_count = round(100*count/&total., 0.01);
run;

SteveDenham · Posted 11-05-2021 08:53 AM

Just be sure to use dist=binary and do not multiply by 100, as you need the variable to be in the closed interval [0,1].
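
A minimal sketch of the rescaling described here, assuming the customers data set from the original post: divide by the total without multiplying by 100, so the new variable stays in [0,1]. The names total, customers_prop, and prop_count are illustrative only, and the sketch covers just the rescaling step, not the model fit.

```sas
/* Store the month's total count in a macro variable (name is illustrative) */
proc sql noprint;
   select sum(count) into :total trimmed from customers;
quit;

/* Proportion of the month's total on each day; values fall in [0,1] */
data customers_prop;
   set customers;
   prop_count = count / &total;
run;
```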

SteveDenham

t75wez1 · Posted 11-08-2021 09:59 AM

Hi Steve,

Many thanks. It makes sense to me. But I don't know exactly how to implement it into SAS code.

Could you give the example to enlighten me?

Ethan

SteveDenham · Posted 11-08-2021 01:01 PM

See @StatDave's reply. You can make it binomial like I offered, but with only 31 obs, none of the count/discrete distributions mentioned will converge in distribution to a binomial distribution, so I wouldn't go that way.

SteveDenham

StatDave · Posted 11-08-2021 12:56 PM

I don't think dividing by the total count (which is also a random variable) changes the fact that the observed values are counts. A plot analogous to your plot of the histogram of the counts with overlaid theoretical continuous distributions is a plot of the count histogram overlaid with a histogram of a theoretical discrete distribution. Leveraging the methods shown in this note to estimate the parameters of a distribution and generate random values, this code fits a negative binomial distribution to the data, estimates its parameters, and plots a histogram of that theoretical distribution.

   proc genmod data=customers;
      model count = / dist=negbin;
      ods output parameterestimates=pe;
      run;
   proc transpose data=pe out=tpe;
      var estimate;
      id parameter;
      run;
   data tpe;
      set;
      NB_k = 1/dispersion;
      NB_p = 1/(1+exp(intercept)*dispersion);
      do i=1 to 1000; 
        RandomY = rand("negbinomial",nb_p,nb_k);
        output;
      end;
      run;
   data a; merge customers tpe; run;
   proc sgplot;
      histogram count / binstart=12.5 binwidth=25; 
      histogram randomy / binstart=12.5 binwidth=25 transparency=.5;
      xaxis max=200;
      run;

Similarly for the generalized Poisson:

   proc fmm data=customers;
      model count= / dist=genpoisson;
      ods output parameterestimates=pe;
      run;
   proc transpose data=pe out=tpe;
      id parameter; var estimate;
      run;
   data genpoi;
      set tpe;
      theta = exp(intercept)*exp(-scale); 
      eta = 1-exp(-scale);
      do i=1 to 1000;
        RanGP = rand("genpoisson",theta,eta);
        output;
      end;
      run;
   data a; merge customers genpoi; run;
   proc sgplot;
    histogram count / binstart=12.5 binwidth=25; 
    histogram rangp / binstart=12.5 binwidth=25 transparency=.5;
    xaxis max=200;
    run;

31 observations is a small number for evaluating the fits in these plots, but they both look reasonably good.

t75wez1 · Posted 11-18-2021 08:59 PM

Hi StatDave_SAS,

Thanks so much for your great help.

But my goal is to identify a probability model for the counts that produces something similar to the plot below. If a regression model for count data fits well, I should be able to say that the number of customers on days 1, 2, ..., 31 equals the corresponding predicted values from the Generalized Poisson model. I don't see how I can accomplish that goal with the results created by your code.

Could you enlighten me here?

Thanks,

StatDave · Posted 11-18-2021 11:22 PM

You can certainly fit a model, using a discrete distribution, to count data and get predicted values. But your questions so far have suggested that all of the data, taken across all of the days, come from one single distribution. My previous code assumes that and estimates that single distribution. If you are now saying that each day represents a separate distribution (of the same theoretical type but with possibly different parameters), and if you want to use DAY as a categorical predictor in the model, then you would want more than one single data value in each day in order to avoid saturating the model resulting in perfect prediction. However, if you are willing to restrict the association of DAY on the mean (as linear, quadratic, ...) so that the model is not saturated then you could fit any of various types of models using various distributions. For example, this fits a negative binomial model to the data using a lower-order spline of DAY on the counts.

proc gampl data=customers; 
model count=spline(day) / dist=negbin;
id _all_;
output out=preds pred=p;
run;
proc sgplot;
scatter y=count x=day;
pbspline y=p x=day / nomarkers;
run;

t75wez1 · Posted 11-19-2021 02:22 PM

Truly appreciative of your guidance.

The count data fit better than in my previous result when I adopt your code below with a minor change: using the POISSON distribution instead.

The data set named "customers" is for just one particular month. I have 5 different data sets for 5 different months, with DAY as a categorical predictor (e.g., 1, 2, 3, ..., 30).

But I don't know how to "avoid saturating the model resulting in perfect prediction" by leveraging those 5 different data sets.

Could you shed some light here?

proc gampl data=customers;
model count=spline(day) / dist=POISSON;
id _all_;
output out=preds pred=predicted;
run;

proc sgplot data=preds;
vbarparm category=day response=count /
barwidth=0.7
baselineattrs=(thickness=0);
series y=predicted x=day /markers
markerattrs=(size=8pt color=red);
run;

StatDave · Posted 11-20-2021 03:55 PM

Saturation and perfect prediction result from the model having as many parameters as there are data points. With the original data, that would happen if you used DAY in the CLASS and MODEL statements in, say, PROC GENMOD. It's avoided in any way that results in a model that has fewer parameters. Now, how to deal with that additional data again depends on what you want to assume. If you assume that the counts on a given day, regardless of the month, all come from the same distribution and different days can have different distributions of the same type but with differing parameters, then you can just concatenate the data from the other months to the original data and run the same code to fit and plot the model. Be sure to use the same day numbers (1, 2, ... , 31) for the data from each month. But I think the scatter plot for the observed values is then better than a bar plot so that you can see the variability.
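
To see the saturation described above, a sketch like the following puts DAY on both the CLASS and MODEL statements. It assumes the single-month customers data set, and the output names sat and p are illustrative. With one observation per day, the model has 31 parameters for 31 data points, so each predicted value should reproduce its observed count.

```sas
/* Saturated model: DAY as a categorical predictor, one obs per level */
proc genmod data=customers;
   class day;
   model count = day / dist=poisson;
   output out=sat pred=p;   /* sat and p are illustrative names */
run;

/* Predicted values are expected to match the observed counts */
proc print data=sat;
   var day count p;
run;
```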

t75wez1 · Posted 11-21-2021 02:24 PM

After concatenating the data from the other four months to the original data and running the same code to fit and plot the model, I'll get five different models. If I assume that the counts on a given day, regardless of the month, all come from the same distribution, should I 'average' the predicted results of the five models as the final outcome?

StatDave · Posted 11-21-2021 04:34 PM

No, you won't get five models as long as you don't include a BY statement with a variable indicating the month. Without a BY statement, you will get one model that uses all of the data from all of the months. You just now have more than one observation for each day number. There is no need to do any averaging of the data across the months. Note that I said "concatenate", not "merge". That means that the data set will have many more observations and the COUNT variable will now have many more values for all of the monthly data. You should not have multiple count variables for the several months.
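
A sketch of the concatenation described above might look like this. The data set names month2 through month5 and all_months are hypothetical, and each monthly data set is assumed to have the same DAY and COUNT variables as customers. The model and output steps follow the earlier PROC GAMPL example, and the sort is added only so the predicted line is drawn in day order.

```sas
/* Stack the months: one COUNT variable, many more rows (not a merge) */
data all_months;
   set customers month2 month3 month4 month5;   /* hypothetical names */
run;

/* One model for all months: no BY statement */
proc gampl data=all_months;
   model count = spline(day) / dist=negbin;
   id _all_;
   output out=preds pred=p;
run;

proc sort data=preds;
   by day;
run;

/* Scatter of observed counts shows the day-to-day variability */
proc sgplot data=preds;
   scatter y=count x=day;
   series  y=p     x=day;
run;
```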
