Hi,
I am unsure whether HPGENSELECT can be applied when the target is continuous and follows a beta distribution. I do not want to use beta regression; is there another approach that works if HPGENSELECT does not?
Kind Regards
SK
Unfortunately, PROC HPGENSELECT cannot handle the beta distribution. As an approximation, you could use PROC GLMSELECT, with a WEIGHT statement to account for unequal variances in Y.
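A minimal sketch of that approximation, assuming a hypothetical data set named have with a response y in (0,1) and candidate predictors x1-x10. The data set, the variable names, the simulated coefficients and the choice of stepwise selection with SBC are all illustrative assumptions, not something taken from the original question.

```sas
/* Hypothetical data: y is a continuous proportion, x1-x10 are candidate predictors */
data have;
   call streaminit(2024);
   array x{10} x1-x10;
   do i = 1 to 500;
      do j = 1 to 10;
         x{j} = rand('normal');
      end;
      /* only x1 and x2 truly affect the mean in this simulation */
      mu = 1 / (1 + exp(-(0.3 + 0.8*x1 - 0.5*x2)));
      y  = rand('beta', 10*mu, 10*(1 - mu));
      output;
   end;
   drop i j mu;
run;

/* Normal-theory variable selection as an approximation to a beta model.   */
/* If a variance-based weight variable is available, a WEIGHT statement    */
/* naming it can be added inside the procedure.                            */
proc glmselect data=have;
   model y = x1-x10 / selection=stepwise(select=sbc);
run;
```

The selected effects could then be refit with a procedure that supports the beta distribution directly.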
Or you can use PROC HPNLMOD. The beta distribution is quite simple, so you can specify the log likelihood yourself inside HPNLMOD and use the "general" likelihood in the MODEL statement.
Here is a simple example of how to find the maximum likelihood estimates of the two parameters when all observations are beta-distributed with the same parameters. I think the example can easily be extended to situations where there are covariates in the data.
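For reference, the formulas that the program below relies on can be written out as follows. This is only a summary in my own notation: m_1 and m_2 denote the sample means of y and y^2 (the variables y and sqy in the code), and B(a,b) is the beta function.

```latex
% Per-observation log likelihood of the beta(a, b) density
\log f(y; a, b) = (a-1)\log y + (b-1)\log(1-y) - \log B(a,b), \qquad 0 < y < 1

% First two moments of the beta distribution
E[Y] = \frac{a}{a+b}, \qquad E[Y^2] = \frac{a(a+1)}{(a+b)(a+b+1)}

% Solving for a and b and plugging in the sample moments m_1, m_2
\hat{a} = \frac{m_1 (m_1 - m_2)}{m_2 - m_1^2}, \qquad
\hat{b} = \frac{(1 - m_1)(m_1 - m_2)}{m_2 - m_1^2}
```

As a check, the true values a=2 and b=3 used in the simulation give E[Y]=0.4 and E[Y^2]=0.2, and the two expressions return 0.08/0.04=2 and 0.12/0.04=3.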
data simulation;
   do i=1 to 1000;
      y=rand('beta',2,3);
      sqy=y**2;
      output;
   end;
run;
*Start values are found by the method of moments, so the means of y and y^2 are calculated.;
proc means data=simulation mean ;
   var y sqy;
   output out=startvalues mean=y sqy;
run;
data _NULL_;
   set startvalues;
   a=y*(y-sqy)/(sqy-y**2);
   b=(y-1)*(sqy-y)/(sqy-y**2);
   put a= b=;
   call symput('starta',put(a,best.));
   call symput('startb',put(b,best.));
run;
*Here the maximum likelihood estimates are found;
*The moment estimators from above are used as starting values;
proc hpnlmod data=simulation;
   parm a &starta. b &startb.;
   ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
   model i~general(ll);
run;
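One possible way to extend the example to covariates is sketched below. It assumes a single hypothetical covariate x and reparameterizes the beta distribution through a mean mu (linked to x on the logit scale) and a precision phi, so that a=mu*phi and b=(1-mu)*phi. The data set name simcov, the parameter names b0, b1 and logphi, the true values in the simulation and the zero starting values are all my own illustrative choices.

```sas
/* Simulated data with one hypothetical covariate x */
data simcov;
   call streaminit(12345);
   do i = 1 to 1000;
      x  = rand('normal');
      mu = 1 / (1 + exp(-(-0.5 + 0.5*x)));   /* true mean, logit link  */
      y  = rand('beta', mu*10, (1 - mu)*10);  /* true precision phi=10  */
      output;
   end;
   keep x y;
run;

proc hpnlmod data=simcov;
   parms b0 0 b1 0 logphi 0;
   mu  = 1 / (1 + exp(-(b0 + b1*x)));   /* mean depends on the covariate */
   phi = exp(logphi);                   /* keeps the precision positive  */
   a   = mu*phi;
   b   = (1 - mu)*phi;
   ll  = (a - 1)*log(y) + (b - 1)*log(1 - y) - logbeta(a, b);
   model y ~ general(ll);
run;
```

Working with logphi instead of phi keeps a and b positive without needing bounds on the parameters.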
I like JacobSimonsen's approach.
@JacobSimonsen, could you share why you decided to go with PROC HPNLMOD? I would have chosen PROC NLMIXED, like this:
proc nlmixed data=simulation;
   parms a &starta. b &startb.;
   bounds 0 < a,b;
   ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
   model y ~ general(ll);
run;
@Siddharth123, if you want to see additional examples formulating models as MLE problems and using SAS procedures (such as NLMIXED) to solve, see
My simple rule of thumb for choosing between PROC HPNLMOD and PROC NLMIXED is that if I have random effects I use NLMIXED, and otherwise HPNLMOD. That is simply because HPNLMOD is generally faster. In this case I have no strong opinion about which of these two procedures should be used. Why would you choose NLMIXED?
I agree that it is wise to include the BOUNDS statement.
I find it a bit funny that when the "general" likelihood is used, it doesn't matter which variable is on the left side of "~". Both NLMIXED and HPNLMOD require a variable there.
You can fit a beta model using PROC GLIMMIX or PROC FMM. See the DIST=BETA option in the MODEL statement. See this example of using the beta distribution in GLIMMIX to model a continuous proportion response.
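A minimal sketch of both calls, assuming a hypothetical data set props with a continuous proportion y strictly between 0 and 1 and a single predictor x. The data set, variable names and simulated values are illustrative only.

```sas
/* Hypothetical example data: y is a proportion in (0,1), x is a predictor */
data props;
   call streaminit(1);
   do i = 1 to 500;
      x  = rand('normal');
      mu = 1 / (1 + exp(-(0.2 + 0.5*x)));
      y  = rand('beta', 10*mu, 10*(1 - mu));
      output;
   end;
   keep x y;
run;

/* Beta model with a logit link in GLIMMIX */
proc glimmix data=props;
   model y = x / dist=beta link=logit solution;
run;

/* The same response distribution requested in FMM */
proc fmm data=props;
   model y = x / dist=beta;
run;
```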
As others have correctly pointed out, there are a few ways to fit models to data with a beta distribution. GLIMMIX is the easiest way. However, since the original question dealt with HPGENSELECT, one would assume that they were trying to do variable selection from a large number of potential predictor variables. That cannot be done in an automated way with GLIMMIX or NLMIXED.
One should always be careful with the beta distribution: it is defined for 0 < y < 1. This means that all values of y equal to 0 or 1 will become missing values in GLIMMIX. My experience is that datasets with continuous proportions usually have 0s and 1s.
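Before fitting, it may be worth counting how many responses sit exactly on the boundary. The sketch below assumes a hypothetical data set mydata with the response in a variable y; the handful of values read in with DATALINES is made up purely for illustration.

```sas
/* Hypothetical data containing exact 0s and 1s and one missing value */
data mydata;
   input y @@;
   datalines;
0 0.12 0.5 1 0.87 . 0.33 1
;
run;

/* Count boundary values, which the beta distribution cannot accommodate */
proc sql;
   select sum(y = 0)      as n_zero,
          sum(y = 1)      as n_one,
          sum(missing(y)) as n_missing,
          count(*)        as n_total
   from mydata;
quit;
```

The comparisons use equality rather than y <= 0 because SAS treats a missing value as smaller than any number, so an inequality would count missing values as zeros.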