Programming the statistical procedures from SAS

hpgenselect for continuous target variable

Contributor
Posts: 57

hpgenselect for continuous target variable

Hi,

 

I am unsure whether hpgenselect can be applied when the target is continuous and follows a beta distribution. I do not want to use Beta Regression. If hpgenselect cannot be used, is there another approach that works?

 

Kind Regards

SK

Valued Guide
Posts: 684

Re: hpgenselect for continuous target variable

Posted in reply to Siddharth123

Unfortunately, this procedure cannot handle the beta distribution. As an approximation, you could use PROC GLMSELECT. You could use the weight statement to account for unequal variances for Y.
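A minimal sketch of the GLMSELECT suggestion, under stated assumptions: the data set sim_sel, the predictors x1-x5 and the weight variable w are all hypothetical names made up for illustration. The weight here is an inverse-variance weight that is only available because the data are simulated and the true mean is known; with real data the weights would have to be estimated some other way.

```sas
/* Hypothetical simulated data: a proportion y driven by x1 and x2 only */
data sim_sel;
  call streaminit(2024);
  array x{5} x1-x5;
  do i=1 to 500;
    do j=1 to 5;
      x{j}=rand('normal');
    end;
    mu=1/(1+exp(-(0.3+0.8*x1-0.5*x2)));   /* true mean, known only in a simulation */
    y=rand('beta',mu*10,(1-mu)*10);
    w=1/(mu*(1-mu));                      /* assumed inverse-variance weight */
    output;
  end;
  drop j;
run;

/* Linear-model approximation with variable selection and weights */
proc glmselect data=sim_sel;
  model y = x1-x5 / selection=stepwise;
  weight w;
run;
```

This treats y as approximately normal, so it is only an approximation to a beta model, as noted above.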

Super Contributor
Posts: 298

Re: hpgenselect for continuous target variable

Posted in reply to Siddharth123

Or you can use proc hpnlmod. The beta distribution is quite simple, so you can specify the likelihood inside hpnlmod, and use the "general" likelihood in the model statement.

Super Contributor
Posts: 298

Re: hpgenselect for continuous target variable

Posted in reply to JacobSimonsen

Here is a simple example of how you can find the maximum likelihood estimates of the two parameters when all observations are beta-distributed with the same parameters. I think the example can easily be extended to situations where there are covariates in the data.

data simulation;
  do i=1 to 1000;
    y=rand('beta',2,3);
    sqy=y**2;
    output;
  end;
run;

*start values are found by the moment method. Therefore, mean of y and y^2 are calculated.;
proc means data=simulation mean ;
  var y sqy;
  output out=startvalues mean=y sqy;
run;

data _NULL_;
  set startvalues;
  a=y*(y-sqy)/(sqy-y**2);
  b=(y-1)*(sqy-y)/(sqy-y**2);
  put a= b=;
  call symput('starta',put(a,best.));
  call symput('startb',put(b,best.));
run;
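
For reference, the start values a and b in the data step above follow from matching the first two moments of the beta distribution. Writing m_1 for the mean of y and m_2 for the mean of y^2 (these two symbols are introduced here only for the derivation):

```latex
\begin{align*}
m_1 &= E[Y] = \frac{a}{a+b}, \qquad
m_2 - m_1^2 = \operatorname{Var}(Y) = \frac{m_1(1-m_1)}{a+b+1} \\
a+b &= \frac{m_1(1-m_1)}{m_2-m_1^2} - 1 = \frac{m_1-m_2}{m_2-m_1^2} \\
a &= \frac{m_1\,(m_1-m_2)}{m_2-m_1^2}, \qquad
b = \frac{(1-m_1)(m_1-m_2)}{m_2-m_1^2} = \frac{(m_1-1)(m_2-m_1)}{m_2-m_1^2}
\end{align*}
```

With y in place of m_1 and sqy in place of m_2, these are the two assignment statements in the data step.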

*here the likelihood estimates will be found; 
*The moment estimators from above are used as starting values;

proc hpnlmod data=simulation;
  parms a &starta. b &startb.;
  ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
  model i~general(ll);
run;
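
A rough sketch of how the extension to covariates could look, assuming a mean/precision parameterization with a logit link for the mean. The data set sim_cov, the covariate x and the parameters b0, b1 and logphi are hypothetical names chosen for illustration, and the starting values are plain guesses rather than moment estimates.

```sas
/* Hypothetical simulated data: the mean of y depends on x through a logit link */
data sim_cov;
  call streaminit(1234);
  do i=1 to 1000;
    x=rand('normal');
    mu_true=1/(1+exp(-(0.5+1.0*x)));
    y=rand('beta',mu_true*10,(1-mu_true)*10);
    output;
  end;
  drop mu_true;
run;

proc hpnlmod data=sim_cov;
  parms b0=0 b1=0 logphi=1;
  mu=1/(1+exp(-(b0+b1*x)));   /* mean of y, kept inside (0,1) */
  phi=exp(logphi);            /* precision, kept positive */
  a=mu*phi;
  b=(1-mu)*phi;
  ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
  model y~general(ll);
run;
```

Estimating the precision on the log scale is just one way to keep it positive without a separate constraint.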

SAS Super FREQ
Posts: 3,753

Re: hpgenselect for continuous target variable

Posted in reply to JacobSimonsen

I like JacobSimonsen's approach.

 

@JacobSimonsen, could you share why you decided to go with PROC HPNLMOD?  I would have chosen PROC NLMIXED, like this:

 

proc nlmixed data=simulation;
  parms a &starta. b &startb.;
  bounds 0 < a,b;
  ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
  model y ~ general(ll);
run;

@Siddharth123, if you want to see additional examples formulating models as MLE problems and using SAS procedures (such as NLMIXED) to solve, see

Super Contributor
Posts: 298

Re: hpgenselect for continuous target variable

My simple rule of thumb for choosing between PROC HPNLMOD and PROC NLMIXED is that if I have random effects I use NLMIXED, and otherwise HPNLMOD. That is simply because HPNLMOD is in general faster. In this case I have no strong opinion about which of the two procedures should be used. Why would you choose NLMIXED?

 

I agree that it is wise to include the BOUNDS statement.
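
As a sketch, the same positivity constraints can be written in HPNLMOD, which also has a BOUNDS statement. The starting values a=1 and b=1 below are arbitrary placeholders, not the moment estimates, and the data step only recreates the simulated data so that the snippet runs on its own.

```sas
data simulation;
  do i=1 to 1000;
    y=rand('beta',2,3);
    output;
  end;
run;

proc hpnlmod data=simulation;
  parms a=1 b=1;
  bounds a > 0, b > 0;   /* both beta parameters must be positive */
  ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
  model y~general(ll);
run;
```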

 

I find it a bit funny that when the "general" likelihood is used, it doesn't matter which variable is on the left side of "~". Both NLMIXED and HPNLMOD require a variable there.

SAS Employee
Posts: 282

Re: hpgenselect for continuous target variable

Posted in reply to Siddharth123

You can fit a beta model using PROC GLIMMIX or PROC FMM.  See the DIST=BETA option in the MODEL statement. See this example of using the beta distribution in GLIMMIX to model a continuous proportion response.
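
A minimal sketch of what the DIST=BETA option looks like in both procedures, assuming a single covariate. The data set sim_prop and the variable x are hypothetical names used only for illustration.

```sas
/* Hypothetical simulated proportion data with one covariate */
data sim_prop;
  call streaminit(42);
  do i=1 to 500;
    x=rand('normal');
    mu_true=1/(1+exp(-(0.2+0.7*x)));
    y=rand('beta',mu_true*8,(1-mu_true)*8);
    output;
  end;
  drop mu_true;
run;

/* Beta regression with a logit link in GLIMMIX */
proc glimmix data=sim_prop;
  model y = x / dist=beta link=logit solution;
run;

/* The same response distribution in FMM (single-component model) */
proc fmm data=sim_prop;
  model y = x / dist=beta;
run;
```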

Valued Guide
Posts: 684

Re: hpgenselect for continuous target variable

Posted in reply to StatDave_sas

As others have correctly pointed out, there are a few ways to fit models to data with a beta distribution. GLIMMIX is the easiest way. However, since the original question dealt with HPGENSELECT, one would assume that they were trying to do variable selection from a large number of potential predictor variables. That cannot be done in an automated way with GLIMMIX or NLMIXED.

 

One should always be careful with the beta distribution: it is defined for 0 < y < 1. This means that all values of y equal to 0 or 1 will become missing values in GLIMMIX. My experience is that datasets with continuous proportions usually have 0s and 1s.
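
Before fitting, it may be worth counting how many observations sit exactly on the boundary. A small sketch, with a made-up data set props standing in for the real data:

```sas
/* Hypothetical toy data containing exact 0s, an exact 1 and one missing value */
data props;
  input y @@;
  datalines;
0 0.12 0.5 1 0.97 0 0.33 .
;
run;

/* Count exact 0s and 1s; a comparison evaluates to 1 (true) or 0 (false) */
proc sql;
  select sum(y=0)  as n_zero,
         sum(y=1)  as n_one,
         count(y)  as n_nonmissing
  from props;
quit;
```

If the counts are not negligible, that is a sign the plain beta model will discard a meaningful share of the data.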
