topic Re: hpgenselect for continuous target variable in Statistical Procedures

hpgenselect for continuous target variable

Siddharth123 — Wed, 05 Jul 2017 22:21:04 GMT

Hi,

I am unsure whether HPGENSELECT can be applied when the target is continuous and follows a beta distribution. I do not want to use beta regression. If HPGENSELECT cannot be used, does any other approach work?

Kind Regards

Re: hpgenselect for continuous target variable

lvm — Thu, 06 Jul 2017 01:41:13 GMT

Unfortunately, this procedure cannot handle the beta distribution. As an approximation, you could use PROC GLMSELECT, with the WEIGHT statement to account for unequal variances in Y.
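A minimal sketch of what that suggestion might look like. The data set name mydata, the predictors x1-x20, and the weight variable w are hypothetical; in particular, w is assumed to be a precomputed weight (for example, an inverse-variance estimate) that you supply yourself.

```sas
/* Sketch only: mydata, x1-x20 and w are assumed names, not from the thread */
proc glmselect data=mydata;
  /* stepwise selection, using SBC to decide which effects enter or leave */
  model y = x1-x20 / selection=stepwise(select=sbc);
  /* w down-weights observations believed to have larger variance */
  weight w;
run;
```

This treats Y as approximately normal, so it is only an approximation to a beta model, as noted above.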

Re: hpgenselect for continuous target variable

JacobSimonsen — Thu, 06 Jul 2017 08:16:43 GMT

Or you can use proc hpnlmod. The beta distribution is quite simple, so you can specify the likelihood inside hpnlmod, and use the "general" likelihood in the model statement.

Re: hpgenselect for continuous target variable

JacobSimonsen — Thu, 06 Jul 2017 11:46:02 GMT

Here is a simple example of how you can find the maximum likelihood estimates of the two parameters when all observations are beta-distributed with the same parameters. I think the example can easily be extended to situations where the data contain covariates.

data simulation;
  do i=1 to 1000;
    y=rand('beta',2,3);
    sqy=y**2;
    output;
  end;
run;

*Start values are found by the method of moments. Therefore, the means of y and y^2 are calculated.;
proc means data=simulation mean ;
  var y sqy;
  output out=startvalues mean=y sqy;
run;

data _NULL_;
  set startvalues;
  a=y*(y-sqy)/(sqy-y**2);
  b=(y-1)*(sqy-y)/(sqy-y**2);
  put a= b=;
  call symput('starta',put(a,best.));
  call symput('startb',put(b,best.));
run;
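For reference, the start-value formulas in the DATA _NULL_ step can be derived as follows. The symbols m_1 and m_2 are introduced here only for illustration; they stand for the sample means of y and y^2 (the variables y and sqy in the startvalues data set).

```latex
\begin{align*}
E[Y] &= \frac{a}{a+b}, \qquad
\operatorname{Var}(Y) = \frac{E[Y]\,(1-E[Y])}{a+b+1} \\
\text{Set } E[Y] &= m_1,\quad \operatorname{Var}(Y) = m_2 - m_1^2: \\
a+b &= \frac{m_1(1-m_1)}{m_2-m_1^2} - 1 = \frac{m_1-m_2}{m_2-m_1^2} \\
a &= m_1\,(a+b) = \frac{m_1\,(m_1-m_2)}{m_2-m_1^2} \\
b &= (1-m_1)\,(a+b) = \frac{(m_1-1)\,(m_2-m_1)}{m_2-m_1^2}
\end{align*}
```

The last two lines match the expressions assigned to a and b in the code.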

*Here the maximum likelihood estimates are found;
*The moment estimators from above are used as starting values;

proc hpnlmod data=simulation;
  parms a &starta. b &startb.;
  ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
  model i~general(ll);
run;
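One way the covariate extension mentioned above might look is sketched below. It assumes a hypothetical data set mydata with a response y strictly between 0 and 1 and a single covariate x; the parameter names b0, b1 and phi and their start values are illustrative, not tuned. It uses a mean/precision parameterization with a logit link, which is effectively a beta regression written out by hand.

```sas
/* Sketch only: mydata, x, b0, b1 and phi are assumed names */
proc hpnlmod data=mydata;
  parms b0 0 b1 0 phi 1;
  bounds phi > 0;
  /* the mean of y depends on x through a logit link */
  mu=1/(1+exp(-(b0+b1*x)));
  /* convert mean and precision to the usual shape parameters */
  a=mu*phi;
  b=(1-mu)*phi;
  ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
  model y~general(ll);
run;
```

More covariates can be added to the linear predictor inside exp() in the same way.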

Re: hpgenselect for continuous target variable

Rick_SAS — Thu, 06 Jul 2017 12:29:44 GMT

I like JacobSimonsen's approach.

@JacobSimonsen, could you share why you decided to go with PROC HPNLMOD? I would have chosen PROC NLMIXED, like this:

proc nlmixed data=simulation;
  parms a &starta. b &startb.;
  bounds 0 < a,b;
  ll=(a-1)*log(y)+(b-1)*log(1-y)-logbeta(a,b);
  model y ~ general(ll);
run;

@Siddharth123, if you want to see additional examples of formulating models as MLE problems and solving them with SAS procedures (such as NLMIXED), see

Re: hpgenselect for continuous target variable

JacobSimonsen — Thu, 06 Jul 2017 12:41:48 GMT

My simple rule of thumb for choosing between PROC HPNLMOD and PROC NLMIXED is that if I have random effects I use NLMIXED, and otherwise HPNLMOD. That is simply because HPNLMOD is generally faster. In this case I have no strong opinion about which of these two procedures should be used. Why would you choose NLMIXED?

I agree that it is wise to include the BOUNDS statement.

I find it a bit funny that when the "general" likelihood is used, it doesn't matter which variable is on the left side of "~". Both NLMIXED and HPNLMOD require a variable there.

Re: hpgenselect for continuous target variable

StatDave — Fri, 07 Jul 2017 13:59:49 GMT

You can fit a beta model using PROC GLIMMIX or PROC FMM. See the DIST=BETA option in the MODEL statement. See this example of using the beta distribution in GLIMMIX to model a continuous proportion response.
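A minimal sketch of the GLIMMIX version, assuming a hypothetical data set mydata with a response y strictly between 0 and 1 and two hypothetical predictors x1 and x2:

```sas
/* Sketch only: mydata, x1 and x2 are assumed names */
proc glimmix data=mydata;
  /* beta response with a logit link; SOLUTION prints the parameter estimates */
  model y = x1 x2 / dist=beta link=logit solution;
run;
```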

Re: hpgenselect for continuous target variable

lvm — Fri, 07 Jul 2017 14:31:50 GMT

As others have correctly pointed out, there are a few ways to fit models to data with a beta distribution. GLIMMIX is the easiest way. However, since the original question dealt with HPGENSELECT, one would assume that they were trying to do variable selection from a large number of potential predictor variables. That cannot be done in an automated way with GLIMMIX or NLMIXED.

One should always be careful with the beta distribution: it is defined for 0 < y < 1. This means that all values of y equal to 0 or 1 will become missing values in GLIMMIX. My experience is that datasets with continuous proportions usually have 0s and 1s.
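A quick way to check for this before fitting is to count the boundary values. This is a sketch that assumes a hypothetical data set mydata with the response stored in y:

```sas
/* Sketch only: mydata is an assumed name */
proc sql;
  /* a comparison evaluates to 1 or 0, so SUM counts the matching rows */
  select sum(y=0)  as n_zero,
         sum(y=1)  as n_one,
         count(y)  as n_nonmissing
  from mydata;
quit;
```

If n_zero or n_one is not 0, those observations would be excluded from a beta fit as described above.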