Programming the statistical procedures from SAS

Model selection when the dependent variable contains zeros

Contributor
Posts: 39

Hi,

I am working on a problem where my dependent variable is continuous but contains many zeros (about 25% of observations). The purpose of my study is out-of-sample prediction, so I would expect several predicted values to be zero as well. I understand that I cannot use a count model, since my dependent variable is continuous. OLS is a possibility, but in this case OLS gives low predictions, hardly any of which can be considered zero. I also tried a GLM with a Tweedie distribution and link=log; this also gives no predictions as close to zero as I would expect. However, I ran a tobit model left-censored at zero, and it gave a mean predicted value very close to the observed mean. Tobit also generated zero predictions, but it predicted zero for about 68% of cases, which is far higher than the observed 25%.
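The behavior described above is consistent with the standard tobit formulas, sketched here for reference. The symbols (latent response y*, coefficients β, error standard deviation σ, standard normal CDF Φ and density φ) are notation introduced for illustration and do not come from the original post.

```latex
% Tobit with left-censoring at zero
y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad y_i = \max(0,\, y_i^*)

% Probability of observing a zero
P(y_i = 0 \mid x_i) = 1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)

% Unconditional expected value (strictly positive)
E[y_i \mid x_i] = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right) x_i'\beta + \sigma\,\phi\!\left(\frac{x_i'\beta}{\sigma}\right)
```

If the zero predictions come from censoring the linear predictor, that is, predicting zero whenever x'β ≤ 0, then a 68% zero rate would mean the fitted index is negative for most cases. The expected value in the last line is never exactly zero, so the two kinds of prediction can look very different.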

Next, I am going to estimate a hurdle regression, but I would appreciate any suggestions for an alternative model that might be better suited.
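For anyone following the thread, a hurdle (two-part) model of the kind mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data set name datafile, response x, and predictors y1 to y3 are borrowed from later posts, a lognormal positive part is assumed, and the 0.5 cutoff for calling a prediction zero is arbitrary.

```sas
/* Hypothetical names: DATAFILE, X, Y1-Y3 follow the later posts in this thread */
data twopart;
   set datafile;
   positive = (x > 0);              /* 1 if the response is positive, 0 if it is zero */
   if x > 0 then logx = log(x);     /* log response, left missing for zeros */
run;

/* Part 1: probability that the response is positive */
proc logistic data=twopart;
   model positive(event='1') = y1 y2 y3;
   output out=part1 p=p_positive;
run;

/* Part 2: linear model for log(x); rows with missing logx are left out
   of the fit but still receive predicted values */
proc reg data=part1 outest=est noprint;
   model logx = y1 y2 y3;
   output out=part2 p=pred_logx;
run; quit;

data predictions;
   if _n_ = 1 then set est(keep=_rmse_);        /* residual standard deviation */
   set part2;
   cond_mean = exp(pred_logx + _rmse_**2 / 2);  /* lognormal mean given x > 0 */
   expected  = p_positive * cond_mean;          /* unconditional mean, always > 0 */
   /* Exact zeros come only from a classification rule; 0.5 is an assumed cutoff */
   predicted = ifn(p_positive < 0.5, 0, cond_mean);
run;
```

The variable expected is the natural choice when the goal is a mean prediction, while predicted forces exact zeros through the cutoff; which one is appropriate depends on how the out-of-sample predictions will be scored.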

Thanks in advance.

-CD

Respected Advisor
Posts: 2,655

If you haven't investigated PROC FMM (finite mixture models), you might want to look at that, especially the examples.  In particular, the prescreening of the data with PROC KDE might open up some other ideas.
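A prescreening step along these lines might look like the sketch below. The data set datafile and response x are names taken from later posts in this thread; the split into a full-data plot and a positive-only log-scale plot is an assumption about what would be useful here, not something prescribed by the FMM examples.

```sas
ods graphics on;

/* Density of the full response, including the spike at zero */
proc kde data=datafile;
   univar x / plots=histdensity;
run;

/* Positive values only, on the log scale, to check whether
   a lognormal shape looks plausible for the nonzero part */
data positive_only;
   set datafile;
   where x > 0;
   logx = log(x);
run;

proc kde data=positive_only;
   univar logx / plots=histdensity;
run;
```

If the log-scale density looks roughly symmetric and unimodal, a lognormal component for the positive part is a reasonable starting point; several modes would suggest more than one positive component.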

Steve Denham

Contributor
Posts: 39

Thanks. I will look into that.

Contributor
Posts: 39

Hi Steve,

As per your suggestion, I have been experimenting with PROC FMM. I looked through the 130-page SAS documentation for the FMM procedure and a few other documents, but I am still confused about a few things. Most of the examples out there are for count data. As I mentioned earlier, the response variable in my data is continuous but has many zeros. I think what I am trying to do is mix a logit model (for the zero versus nonzero part) with a lognormal distribution (for the positive part). This is what I am doing:

(For the second MODEL statement I tried both dist=constant and dist=binary. With binary I don't get any zero predictions, which I would normally expect. I am not sure whether I am doing this part wrong or the prediction part wrong.)

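proc fmm data=datafile;
   /* Component 1: lognormal for the positive part */
   model x = y1 y2 y3 / noint dist=lognormal;
   /* Component 2: point mass at zero; added components use "+" in place of the response */
   model + / dist=constant;
   /* Mixing probabilities depend on the same covariates */
   probmodel y1 y2 y3;
   output out=fmm predicted residual;
run;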
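One thing that may explain the missing zero predictions: in a two-component mixture of a point mass at zero and a lognormal, the overall mean is strictly positive for every observation. The notation below (π for the probability of the zero component, μ for the lognormal location, σ² for its scale) is introduced for illustration only, and whether PREDICTED in the OUTPUT statement returns this overall mixture mean is an assumption worth checking against the FMM documentation.

```latex
% Mixture of a point mass at zero (probability \pi_i) and a lognormal
E[x_i] = \pi_i \cdot 0 + (1 - \pi_i)\,\exp\!\left(\mu_i + \frac{\sigma^2}{2}\right)
       = (1 - \pi_i)\,\exp\!\left(\mu_i + \frac{\sigma^2}{2}\right) > 0
\quad \text{whenever } \pi_i < 1
```

Under that reading, exact zeros would have to come from a separate rule applied to the estimated mixing probability, for example predicting zero whenever π exceeds some chosen cutoff.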

Thank you very much.

Respected Advisor
Posts: 2,655

I hadn't even considered the dist=constant--that's clever, and it makes it look more like a hurdle model, which would fit the process better, I think.

Steve Denham
