08-04-2013 01:07 AM
I need to use zero-truncated negative binomial regression to analyze panel data. Is there a way to include fixed effects in models like this? Thanks -
The panel data are similar to the web-visit log of a library's book finder system. See the following table. For a given time period, I can calculate, for each guest, how many times he or she browsed for books by a given author. For example, I can calculate that guest_A accessed the library's main book page for author2 4 times during 2008Q1. This would be one observation in my regression. The reason I am trying to use a zero-truncated count model is that the computer system doesn't register any visit history until a guest actually visits the library book finder system. I don't have the whole population, because a guest may visit other libraries or may not be interested in reading books at all. Therefore, the minimum count is 1.
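For reference, the zero-truncated negative binomial described above can be written out as follows. This is only a sketch under assumptions not stated in the thread: the NB2 parameterization with mean mu_it and dispersion alpha, a log link, a covariate vector x_it, and a guest-level fixed effect c_i. All symbols are illustrative.

```latex
% NB2 probability mass function with mean \mu_{it} and dispersion \alpha
P(Y_{it}=y) = \frac{\Gamma(y+\alpha^{-1})}{\Gamma(y+1)\,\Gamma(\alpha^{-1})}
  \left(\frac{1}{1+\alpha\mu_{it}}\right)^{\alpha^{-1}}
  \left(\frac{\alpha\mu_{it}}{1+\alpha\mu_{it}}\right)^{y},
  \qquad y = 0, 1, 2, \dots

% Zero truncation: condition on Y_{it} > 0
P(Y_{it}=y \mid Y_{it}>0) = \frac{P(Y_{it}=y)}{1-(1+\alpha\mu_{it})^{-1/\alpha}},
  \qquad y = 1, 2, \dots

% Log link with a guest-level fixed effect c_i (assumed specification)
\log \mu_{it} = x_{it}'\beta + c_i
```

The denominator is one minus the probability of a zero count, which rescales the probabilities of the positive counts so that they sum to one.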
There are close to 1 million observations in my sample. I also included PROC FREQ output in the table below.
08-05-2013 09:55 AM
If by zero truncated you mean there are no counts equal to zero by design, then I think what you might have to do is log transform the counts before using PROC PANEL. I don't see a way to specify a distribution in the documentation.
However, I may be jumping the gun. What are the panel data? There may be a way to specify a model in PROC QLIM that could address this.
08-05-2013 04:41 PM
Thank you for your suggestion. I updated some information on my question above. Could you share some of the insights you might have for dealing with this situation? I hope to include fixed effects into my model in order to mitigate omitted variable bias.
08-06-2013 09:49 AM
I come from a different statistical background, so you will have to help me out some. Why is this considered panel data? I may be missing something in the proc freq output--is this ignoring time periods? Is the presence of time as a factor what makes this panel data? If so, then a repeated measures in time using TCOUNTREG might be an option, as might GLIMMIX, using a gamma distribution.
Sounds like QLIM is not the way to go, if time is a factor, and there is correlation in the counts between time intervals for the subjects. It just doesn't address repeated measures well (yet).
08-06-2013 01:49 PM
Many thanks for your valuable suggestions. I calculated the count myself as shown below and use it as a measure of readers' interest in the work of a particular author. Table 1 is the first table in this thread.
/* Assumed names: table1 is the raw visit log (one row per page view); table2 is an intermediate table */
data table2;
set table1;
year = year(date);
qtr = qtr(date);
yearqtr = cats(year, 'Q', qtr); /* e.g. 2008Q1 */
run;
/* Count the number of times a guest searched for books by a particular author (I don't care what kind of books at this point) */
proc sql;
create table table3 as
select guest_id, yearqtr, author_id, count(*) as count
from table2
group by guest_id, yearqtr, author_id
order by guest_id, yearqtr, author_id;
quit;
/* yearqtr ignored for brevity */
proc freq data=table3;
table count;
run;
Sorry about the confusion. I checked the TCOUNTREG manual for 9.4; it doesn't seem to cover fixed-effects models for panel count data in which zeros are truncated. Could you expand a little on using the gamma distribution in this situation? Thanks -
08-07-2013 01:47 PM
Well, the gamma distribution is just a continuous distribution with a right skew, which is what your data show. Your data would be an almost perfect fit to an exponential distribution if the count variable were replaced by count-1, and the exponential is a special case of the gamma. But you are lucky in a sense, because GLIMMIX requires a link of some kind between the original scale and the model scale, and for a gamma this is a log function. Since your counts are all at least 1, the log is defined for every observation.
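To make the relationship concrete, here is the gamma density in its mean and shape form, with the exponential as the special case where the shape equals 1. The symbols (mu for the mean, nu for the shape, x and beta for the linear predictor) are illustrative and not taken from the thread.

```latex
% Gamma density with mean \mu and shape \nu, for y > 0
f(y;\mu,\nu) = \frac{1}{\Gamma(\nu)}\left(\frac{\nu}{\mu}\right)^{\nu} y^{\nu-1}
  \exp\!\left(-\frac{\nu y}{\mu}\right)

% Setting \nu = 1 gives the exponential density with mean \mu
f(y;\mu,1) = \frac{1}{\mu}\exp\!\left(-\frac{y}{\mu}\right)

% Log link: the linear predictor models the log of the mean
\log \mu = x'\beta
```

Note that the gamma density is defined only for y > 0, so a shifted response such as count-1, which contains zeros, could not be modeled with a gamma directly; the unshifted count (minimum 1) can.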
The following fits a fixed repeated effect of timeperiod with a response of count, summarized by timeperiod and guest_id.
proc glimmix data=yourdatasummarizedbytimeperiodandguest_id;
class timeperiod guest_id;
model count = timeperiod / dist=gamma; /* this could be dist=negbin for a negative binomial */
random timeperiod / subject=guest_id type=cs; /* I am going with CS here, only because I don't know the time spacing, or much about the process generating the data */
run;
Message was edited by: Steve Denham