Hello,
I am new to SAS and I am having trouble figuring out the difference between PROC GLIMMIX and PROC PANEL in the context of panel data analysis. I have an unbalanced panel (CUSTOMER_ID + TIME_PERIOD) and I want to capture unobserved heterogeneity at the customer level, using a random effects model.
PROC GLIMMIX: I am specifying a VC-type covariance matrix between different customers. That is, I expect observations from the same customer to be correlated across time periods, but not across different customers. In addition, all customers have an idiosyncratic error component.
PROC GLIMMIX DATA=BK.DATA1;
CLASS VAR1 CUSTOMER_ID;
MODEL Y = VAR1 VAR2 VAR3 VAR1*VAR3
/LINK = IDENTITY DIST = NORMAL SOLUTION;
RANDOM INTERCEPT/ SUBJECT = CUSTOMER_ID TYPE= VC V;
RUN;
PROC PANEL: For the same covariates as in the above model, I am running into trouble. For example, I do not get the Hausman test result, and I run into multicollinearity problems.
PROC PANEL DATA=BK.DATA1;
CLASS VAR1 ;
MODEL Y = VAR1 VAR2 VAR3 VAR1*VAR3
/RANONE VCOMP = FB;
ID CUSTOMER_ID TIME_PERIOD;
RUN;
I have the following questions:
1. What are the differences in the modeling assumptions between using proc glimmix and proc panel? Aren't they the same (as GLS/FGLS) for a linear model with a single random effect? If so, shouldn't they give identical estimates? I understand that proc glimmix uses GEE, but specifying normal distribution should be the same as GLS, right?
2. Is proc glimmix "better than" proc panel in some sense? Or do they do very different things (in the case of panel data), and maybe I am completely missing something here?
I would highly appreciate your comments.
BK
Actually, the two procedures are giving you the "same" results, with slight variations because of different estimation algorithms. The apparent difference is due to the way the two procedures parameterize the class variable. In GLIMMIX, the parameter for the last level of the class variable (1 here) is forced to be 0. In PANEL, the parameter for the first level of the class variable (0 here) is forced to be 0. This reverses the sign of the class parameter and changes the intercept.
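To spell out the reparameterization for a two-level class variable d coded 0/1, here is a short worked equation. The symbols a and γ and the subscripts G (GLIMMIX) and P (PANEL) are my own notation for illustration, not labels from the SAS output.

```latex
\begin{align*}
% GLIMMIX: the last level (d = 1) is the reference, so its parameter is 0
\text{GLIMMIX:}\quad & E[y \mid d] = a_G + \gamma_G\,\mathbf{1}[d = 0] + \dots \\
% PANEL: the first level (d = 0) is the reference, so its parameter is 0
\text{PANEL:}\quad   & E[y \mid d] = a_P + \gamma_P\,\mathbf{1}[d = 1] + \dots \\
% Both must give the same fitted mean at each level of d
d = 0:\quad & a_G + \gamma_G = a_P \\
d = 1:\quad & a_G = a_P + \gamma_P \\
\Rightarrow\quad & \gamma_P = -\gamma_G, \qquad a_P = a_G + \gamma_G
\end{align*}
```

So the two fits describe the same means; only the labeling of the intercept and the class effect differs.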
Try running GLIMMIX with dep_var1 removed from the CLASS statement, so that it enters the model as a numeric 0/1 covariate:
PROC GLIMMIX DATA=test;
CLASS CUSTOMER_ID time_period;
MODEL dep_var2 = ind_VAR1 ind_VAR2 dep_var1
/LINK = IDENTITY DIST = NORMAL SOLUTION;
RANDOM INTERCEPT/ SUBJECT = CUSTOMER_ID TYPE= VC;
RUN;
You now get the same results as with PANEL. This can only be done when there are two levels to the class variable, coded as 0 and 1.
Note that the random effect variance is 0. That also is contributing to the equivalence of the two methods.
I want to start by saying that I do over 90% of my work in PROC GLIMMIX, so I would say I have a prejudice.
In your PROC GLIMMIX code, I can't seem to identify which variable is your time variable. That is critical to modeling what is going on--you will likely need another RANDOM statement in order to capture the within subject correlation over time. When we have that in hand, I think I can come up with code to address your issue.
Steve Denham
Hi Steve,
Thank you for your response. I did not have any random or fixed effects for the time variable, hence did not include them in GLIMMIX. My model specification is:
Y_ij = B_0 + B_1*VAR1_ij + ... + B_4*VAR1VAR3_ij + v_i + e_ij,
where v_i is the random effect for subject i and e_ij is the common error term. PROC PANEL requires me to specify both cross-section and time period variables regardless, hence I had to include them in that code.
Since I only have random effects for the subject, does it not automatically introduce correlation over time for the same subject? In the above specification, the error terms for a customer across time periods are correlated because of v_i terms.
Also, can you please comment on how GLIMMIX differs from PANEL in terms of the above analysis?
Thanks!
BK
For GLIMMIX to model the within-subject effects, you must specify a time period variable, and if you expect it to have a differential effect on the levels of the other variables, you must also include an interaction effect. Given that, I think the following approximates what you are trying to do:
PROC GLIMMIX DATA=BK.DATA1;
CLASS VAR1 TIME CUSTOMER_ID;
MODEL Y = VAR1 VAR2 VAR3 VAR1*VAR3 TIME TIME*VAR1 TIME*VAR2 TIME*VAR3 TIME*VAR1*VAR3
/LINK = IDENTITY DIST = NORMAL SOLUTION;
RANDOM INTERCEPT/ SUBJECT = CUSTOMER_ID TYPE= VC V;
RANDOM TIME/SUBJECT = CUSTOMER_ID TYPE=AR(1);
RUN;
This models an autoregressive error structure over time within each customer. Other error structures may be more appropriate, depending on the data generating process and the spacing in time of the measurements.
Thus, cross-sectional comparisons can be done at each time interval, while the other effects can be roughly translated as "intercepts" when TIME=0.
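A note on the AR(1) statement: as written, without the RESIDUAL option, the second RANDOM statement defines G-side random effects. If the intent is an autoregressive structure on the residuals themselves (R-side), one possible variant is sketched below. This is an assumption about the intent, not a correction of the code above. It also assumes the time variable is named TIME_PERIOD, as in the original PROC PANEL code.

```sas
/* Sketch: customer-level random intercept plus R-side AR(1) residuals. */
/* TIME_PERIOD is assumed to be the time variable in BK.DATA1.          */
PROC GLIMMIX DATA=BK.DATA1;
  CLASS VAR1 TIME_PERIOD CUSTOMER_ID;
  MODEL Y = VAR1 VAR2 VAR3 VAR1*VAR3 TIME_PERIOD
        / LINK=IDENTITY DIST=NORMAL SOLUTION;
  /* G-side: one random intercept per customer */
  RANDOM INTERCEPT / SUBJECT=CUSTOMER_ID TYPE=VC;
  /* R-side: RESIDUAL applies the AR(1) structure to the residuals within each customer */
  RANDOM TIME_PERIOD / SUBJECT=CUSTOMER_ID TYPE=AR(1) RESIDUAL;
RUN;
```

Listing TIME_PERIOD as the random effect keeps the observations in the right order within each customer, which matters for an unbalanced panel. The time interactions from the model above are left out here only to keep the sketch short.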
Steve Denham
In GLIMMIX or MIXED, you need a random or repeated statement to duplicate PROC PANEL. The subject effect is really there with PANEL (implicitly), whether you think it is there or not. This dated article may help you:
http://www2.sas.com/proceedings/forum2007/170-2007.pdf
lvm, thanks, that is definitely helpful. The following code (taken from that paper) implements two random effects in PROC MIXED.
proc mixed data=two method=type3;
class i t;
model y = x1 x2 x3 /solution;
random i t;
run;
In my case I am only trying to model subject-specific unobserved heterogeneity (i.e., one-way random effects). In that case, do you think I still need to state "class t" and "random t"? This relates to Steve's suggestion above. My understanding is that having a "t" term introduces covariance across different subjects in the same period, which I am trying to avoid.
Thanks,
BK
Steve, the correlation I am trying to capture is this: the error terms across time periods for the same subject have identical covariances (which arise from having a subject-specific random effect). My understanding is that an autoregressive structure captures decaying covariance in the idiosyncratic errors (decreasing as the time between two periods increases), not subject-specific random effects.
Given the above, do you think I would still need the CLASS TIME and RANDOM TIME statements?
Thanks,
BK
A random i; term would give you compound symmetry for the different times within each subject (i). That means equal correlation. I think that is what you want.
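As a rough illustration, the two PROC MIXED specifications below describe the same compound-symmetry structure within subject. The data set two and the variables i, y, x1, x2, x3 are taken from the code quoted from the paper above; dropping t from the model is the only change, and this is a sketch rather than a recommendation.

```sas
/* (a) One-way random effects: a random intercept for each subject i */
proc mixed data=two method=reml;
  class i;
  model y = x1 x2 x3 / solution;
  random i;
run;

/* (b) The same structure written as a compound-symmetry residual covariance */
proc mixed data=two method=reml;
  class i;
  model y = x1 x2 x3 / solution;
  repeated / subject=i type=cs;
run;
```

The two fits should agree when the estimated within-subject covariance is positive. They can differ when it is not, because the variance component in (a) is bounded below at zero by default, while the CS covariance in (b) may be negative.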
@lvm, I think I want to disagree. I think that ignoring the repeated nature, implemented by t in this example, is the same as throwing all of the measurements for an individual into one large bucket, such that a "panel" inference would be impossible. It would just be a one-way analysis. Maybe I am missing something.
Steve Denham
For a normal distribution, the random i statement gives compound symmetry within subjects (i), since there is also a residual by default. All observation pairs within i have the same correlation, just like in an RCBD. Of course, to get a structure to the correlation, other statements would be needed.
Thanks, @lvm, that makes sense. It just doesn't fit my preconceived notion of what panel data looks like, so I plead guilty to carrying my prejudices into the analysis.
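The covariance algebra behind this can be written out using BK's notation (v_i for the subject effect, e_ij for the residual). The assumption, not stated explicitly in the thread, is that v_i and e_ij have mean zero, are mutually independent, and have variances σ_v² and σ_e².

```latex
\begin{align*}
y_{ij} &= x_{ij}'\beta + v_i + e_{ij},
  \qquad v_i \sim (0, \sigma_v^2), \quad e_{ij} \sim (0, \sigma_e^2) \\
% same subject, same period
\operatorname{Var}(y_{ij}) &= \sigma_v^2 + \sigma_e^2 \\
% same subject, different periods: only v_i is shared
\operatorname{Cov}(y_{ij}, y_{ik}) &= \sigma_v^2 \qquad (j \neq k) \\
% different subjects: nothing is shared
\operatorname{Cov}(y_{ij}, y_{i'k}) &= 0 \qquad (i \neq i') \\
\operatorname{Corr}(y_{ij}, y_{ik}) &= \frac{\sigma_v^2}{\sigma_v^2 + \sigma_e^2} \qquad (j \neq k)
\end{align*}
```

The correlation does not depend on how far apart periods j and k are, which is exactly what compound symmetry means.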
Steve Denham
I am not recommending this model, per se, just showing a model that could be used.
I am getting very different outputs from PROC GLIMMIX and PROC PANEL with the same panel data (although I thought the two specifications below were statistically equivalent). I agree they use different estimation techniques (PANEL uses FGLS and GLIMMIX uses GEE), but I think that is not the reason for the discrepancy.
Data test;
Title 'sample_customer';
Input customer_id time_period dep_var1 dep_var2 ind_var1 ind_var2;
datalines;
1 1 1 12 10 .9
1 2 1 15 7 8.3
1 3 1 8.9 8 2.3
1 4 0 0 6 2
1 5 0 0 6 5
1 6 1 19 3 4
1 7 1 4 4 3
2 1 1 12 10 5
2 2 0 0 7 3
2 3 0 0 8 3
2 4 0 0 6 2
3 1 1 40 20 10
3 2 1 24 17 19
3 3 0 0 18 2.3
3 4 0 0 16 12
3 5 0 0 26 35
3 6 0 0 33 24
3 7 0 0 24 13
3 8 0 0 12 31
3 9 1 42 36 18
;
PROC GLIMMIX DATA=test;
CLASS CUSTOMER_ID dep_var1;
MODEL dep_var2 = ind_VAR1 ind_VAR2 dep_var1
/LINK = IDENTITY DIST = NORMAL SOLUTION;
RANDOM INTERCEPT/ SUBJECT = CUSTOMER_ID TYPE= VC;
RUN;
PROC PANEL DATA=test;
CLASS dep_var1;
MODEL dep_var2 = ind_VAR1 ind_VAR2 dep_var1
/RANONE VCOMP = FB;
ID CUSTOMER_ID TIME_PERIOD;
RUN;
Thanks,
BK
Actually, GLIMMIX uses REML (restricted maximum likelihood) for normal data such as yours, not GEE. It uses GEE when the response is, say, Poisson or binomial and one properly sets up a residual structure.
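One way to cross-check this is to fit the same model in PROC MIXED, where the estimation method can be stated explicitly. The sketch below assumes the test data set from the earlier post is available. dep_var1 is deliberately left out of the CLASS statement, so it enters as a 0/1 covariate.

```sas
/* Sketch: same one-way random effects model, with REML requested explicitly */
proc mixed data=test method=reml;
  class customer_id;
  model dep_var2 = ind_var1 ind_var2 dep_var1 / solution;
  /* random intercept per customer, matching the GLIMMIX RANDOM statement */
  random intercept / subject=customer_id;
run;
```

I would expect the fixed-effect estimates and variance components to match the GLIMMIX fit of the same model up to numerical tolerance, since both are REML fits for normal data with an identity link.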