I have a large panel data set that contains daily stock returns, and I would like to run some fixed-effect regressions with clustering at the stock and month level.
I used PROC SURVEYREG as below but got the message "ERROR: Integer overflow on computing amount of memory required". This is probably because there are too many stocks. Is there a more efficient way to handle this?
proc surveyreg data=prices;
cluster stock date;
class stock;
model y=x1 x2 x3 x4 x5 x6 stock/solution;
run;
PROC SURVEYREG will have memory issues when there are a large number of cluster levels and the clusters are also included in the model. There are several procedures that handle clustered standard errors, and SURVEYREG might not be the best one for this model.
You could account for clustering in a modeling procedure such as PROC MIXED (for normal response data) or PROC GENMOD (for binomial, count, and other response distributions). PROC MIXED adjusts the standard errors for the fixed effects when you have a RANDOM statement in the model.
In the one-way case, say you have correlated firm-year observations, and you want to control for fixed effects at the year and industry level while computing standard errors clustered at the firm level (the clustering unit could be a firm, a school, etc.). The PROC MIXED code, shown here without the year and industry terms, would be
proc mixed empirical;
class firm;
model y = x1 x2 x3 / solution;
random int / subject=firm;
run;
With the MIXED code, you are estimating the model
Yij = U + B1*X1ij + B2*X2ij + B3*X3ij + Fi + Eij
where i indexes firms and j indexes the observations within a firm, the Fi's are the firm effects, assumed to come from a normal distribution with mean 0 and variance SIGMA_FIRM**2, and the Eij's are the residuals. MIXED estimates SIGMA_FIRM**2 and the variance of the residuals, and uses those estimates to adjust the standard errors of the regressors.
You can further adjust the mixed model by using the EMPIRICAL option in the PROC MIXED statement, as the code above does, to get empirical (sandwich) standard errors for the fixed effects.
The MODEL statement is for the fixed effects and the RANDOM statement is for the random effects. PROC MIXED computes the estimates and standard errors for the fixed effects using functions of the V matrix, which is the variance-covariance matrix of y. This matrix depends on the specification of the RANDOM and REPEATED statements. So the standard errors for the fixed effects already take into account the random effects in this model, and therefore account for the clusters in the data.
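The example above leaves out the year and industry fixed effects mentioned earlier. A minimal sketch of how they might be added follows. The data set name firm_years and the variables year and industry are assumptions for illustration, not names taken from the original post.

```sas
/* Sketch only: firm_years, year, and industry are hypothetical names. */
proc mixed data=firm_years empirical;   /* EMPIRICAL requests sandwich standard errors */
   class firm year industry;
   /* year and industry enter the MODEL statement as fixed effects */
   model y = x1 x2 x3 year industry / solution;
   /* random intercept per firm; SUBJECT=firm defines the clusters */
   random int / subject=firm;
run;
```

Because every level of year and industry gets its own parameter, this stays practical only when those CLASS variables have a moderate number of levels.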
If you have data from a complex survey design with cluster sampling, then you could use the CLUSTER statement in PROC SURVEYREG. PROC SURVEYREG uses design-based methodology instead of the model-based methods used in the traditional analysis procedures. Survey researchers are typically not interested in modeling the clusters and estimating parameters related to them. If you do have survey data but one of your goals is to estimate main effects or interactions involving clusters, you could omit the CLUSTER statement in this procedure. Keep in mind that the parameter estimates themselves are unaffected by the CLUSTER statement; only their standard errors change, so omitting it will not affect the estimates. If you do not have survey data, then PROC MIXED is the better choice for fixed effects with clustered standard errors.
If you have panel data, you might find what you want in PROC PANEL. PROC PANEL is designed for panel data models, and it provides the HCCME= option to request a heteroscedasticity correction of the standard errors and the CLUSTER option to adjust the standard errors for clustering. At this time, the CLUSTER option is supported only when HCCME= is set to 0, 1, 2, or 3. So if you want heteroscedasticity-robust standard errors that are also corrected for clustering, specify HCCME=0, 1, 2, or 3 together with the CLUSTER option in the MODEL statement of PROC PANEL. The MODEL statement syntax section of the PROC PANEL documentation describes this functionality under the following entry:
HCCME= NO | number
specifies the type of HCCME variance-covariance matrix.
If you specify HCCME=NO, the variance-covariance matrix is not corrected. The value number can be any integer from 0 to 4, inclusive. By default, HCCME=NO. The formulas are defined in SAS/ETS(R) User's Guide > The PANEL Procedure > Details > Heteroscedasticity-Corrected Covariance Matrices.
The reference for the formula for the cluster adjustment is Wooldridge (2002, p. 152):
Wooldridge, J. M. (2002), Econometric Analysis of Cross Section and Panel Data, Cambridge, MA: MIT Press.
So a syntax example where FIRM and TIME are fixed effects, specifying the cluster adjustment to the variance-covariance matrix with HCCME is:
proc panel data=airline;
id firm time;
model y=x1 x2 x3 /fixtwo hccme = 2 cluster;
run;
The heteroscedasticity-consistent covariance matrix estimator (HCCME) was enhanced by adding the CLUSTER option for the plain sandwich form (HC0), the degrees-of-freedom-adjusted form (HC1), and two types of leverage-adjusted estimators (HC2 and HC3). The CLUSTER option enables you to calculate a cluster-corrected covariance matrix and provides cluster-adjusted standard errors for parameter estimates. PROC PANEL assumes the cross sections are correlated and that there is no autocorrelation in the time series.
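For orientation, the cluster-robust version of the plain sandwich form (HC0) is usually written as below. This is the standard textbook expression in generic notation, not a quotation from the SAS documentation: g indexes the G clusters, X_g is the regressor matrix for cluster g, and \hat{u}_g is that cluster's vector of residuals.

```latex
\widehat{\mathrm{Var}}(\hat{\beta})
  = (X'X)^{-1}
    \left( \sum_{g=1}^{G} X_g' \, \hat{u}_g \hat{u}_g' \, X_g \right)
    (X'X)^{-1}
```

The other forms modify this expression with a degrees-of-freedom adjustment (HC1) or a leverage adjustment (HC2 and HC3).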
Other than the above-mentioned cluster-adjusted heteroscedasticity-corrected covariance matrix, we do not offer any other cluster-adjusted standard errors for panel data models.
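Applied to the daily stock data in the original question, the same pattern might look like the sketch below. It assumes that the PRICES data set has one observation per stock per date. It uses one-way fixed effects (FIXONE) for the cross sections, and it does not provide two-way clustering by stock and month. Whether it avoids the memory problem will depend on the size of the data.

```sas
/* Sketch only: assumes PRICES has one row per stock and date. */
proc sort data=prices;
   by stock date;          /* PROC PANEL expects data sorted by cross section, then time */
run;

proc panel data=prices;
   id stock date;          /* cross-section ID first, time ID second */
   /* FIXONE: one-way (cross-section) fixed effects          */
   /* HCCME=1 with CLUSTER: cluster-corrected HC1 covariance */
   model y = x1 x2 x3 x4 x5 x6 / fixone hccme=1 cluster;
run;
```

Here the stock effects are handled by FIXONE rather than by listing STOCK as a regressor, which is the combination the SURVEYREG attempt specified explicitly.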
Dear Zard,
Thanks for your comprehensive explanation. But I think PROC MIXED cannot absorb fixed effects by firm while clustering standard errors by date. It absorbs fixed effects and clusters standard errors on the same variable. Do you know of any method other than PROC SURVEYREG that can do this?
You can have a look at my macro, which combines multi-way fixed effects, multi-way clustering, and IV estimation:
http://olivier.godechot.free.fr/hoparticle.php?id_art=721
@Zard Your reply is detailed and very helpful. I need your guidance for the case where Y is a dichotomous variable (1, 0) and clustered standard errors are required at the firm level with fixed effects at the industry level. Your help will be much appreciated.
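For a dichotomous response such as the one described here, one option in line with the PROC GENMOD suggestion earlier in the thread is a GEE logistic model, in which the REPEATED statement yields empirical standard errors clustered on the SUBJECT= variable. A minimal sketch follows. The data set name firm_data and the variable industry are assumed for illustration.

```sas
/* Sketch only: firm_data and industry are hypothetical names. */
proc genmod data=firm_data descending;   /* DESCENDING models the probability that y = 1 */
   class firm industry;
   /* industry enters as a fixed effect in a logistic model */
   model y = x1 x2 x3 industry / dist=binomial link=logit;
   /* GEE with an independence working correlation; clusters are firms */
   repeated subject=firm / type=ind;
run;
```

The GEE parameter table that GENMOD prints reports the empirical standard errors.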
Hi @Zard
I have a panel data set (firm-year), and I would like to include firm and year fixed effects and cluster standard errors at firm level.
I tried PROC MIXED as you suggested, but in my case I need to run a two-stage regression. I therefore tried to output a data set that includes the predicted values, so that I can regress on the predicted value in the second stage. However, I cannot find a statement in PROC MIXED that outputs a data set with the predicted values.
As there aren't a large number of cluster levels in my data set, I've also tried PROC SURVEYREG. My code is as follows:
proc surveyreg data=panel_data;
class firm year;
cluster firm;
model early_refin = turn_call /*early_refin and turn_call are dummy variables*/
asset
leverage
firm
year
/ adjrsq solution;
output out=firststage p=predicted_value;
run;
proc surveyreg data=firststage;
class firm year;
cluster firm;
model elimat = predicted_value /*elimat is a dummy variable*/
asset
leverage
firm
year
/ adjrsq solution;
run;
The PROC SURVEYREG code above runs without errors, but the results are very different from the ones I got using R. I understand that algorithms differ from software to software, but I was wondering whether this PROC SURVEYREG code does what I expect it to do.
Could you give me some advice on how to handle my case (2SLS with firm and year fixed effects and standard errors clustered at the firm level), or on how to adjust the code above to do this?
Thank you in advance.
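On the question of saving predicted values from PROC MIXED: the MODEL statement accepts OUTPM= (predictions from the fixed effects only) and OUTP= (predictions that also include the estimated random effects). A minimal sketch of the first stage follows, using the variable names from the post above. The output data set name firststage_mixed is an assumption, and the firm effect is modeled here as a random intercept rather than as a fixed effect.

```sas
/* Sketch only: first stage from the post above, refit in PROC MIXED. */
proc mixed data=panel_data empirical;
   class firm year;
   /* OUTPM= writes the input data plus fixed-effects predictions (variable Pred) */
   model early_refin = turn_call asset leverage year
         / solution outpm=firststage_mixed;
   random int / subject=firm;   /* firms are the clusters */
run;
```

Note that running the second stage by hand on the predicted values generally does not give correct second-stage standard errors, because that regression ignores the fact that the predictor was itself estimated.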