R doesn’t have a built-in method for calculating heteroskedastically-robust and clustered standard errors. Since this is a common need for us, I’ve put together a little library to provide these functions.
Download the library.
The file contains three functions:
get.coef.robust
: Calculate heteroskedastically-robust standard errors
get.coef.clust
: Calculate clustered standard errors
get.conf.clust
: Calculate confidence intervals for clustered standard errors
The arguments for the functions are below, and then an example is at the end.
get.coef.robust(mod, var=c(), estimator=”HC”)
Calculate heteroskedastically-robust standard errors
mod: the result of a call to lm()
e.g., lm(satell ~ width + factor(color))
var: a list of indexes to extract only certain coefficients (default: all)
e.g., 1:2 [to drop FE from above]
estimator: an estimator type passed to vcovHC (default: White's estimator)
Returns an object of type coeftest, with estimate, std. err, t and p values
get.coef.clust(mod, cluster, var=c())
Calculate clustered standard errors
mod: the result of a call to lm()
e.g., lm(satell ~ width + factor(color))
cluster: a cluster variable, with a length equal to the observation count
e.g., color
var: a list of indexes to extract only certain coefficients (default: all)
e.g., 1:2 [to drop FE from above]
Returns an object of type coeftest, with estimate, std. err, t and p values
get.conf.clust(mod, cluster, xx, alpha, var=c())
Calculate confidence intervals for clustered standard errors
mod: the result of a call to lm()
e.g., lm(satell ~ width + factor(color))
cluster: a cluster variable, with a length equal to the observation count
e.g., color
xx: new values for each of the variables (only needs to be as long as 'var')
e.g., seq(0, 50, length.out=100)
alpha: the level of confidence, as an error rate
e.g., .05 [for 95% confidence intervals]
var: a list of indexes to extract only certain coefficients (default: all)
e.g., 1:2 [to drop FE from above]
Returns a data.frame of yhat, lo, and hi
Example in Stata and R
The following example compare the results from Stata and R for an example from section 3.3.2 of “An Introduction to Categorical Data Analysis” by Alan Agresti. The data describes 173 female horseshoe crabs, and the number of male “satellites” that live near them.
Download the data as a CSV file. The columns are the crabs color (2-5), spine condition (1-3), carapace width (cm), number of satellite crabs, and weight (kg).
The code below calculates a simple model with crab color fixed-effects, relating width (which is thought to be an explanatory variable) and number of satellites. The first model is a straight OLS regression; then we add robust errors, and finally crab color clustered errors.
The analysis in Stata.
First, let’s do the analysis in Stata, to make sure that we can reproduce it in R.
import delim data.csv
xi: reg satell width i.color
xi: reg satell width i.color, vce(hc2)
xi: reg satell width i.color, vce(cluster color)
Here are excerpts of the results.
The OLS model:
------------------------------------------------------------------------------
satell | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
width | .4633951 .1117381 4.15 0.000 .2428033 .6839868
_cons | -8.412887 3.133074 -2.69 0.008 -14.59816 -2.227618
------------------------------------------------------------------------------
The robust errors:
------------------------------------------------------------------------------
| Robust HC2
satell | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
width | .4633951 .1028225 4.51 0.000 .2604045 .6663856
_cons | -8.412887 2.939015 -2.86 0.005 -14.21505 -2.610726
------------------------------------------------------------------------------
Note that we’ve used HC2 robust errors, to provide a direct comparison with the R results.
The clustered errors.
------------------------------------------------------------------------------
| Robust
satell | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
width | .4633951 .0466564 9.93 0.002 .3149135 .6118767
_cons | -8.412887 1.258169 -6.69 0.007 -12.41694 -4.408833
------------------------------------------------------------------------------
The robust and clustered errors are a little tighter for this example.
The analysis in R.
Here’s the equivalent code in R.
source("tools_reg.R")
tbl <- read.csv("data.csv")
mod <- lm(satell ~ width + factor(color), data=tbl)
summary(mod)
get.coef.robust(mod, estimator="HC2")
get.coef.clust(mod, tbl$color)
And excerpts of the results. The OLS model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.4129 3.1331 -2.685 0.00798 **
width 0.4634 0.1117 4.147 5.33e-05 ***
The coefficient estimates will be the same for all three models. Stata gives us -8.412887 + .4633951 w
and R gives us -8.4129 + 0.4634 w
. These are identical, but with different reporting precision, as are the standard errors.
The robust errors:
(Intercept) -8.41289 2.93902 -2.8625 0.004739 **
width 0.46340 0.10282 4.5067 1.229e-05 ***
Again, we use HC2 robust errors. Stata provides 2.939015 and .1028225 for the errors, matching these exactly.
The clustered errors:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.412887 1.258169 -6.6866 3.235e-10 ***
width 0.463395 0.046656 9.9321 < 2.2e-16 ***
Stata gives the exact same values as R.