### Tuesday, March 07, 2006

## Re: st: R-squared in panel data models

Hi statalist and hi Bill!

I had the same problem as Ahmed (in 2003...) and your answer was extremely helpful. Just one question remains: I want to refer to the fact that "In the -xtreg, fe- calculation, we are washing out the explanatory effects of the intercepts" in my paper, but I would prefer to have a reference for that. Do you, or does anybody else, know whom I can cite?

Greetings Nina

Ahmed Diesel <ahmed.diesel@gmx.de> asked,

> Why is R-squared in panel data models always very low (so that everybody is happy about an R-squared of 10%)? I don't find any explanation about that in the literature.

Ahmed has asked a very deep question and one deserving of an answer.

The first answer, of course, is that what is acceptable depends on your science. Some people, hearing that answer, may think it means standards are lower in some sciences than in others, so that some scientists get away with things that other scientists couldn't dream of doing.

That would be a misinterpretation of the short answer. For different sciences, what is acceptable can vary because of the nature of the problem itself. It all depends on where the "noise" in the data resides, as I will explain. In one science, a result with a "poor" R-squared may in fact contain much more information than, in another science, a result with an R-squared near one.

I suspect Ahmed is an economist so I am going to answer using an economic example. Well, that's only half the reason. I was trained as an economist. Anyway, it is easy enough to recast my answer to other sciences and it is rather fun to do that, because what turns out to be important and unimportant can change.

The argument below has two parts:

1. (Substantive) It should not surprise you that if you compare Ahmed's earnings with his own earnings at different times, you can explain much of the variation with a few variables. If you compare Ahmed's earnings to, say, Nick Cox's earnings, those same few variables will explain less.

2. (Calculation) The R-squared reported with panel-dataset models is cross-sectional like rather than time-series like; it can be compared with R-squareds from cross-sectional regressions but cannot, without adjustment, be compared to R-squareds from time-series models.

Cross-sectional economic data
-----------------------------

A classic problem is annual earnings as a function of educational attainment, age, and labor-market experience:

ln(earnings_i) = a + b*ed_i + c1*age_i + c2*(age_i)^2 + d1*exp_i + d2*(exp_i)^2 + u_i

Now clearly, there are thousands of things that affect the level of earnings other than educational attainment, age, and labor-market experience, and those things will vary from person to person in the data. All those things are wrapped together in the residual, along with pure luck:

u_i = e*Z_i + pureluck_i

You should not be surprised when this simple model does a poor job, absolutely speaking, at explaining the level of earnings. Consider the subset of the data, persons with ed=16 (college graduate), age 35, and all having worked 14 years; you know you will see considerable variation in their earnings.

That, however, is not a criticism of the model. It may turn out that we accurately estimate a large effect for educational attainment. That would be useful information: we may not be able to explain across people the overall level of earnings very accurately, but we might very accurately be able to measure the effect of education.
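A small simulation can make this point concrete. The sketch below is illustrative only: the 10% return to a year of education, the noise level, and the sample size are made-up assumptions, not estimates from any dataset, and the model is stripped down to one regressor.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for y = a + b*x + u."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1.0 - ssr / tss

random.seed(12345)
TRUE_RETURN = 0.10  # hypothetical effect of one year of education on ln(earnings)
n = 5000

ed = [random.uniform(8, 20) for _ in range(n)]
# u_i lumps together everything unmeasured plus pure luck; deliberately large
u = [random.gauss(0.0, 1.0) for _ in range(n)]
ln_earn = [9.0 + TRUE_RETURN * e + ui for e, ui in zip(ed, u)]

a_hat, b_hat, r2 = simple_ols(ed, ln_earn)
print(f"estimated return to education: {b_hat:.3f} (true value {TRUE_RETURN})")
print(f"R-squared: {r2:.3f}")
```

Under these assumptions the slope should land close to the true value even though most of the variation in ln(earnings) stays in the residual, which is the situation described above.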

Time-series data
----------------

Now let's consider the same problem but this time, use time-series data to estimate it. What we are going to do is take one person from our cross sectional dataset (say the first person), and collect data over time, and then estimate:

ln(earnings_1t) = a + b*ed_1t + c1*age_1t + c2*(age_1t)^2 + d1*exp_1t + d2*(exp_1t)^2 + u_1t

I assert that, if you do this, you will find that you can explain the variation in earnings very well: R-squared will be high. The reason is that, this time, it will be the coefficient "a" rather than the residual u that includes the 1,000s of variables that we did not measure.

As a technical note, let me say that in the formulas we use to calculate R-squared, we do not really assign any explanatory power to the intercept, but that is misleading, because neither do we, in the data, ever observe any variation across person -- there is only one person. Thus, the net result is as if we did assign explanatory power to the intercept in the sense of cross-person variation.

I'll give you the math, but before that, just think about it. You take Ahmed Diesel and collect his earnings over time. Now you set about "explaining" his earnings. Ahmed's average earnings, by itself, will provide lots of explanatory power. Indeed, over a short enough period, Ahmed's average earnings might be constant, in which case we would have an R-squared of 1.

The math
--------

Let us now do the math. I will tell you that ln(earnings_it), for any person i in the world, at any time t, is given by

    ln(earnings_it) = a
                      + b*S_i     (things about the person)
                      + c*S_t     (things about the time)
                      + d*S_it    (things about the person and time)
                      + e_it      (a little noise)

Let me tell you that this model is very complete: I have talked not only to economists, but psychologists, epidemiologists, and even physicists. In fact, it was not until I talked to the physicists and they told me about quantum effects that I had any randomness in the model at all. This model has everything.

The problem with this model is that I have no hope of measuring most of the variables contained in S_i, S_t, and S_it.

Still, I set about estimating this model. First, I will use cross-sectional data. I will use data for t=2002. The first thing that happens to this model is that I lose all variation in time, so let me recollect terms:

    ln(earnings_i,2002) = a + c*S_2002             (intercept)
                          + b*S_i + d*S_i,2002     (things about person)
                          + e_i,2002               (noise)

Understand what just happened here: for t=2002, S_t = S_2002 is just a set of values that do not vary, so c*S_2002 becomes a single constant value. Let me write the above:

    ln(earnings_i) = a + c*S_2002     (intercept)
                     + c' * T_i       (T_i = (S_i, S_i,2002))
                     + noise_i

What I do next is divide T_i into that which I can measure and that which I cannot:

T_i = (M_i, U_i)

and that leads to

    ln(earnings_i) = a + c*S_2002          <- intercept
                     + c1 * M_i
                     + noise_i + c2*U_i    <- resulting residual

and there is the model I can estimate. The residual contains every person-specific thing I cannot measure, and the intercept contains every time-specific thing (measurable or not). (Economists: I have swept under the rug issues of the correlation of variables in the model; this is not important for explaining R-squared.)

Now let's do the time-series model. I start with the same model,

    ln(earnings_it) = a
                      + b*S_i     (things about the person)
                      + c*S_t     (things about the time)
                      + d*S_it    (things about the person and time)
                      + e_it      (a little noise)

and this time I set i=1 and end up with

    ln(earnings_t) = a + b*S_1             <- intercept
                     + c1 * M_t
                     + noise_t + c2*U_t    <- resulting residual

This time, the intercept contains every person-specific thing (measurable or not), and the residual contains every time-specific thing I cannot measure.
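The contrast between the two derivations can be imitated with simulated data. The following is only a sketch under strong assumptions: a single measured variable (experience), a large unmeasured person-specific term standing in for the S_i, no unmeasured time-specific terms at all, and invented parameter values throughout.

```python
import random

def simple_ols_r2(x, y):
    """R-squared from regressing y on a constant and one regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    ssr = sum(((yi - my) - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ssr / tss

random.seed(2002)
N_PEOPLE, N_YEARS = 2000, 20
RETURN_TO_EXP = 0.03  # hypothetical effect of a year of experience

# Unmeasured person-specific things: large variation across people
person_effect = [random.gauss(0.0, 1.0) for _ in range(N_PEOPLE)]
# Measured variable: experience at the start of the sample
start_exp = [random.uniform(0, 20) for _ in range(N_PEOPLE)]

# Full panel of ln(earnings), with only "a little noise" left over
ln_earn = [
    [9.0 + person_effect[i] + RETURN_TO_EXP * (start_exp[i] + t)
     + random.gauss(0.0, 0.02) for t in range(N_YEARS)]
    for i in range(N_PEOPLE)
]

# Cross-section: all people in year 0; person effects sit in the residual
r2_cross = simple_ols_r2(start_exp, [ln_earn[i][0] for i in range(N_PEOPLE)])

# Time series: person 0 over all years; the person effect sits in the intercept
exp_person0 = [start_exp[0] + t for t in range(N_YEARS)]
r2_time = simple_ols_r2(exp_person0, ln_earn[0])

print(f"cross-sectional R-squared: {r2_cross:.3f}")
print(f"time-series R-squared:     {r2_time:.3f}")
```

With these made-up numbers the cross-sectional R-squared should be small and the single-person time-series R-squared should be close to one, even though both regressions use the same data-generating process and the same coefficient on experience.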

Panel datasets
--------------

You can do the math for a panel dataset yourself. It's rather fun, but you end up with lots of terms as you divide each separate piece into observable and nonobservable.

Regardless, panel datasets are just a combination of cross-sectional and time-series datasets, and so you should expect the reported explanatory power of panel datasets to lie in between.

In fact, however, there is one more thing you need to know: when we calculate the explanatory power, we assign no explanatory power to the individual intercepts. This is no different from usual. What is different from the time-series case is that we do have variation across person, so we are in effect reporting a "cross-sectional like" R-squared. The reported R-squared can be compared with R-squareds from cross-sectional models, but not with time-series models.
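In symbols, one way to write the distinction is the following. This is a sketch, assuming the R-squared in question is the one computed on person-demeaned data; the notation (y_it for the outcome, \bar y_i for person i's mean, \bar y for the overall mean, \hat e_it for the fitted residuals) is introduced here for illustration and is not taken from the post.

```latex
R^2_{\text{no intercept credit}}
  = 1 - \frac{\sum_{i}\sum_{t} \hat e_{it}^{\,2}}
             {\sum_{i}\sum_{t} \left(y_{it} - \bar y_{i}\right)^{2}},
\qquad
R^2_{\text{intercepts credited}}
  = 1 - \frac{\sum_{i}\sum_{t} \hat e_{it}^{\,2}}
             {\sum_{i}\sum_{t} \left(y_{it} - \bar y\right)^{2}}
```

Both expressions use the same residuals; only the denominator differs. Because \sum_i\sum_t (y_{it} - \bar y)^2 = \sum_i\sum_t (y_{it} - \bar y_i)^2 + \sum_i T_i (\bar y_i - \bar y)^2, where T_i is the number of observations on person i, the second denominator is never smaller, so the second R-squared is never lower than the first.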

To better understand this detail, try the following experiment: run a fixed-effects model that has just a few fixed effects. First run it using -xtreg, fe-. Write down the R-squareds. Now run it using linear regression, creating the dummy variables for each of the persons for yourself. All results will be the same, except that the reported R-squared will be much higher. In the -xtreg, fe- calculation, we are washing out the explanatory effects of the intercepts. If you just run it using linear regression, those explanatory effects are not removed.
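For readers without Stata at hand, the experiment can be imitated in plain Python. This is a sketch under stated assumptions, not a reproduction of what -xtreg, fe- does internally: the "fixed-effects" regression is imitated by demeaning each person's data, the panel is simulated, and the five person intercepts, the slope, and the noise level are invented for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, R-squared about mean of y)."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    ybar = sum(y) / len(y)
    ssr = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return beta, 1.0 - ssr / tss

random.seed(2006)
INTERCEPTS = [-4.0, -2.0, 0.0, 2.0, 4.0]  # hypothetical person effects ("a few fixed effects")
T, SLOPE = 40, 0.5

person, x, y = [], [], []
for i, a_i in enumerate(INTERCEPTS):
    for _ in range(T):
        x_it = random.gauss(0.0, 1.0)
        person.append(i)
        x.append(x_it)
        y.append(a_i + SLOPE * x_it + random.gauss(0.0, 1.0))

# Fixed-effects style: subtract each person's mean, then regress
def demean(v):
    means = [sum(v[j] for j in range(len(v)) if person[j] == i) / T
             for i in range(len(INTERCEPTS))]
    return [v[j] - means[person[j]] for j in range(len(v))]

beta_within, r2_within = ols([[xd] for xd in demean(x)], demean(y))

# Linear regression with one dummy per person (the dummies span the constant)
X_dummies = [[x[j]] + [1.0 if person[j] == i else 0.0 for i in range(len(INTERCEPTS))]
             for j in range(len(x))]
beta_dummies, r2_dummies = ols(X_dummies, y)

print(f"slope, demeaned regression: {beta_within[0]:.4f}")
print(f"slope, dummy regression:    {beta_dummies[0]:.4f}")
print(f"R-squared, demeaned:        {r2_within:.3f}")
print(f"R-squared, with dummies:    {r2_dummies:.3f}")
```

The two slopes should agree to rounding error, while the R-squared from the dummy-variable regression should be far higher, because there the person intercepts are credited with explaining the variation across people.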

-- Bill wgould@stata.com

-- Nina Karstens, Department of Food Economics & Consumption Studies, University of Kiel
