### Wednesday, March 01, 2006

## st: Chi-square goodness of fit with grouped counts

This is actually a statistical question rather than a programming one. I have data by market (e.g. LA to NYC, CHI to LA, etc.) on the number of flights cancelled in a given time period. For example, market A may have 400 cancellations, market B 1327 cancellations, and so on. (I also have the total number of scheduled flights for each market in that same time period.)

I am interested in analyzing whether there is any significant pattern in the distribution of cancellations across short-, medium-, and long-distance markets. I'm thinking of using a chi-square goodness of fit test, comparing an expected distribution of cancellations across these market categories with the observed one. The problem is that I don't have standard frequency data: I have no data on individual flights, only the number of cancellations by market.

At first, I thought I could add up the cancellations in all short-distance markets to get the observed number of short-distance cancellations, and do the same for the medium- and long-distance markets. However, something about this doesn't seem right, and I get huge chi-square statistics when I do the calculations this way.
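To make the pooled calculation concrete, here is a minimal Python sketch of it. The market figures are made up for illustration, and the expected distribution is assumed to be proportional to each category's share of scheduled flights, which is only one possible choice of null. The function name `pooled_chi_square` is likewise just a label for this example.

```python
import math
from collections import defaultdict

# Hypothetical market-level data (not real figures):
# (distance class, scheduled flights, cancellations)
markets = [
    ("short", 20000, 400),
    ("short", 15000, 350),
    ("medium", 30000, 500),
    ("medium", 45000, 1327),
    ("long", 25000, 300),
    ("long", 10000, 150),
]

def pooled_chi_square(rows):
    """Pool markets by distance class and compute a goodness-of-fit statistic.

    Assumption: expected cancellations in a class are proportional to that
    class's share of all scheduled flights.
    """
    scheduled = defaultdict(int)
    cancelled = defaultdict(int)
    for cls, n_sched, n_canc in rows:
        scheduled[cls] += n_sched
        cancelled[cls] += n_canc

    total_sched = sum(scheduled.values())
    total_canc = sum(cancelled.values())

    stat = 0.0
    for cls in scheduled:
        expected = total_canc * scheduled[cls] / total_sched
        stat += (cancelled[cls] - expected) ** 2 / expected

    df = len(scheduled) - 1  # categories minus one
    return stat, df

stat, df = pooled_chi_square(markets)

# Multiply every count by 10 while keeping all proportions the same.
scaled = [(cls, s * 10, c * 10) for cls, s, c in markets]
stat_scaled, _ = pooled_chi_square(scaled)

print(f"chi-square = {stat:.1f} on {df} df")
print(f"same proportions, 10x the counts: chi-square = {stat_scaled:.1f}")
```

With proportions held fixed, the statistic grows in direct proportion to the total count, so pooling thousands of cancellations can produce a very large value even when the category shares differ only modestly. The pooled test also treats every cancellation as an independent observation, which may not hold if cancellations cluster within markets.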

Is there a way to use a chi-square goodness of fit test in this context, and, if so, how should I account for my actual number of observations being equal to the number of markets and not the number of scheduled flights?

Tag: statalist