It's Sunday night - time for the wind-down of the weekend before the week starts up again.
I was reviewing a paper for a journal tonight. I really liked the idea behind the paper and they had a great dataset. But they really botched the analysis. The reason for it is not that hard to understand, so let me try to explain it. I can't tell you about the paper, so let me choose an imaginary example: the number of parking tickets written on a given day.
If someone asked you, "Does the number of parking tickets written on a given day in the average American town vary by day of week?", you'd look both directions and wonder where the hidden camera was. (Does anybody remember Allen Funt and Candid Camera?) If you got beyond that, you might think: I could get this info for a number of towns and pool the info. But the towns differ in size, so how can I pool information over towns in a way that takes account of the size of each town? I know, why don't I calculate for each town the expected number of parking tickets per day, and compare each day's count to it. For example, assume Bigtown had 313 days on which tickets were written and 31,300 issued tickets that year. You'd expect 100/day. If they had 150 on the fourth of July, that would be what percent higher than expected? (OK: (150 - 100)/100 = 50% above expected.) That makes sense.
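If you like to see arithmetic as code, here is a tiny Python sketch of that calculation. The function name is mine, purely for illustration, and the numbers are the made-up Bigtown ones.

```python
def percent_above_expected(observed, total_tickets, ticket_days):
    """Percent by which one day's count exceeds the town's daily average."""
    expected = total_tickets / ticket_days  # e.g. 31,300 / 313 = 100 per day
    return 100.0 * (observed - expected) / expected


# Bigtown on the fourth of July: 150 tickets against an expected 100.
print(percent_above_expected(150, 31_300, 313))  # 50.0
```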
Now, does it make sense to pool these relative percentages over towns? You wouldn't want to pool just the raw counts, because then Morris, IL (population 12,000) would be dwarfed next to Chicago, Milwaukee, etc. So, you might hope using relative proportions would make sense.
But here's the problem: what if Podunk issues only 1 ticket a day on average? (Imagine Podunk as being rather like Mayberry, USA, on The Andy Griffith Show.) On July 4 they might issue 50 to all those out-of-town people who don't know how to behave in a small town. This would be a 4900% increase in one day! Do you think that should carry the same weight as a change from 100 a day to 150 a day in Bigtown, or is it more like the difference between 100 a day and 5000 a day?
It's probably safe to assume that you'd think there is a difference between 50 on the fourth of July in Podunk and 5000 in Bigtown (or even 149 in Bigtown). Relative proportions just aren't a useful way of combining information across cities that differ widely in size.
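To see how lopsided things get, here is a small Python sketch that averages the relative percentages for the two imaginary towns, giving each town equal weight. The dictionary layout and names are my own, just for illustration.

```python
# Made-up numbers from the example:
# town -> (expected tickets per day, tickets written on July 4)
towns = {"Bigtown": (100, 150), "Podunk": (1, 50)}


def percent_above(expected, observed):
    """Percent by which the observed count exceeds the expected count."""
    return 100.0 * (observed - expected) / expected


percents = {name: percent_above(e, o) for name, (e, o) in towns.items()}

# Equal-weight average of the two percentages.
simple_average = sum(percents.values()) / len(percents)

print(percents)        # {'Bigtown': 50.0, 'Podunk': 4900.0}
print(simple_average)  # 2475.0
```

The "average" increase of 2475% is almost entirely Podunk's 49 extra tickets. Bigtown's 50 extra tickets barely register.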
The paper I was reviewing was not about parking tickets, but about deaths. So, we need to be both careful and respectful in the way we analyze the data. We need to combine the data in an appropriate fashion so that the conclusions we draw reflect the variability in the data. The people who wrote the paper were calculating these intermediate statistics (relative proportions) and then treating them as if they were infinitely accurate - and ignoring the fact that a handful of deaths tells us far less, statistically, than hundreds or thousands of deaths do.
For those who are curious, Poisson regression would be a better (and more honest) way to combine information across cities. The estimates produced in that fashion would reflect the fact that larger counts (and larger denominators!) give us more information.
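For the curious, here is a minimal Python sketch of the simplest possible case, under assumptions that are mine and not the paper's: one common rate ratio for July 4 across towns, each town's expected count treated as known and used as an offset, and no overdispersion. In that intercept-only Poisson model the maximum likelihood estimate has a closed form (total observed divided by total expected), so no fitting library is needed. A real analysis would use a regression package and include day-of-week terms.

```python
import math


def se_log_ratio(total_observed):
    """Approximate standard error of the log rate ratio: 1 / sqrt(total count)."""
    return 1.0 / math.sqrt(total_observed)


def pooled_rate_ratio(observed, expected):
    """MLE for an intercept-only Poisson model with log(expected) as offset.

    Returns the rate ratio and an approximate 95% interval for it.
    """
    total_obs = sum(observed)
    total_exp = sum(expected)
    ratio = total_obs / total_exp      # closed-form MLE
    se = se_log_ratio(total_obs)       # Wald standard error on the log scale
    low = ratio * math.exp(-1.96 * se)
    high = ratio * math.exp(1.96 * se)
    return ratio, (low, high)


# Made-up numbers: Bigtown (expected 100, saw 150), Podunk (expected 1, saw 50).
ratio, interval = pooled_rate_ratio([150, 50], [100, 1])
print(ratio)      # about 1.98, i.e. roughly double the expected count
print(interval)

# Bigger counts pin the ratio down more tightly.
print(se_log_ratio(5), se_log_ratio(5000))
```

Each town now contributes in proportion to its counts, and the width of the interval shrinks as the total count grows, which is exactly the information the relative percentages throw away.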