This module follows on from the Pitfalls in statistics: averages, medians and distributions module and covers some more common pitfalls in statistics.
In November 2020, Danish researchers published the results of a study on the effect of mask wearing on the spread of covid-19. The researchers had recruited 5.000 volunteers and divided them into 2 groups: 2.500 people would wear a mask whenever they were in public spaces, and the other 2.500, the control group, would wear no mask.
After a period of 1 month, all participants in the study were tested for covid-19. Among the people who didn’t wear masks, 2,1 percent tested positive, while 1,8 percent of the mask wearers tested positive.
“A very small effect, wearing masks only reduces infections by 0,3 percent!” argued people opposed to mask wearing. But others argued that the reduction in infections was in fact 14 percent. That much higher number confused a lot of people.
https://twitter.com/ClarkeMicah/status/1339653223926423553
So what happened?
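The people behind the 0,3 percent number calculated the difference between two percentages (2,1% and 1,8%) and mistakenly called it a percentage difference. But in order to calculate a percentage difference, you need to take the difference between the new value and the initial value, and divide that difference by the initial value:
$$ percentageChange = \frac{newValue - initialValue}{initialValue} $$
Here the infection rate in the control group (2,1%) is the initial value and the infection rate among the mask wearers (1,8%) is the new value. Putting in these values from the study, the effect of the wearing of masks can be calculated:
$$ percentageChange = \frac{1{,}8 - 2{,}1}{2{,}1} \approx -0{,}14 = -14\% $$
The result is negative because the infection rate went down: among the mask wearers it was 14 percent lower than in the control group.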
So the effect of mask wearing that the researchers had found was indeed a reduction of 14 percent. The simple difference between the two percentages is not a percentage difference, but a difference in percentage points. So you could say either that the study had measured an effect of 0,3 percentage points, or an effect of 14 percent.
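The two calculations can be put side by side in a few lines of Python. This is only a minimal sketch: the function and variable names are made up for this illustration, and the numbers are the rounded infection rates quoted above, written in Python's notation with a dot as the decimal separator.

```python
def percentage_point_difference(initial_pct, new_pct):
    """Simple subtraction of two percentages, in percentage points."""
    return new_pct - initial_pct


def percent_change(initial_value, new_value):
    """Change relative to the initial value, expressed in percent."""
    return (new_value - initial_value) / initial_value * 100


# Infection rates from the text, in percent.
no_mask = 2.1  # control group: the initial value
mask = 1.8     # mask wearers: the new value

pp_diff = percentage_point_difference(no_mask, mask)
rel_change = percent_change(no_mask, mask)

print(f"{pp_diff:+.1f} percentage points")  # -0.3 percentage points
print(f"{rel_change:+.1f} percent")         # -14.3 percent
```

Both numbers describe the same measurement. Only the unit differs, which is why the unit should always be stated.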
In the end, it didn’t really matter, as the study was inconclusive given the sample size and the measured effect. But in order to avoid any misunderstanding about the size of an effect or of a change in values, make sure to call the result of a simple subtraction of percentages a difference in percentage points.
Consider the following table, showing the electricity consumption of the 5 EU member states that consumed the most electricity in 2020:
Rank | Country | Electricity consumption (Gigawatt-hour) |
---|---|---|
1 | Germany | 490.054 |
2 | France | 420.356 |
3 | Italy | 283.814 |
4 | Spain | 227.172 |
5 | Poland | 148.241 |
Should we conclude from this table that the Germans are the biggest electricity consumers in Europe? In absolute numbers, yes. But if you know a thing or two about European demography, you will notice that this top 5 of electricity consumers is also the top 5 in population:
Rank | Country | Electricity consumption (Gigawatt-hour) | Population |
---|---|---|---|
1 | Germany | 490.054 | 83.166.711 |
2 | France | 420.356 | 67.320.216 |
3 | Italy | 283.814 | 59.641.488 |
4 | Spain | 227.172 | 47.332.614 |
5 | Poland | 148.241 | 37.958.138 |
This makes sense, of course: more inhabitants will consume more electricity. But in order to compare electricity consumption between countries, population should be factored in: the consumption of each country should be divided by its population.
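As a rough sketch of that division, the Python snippet below uses the five rows of the table above and converts gigawatt-hours to kilowatt-hours per person. The names `countries`, `per_capita_kwh` and `ranking` are chosen for this illustration only. Note that it re-ranks just these five countries. It says nothing about the other EU member states, which are not in the table.

```python
# Values copied from the table above:
# (electricity consumption in GWh, population).
countries = {
    "Germany": (490_054, 83_166_711),
    "France": (420_356, 67_320_216),
    "Italy": (283_814, 59_641_488),
    "Spain": (227_172, 47_332_614),
    "Poland": (148_241, 37_958_138),
}

KWH_PER_GWH = 1_000_000  # 1 gigawatt-hour = 1 million kilowatt-hours


def per_capita_kwh(consumption_gwh, population):
    """Electricity consumption per inhabitant, in kWh."""
    return consumption_gwh * KWH_PER_GWH / population


per_capita = {
    name: per_capita_kwh(gwh, population)
    for name, (gwh, population) in countries.items()
}

# Sort the country names by per capita consumption, highest first.
ranking = sorted(per_capita, key=per_capita.get, reverse=True)

for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {per_capita[name]:,.0f} kWh per person")
```

Even among these five countries the order changes once population is factored in: Germany no longer comes first.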
The top 5 of per capita electricity consumers in the EU looks completely different from the one based on absolute numbers: