This post was conceived as a companion piece to help evaluate my previous post (A month of Mercy nerf in numbers). Other than that, it’s not really going to be about Mercy (though I use her to work out examples). It’s about everyone’s other favorite subject - math. More specifically, I want to address the question: how much trust should we place in Overbuff statistics?
I’ve seen a lot of people claim that Overbuff’s numbers are meaningless because of private profiles (or because, even before private profiles, Overbuff only received statistics from a subset of the player base). This claim is false. You absolutely can draw very good conclusions about a large data set by sampling a sizable piece of it at random.
Here’s an example. A very big bag is full of blue tokens and red tokens. The bag is shaken thoroughly to mix up the tokens. Then you spill 100,000 tokens on the floor. Of these, 60,000 are blue and 40,000 are red. The remaining tokens stay in the bag (in the Overwatch context, these are the “private” tokens). What percent of the tokens in the bag are blue?
You can’t know for sure without looking at every single token. What you can do is make claims, and then give probabilities that they’re true. The bag should contain about 60% blue tokens and 40% red tokens. If you claim that the bag has exactly 60% blue tokens, then you are probably wrong. If you claim that the bag has somewhere between 59% and 61% blue tokens, then you are almost surely right (with a probability that can be calculated and shown to be very, very close to 100%). The more tokens you draw from the bag, the closer the percentage should be to the actual percentage in the whole bag. In math, we call this the law of large numbers.
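If you’d like to see this in action, here’s a minimal sketch in Python. The seed, the exact 60/40 split, and the approximation of a “very big bag” by independent draws are all just illustration:

```python
import math
import random

random.seed(0)

# The bag is "very big", so drawing 100,000 tokens at random is essentially
# the same as 100,000 independent draws that each come up blue 60% of the time.
n = 100_000
p_true = 0.60
blue = sum(random.random() < p_true for _ in range(n))
observed = blue / n

# Standard error of a sampled proportion: sqrt(p * (1 - p) / n).
se = math.sqrt(p_true * (1 - p_true) / n)
print(f"observed blue fraction: {observed:.2%}")
print(f"standard error: {se:.3%}")

# The 59%-61% window extends about 6.5 standard errors to each side of 60%,
# which is why the "somewhere between 59% and 61%" claim is almost surely right.
print(f"half-width of the 59%-61% window in standard errors: {0.01 / se:.1f}")
```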
You end up replacing a certain claim (exactly 60%) with a more nuanced one, together with a measure of how sure you are about it. This shouldn’t worry you. Your entire life is lived like this. For instance, your brain processes information from your eyes at a certain frame rate, and your eyes themselves have many missing pixels (so there are “private” pixels and “private” moments in time). Everything you see at all times is based on probability. That’s the way almost all of human experience works.
The situation in Overwatch is almost identical, except we’re not sampling entirely at random. For instance, we’re only sampling people with public profiles. The good news is that there’s excellent reason (supported by data analysis) to believe that private profiles aren’t correlated with, say, skill at playing Roadhog. So the conclusions still hold.
So, now on to the main question that I wanted to discuss. How much should you trust Overbuff data? Like in the bag of tokens example, it really depends on how many games Overbuff records. If you drew 2 tokens out of the bag, your conclusions would be off. If you drew a trillion, they would be really good. So, how many games does Overbuff record?
We can’t know exactly without someone from Overbuff chiming in, but we can get a very good estimate. The thing that comes into play here, and will be important for the discussion later on, is a beautiful theorem called the central limit theorem. I won’t state the theorem itself, since it’s a bit too technical for a forum post, but it deals exactly with the kind of situation we have here - repeating a random event over and over again and trying to understand its averages. The central limit theorem is the source of the bell curves that you might have seen pop up in all kinds of aspects of your life.
Suppose you record N games in which Mercy was played (I told you there would be a little bit of Mercy here). Suppose that Mercy is balanced such that she should win about b% of the time, but that in those recorded games she won about a%. How far apart can we reasonably expect a% and b% to be? One thing that the central limit theorem (and its quantitative companion, the Berry-Esseen theorem) says is that there is about a 68% chance that a% and b% are within one standard deviation of each other. Since b% is pretty close to 50%, the standard deviation of a single game’s outcome is close to its maximum of 1/2, which would give a margin of 50/sqrt(N) percentage points; the margin used throughout this post, 100/sqrt(1.4 N) percentage points, is somewhat wider than that, so it errs on the conservative side.
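If you’d rather not take the theorem on faith, you can simulate it. The sketch below is illustrative only (the game count, trial count, and seed are arbitrary); it also shows that the 100/sqrt(1.4 N) margin is wider than the exact binomial one, so its stated confidence levels come out on the safe side:

```python
import math
import random

def post_margin(n: int) -> float:
    """One-standard-deviation margin used in this post, in percentage points."""
    return 100 / math.sqrt(1.4 * n)

def binomial_margin(n: int) -> float:
    """Textbook one-standard-error margin for a near-50% win rate: 50 / sqrt(N)."""
    return 50 / math.sqrt(n)

# Simulate N games at a true 50% win rate, many times over, and count how
# often the observed win rate lands within each margin of the true rate.
random.seed(1)
n_games, trials, b = 10_000, 1_000, 0.50
within_post = within_binom = 0
for _ in range(trials):
    wins = sum(random.random() < b for _ in range(n_games))
    deviation = abs(100 * wins / n_games - 100 * b)
    within_post += deviation <= post_margin(n_games)
    within_binom += deviation <= binomial_margin(n_games)

print(f"post margin {post_margin(n_games):.2f} pp -> coverage {within_post / trials:.0%}")
print(f"binomial margin {binomial_margin(n_games):.2f} pp -> coverage {within_binom / trials:.0%} (about 68%)")
```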
So I looked at the time period between late June (when private profiles were introduced) and early August (when the support balance patch came in) and found several characters whose pick rates were stable (Mercy was one of them). For each day, I found how far the win rate for that day was from the average win rate during that time period. Then I found the cutoff that 68% of those daily deviations fell below, set it equal to one standard deviation, and solved for N.
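Here’s a sketch of that back-solving step. The daily win rates below are invented purely for illustration (the real calculation used Overbuff’s actual day-by-day numbers):

```python
import math

# Hypothetical daily win rates (in percent) over a stable stretch.
daily_win_rates = [49.6, 50.3, 50.1, 49.2, 50.8, 49.9, 50.4,
                   50.0, 49.5, 50.6, 49.8, 50.2, 49.7, 50.5]

mean = sum(daily_win_rates) / len(daily_win_rates)
deviations = sorted(abs(x - mean) for x in daily_win_rates)

# The cutoff that ~68% of the daily deviations fall below is roughly one
# standard deviation of a single day's win rate.
one_sd = deviations[int(0.68 * len(deviations))]

# Invert the margin formula 100 / sqrt(1.4 * N) to solve for N, the number
# of recorded games per day in which this hero appeared.
n_games = 100**2 / (1.4 * one_sd**2)
print(f"estimated one-standard-deviation margin: {one_sd:.2f} pp")
print(f"implied recorded games per day for this hero: {n_games:,.0f}")
```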
Now, this is only an estimate, but the upshot is that Overbuff records around 100,000 games a day. Using Blizzard’s statistics about player distribution (yes, I know that player distribution isn’t the same as the distribution of the number of games per day, but they’re close), that works out to roughly 8,000 games a day in Bronze, 21,000 in Silver, 32,000 in Gold, 25,000 in Platinum, 10,000 in Diamond, 3,000 in Masters, and 1,000 in GM.
Now back to the central limit theorem. Say we are trying to understand Mercy’s win rate across all ranks. Mercy’s pick rate in all ranks is currently 7.32%, so she appears in about 7.32% * 6 * 100,000 = 43,920 recorded games a day. The central limit theorem tells us that there is about a 68% chance that her observed average is within one standard deviation of the actual average. This means we have a 68% level of confidence that the Overbuff average is at most 0.4% away from the actual average, and about a 95% confidence that it is at most 0.8% away (two standard deviations). That’s about the level of trust we should have in a daily all-ranks win rate on Overbuff.
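The arithmetic fits in a few lines. A small sketch (the factor of 6 mirrors the 7.32% * 6 * 100,000 computation above; the margin formula is the one from earlier):

```python
import math

def margin(pick_rate: float, games_per_day: int, sds: float = 1.0) -> float:
    """Expected gap (in percentage points) between observed and true win
    rate, at the given number of standard deviations."""
    n = pick_rate * 6 * games_per_day  # hero appearances per day
    return sds * 100 / math.sqrt(1.4 * n)

# Mercy across all ranks: 7.32% pick rate, ~100,000 recorded games a day.
print(f"one standard deviation:  {margin(0.0732, 100_000):.2f} pp (about 68% confidence)")
print(f"two standard deviations: {margin(0.0732, 100_000, 2):.2f} pp (about 95% confidence)")
```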
In Diamond, for instance, we get a 68% chance that her daily average is within 1.4% of the correct average and about a 95% confidence that it’s within 2.8%. That’s not amazing. But we can improve it by quite a bit. The first thing we can do is limit the direction - we have an 84% confidence that the correct average is not MORE than 1.4% above the Overbuff average for that day (in this case I don’t really care about it being less), and a 98% confidence that it’s not more than 2.8% above the Overbuff average.
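Where do 68%, 84%, 95%, and 98% come from? They’re just areas under the bell curve at one and two standard deviations, which you can compute with nothing but the standard library:

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for z in (1, 2):
    two_sided = normal_cdf(z) - normal_cdf(-z)  # within z sds in either direction
    one_sided = normal_cdf(z)                   # not more than z sds in one direction
    print(f"{z} sd: two-sided {two_sided:.0%}, one-sided {one_sided:.0%}")

# 1 sd: two-sided 68%, one-sided 84%
# 2 sd: two-sided 95%, one-sided 98%
```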
Better, but still not great. So, what do we do? We take more games. We calculate the average win rate over a whole week (if you want to be accurate, you should weight each day by its pick rate). This pools 7 times as many games, so the accuracy improves. When you do this, you get an 84% confidence that the actual average is not more than 0.5% above the Overbuff weekly average in Diamond, and a 98% confidence that it’s not more than 1% above it. And if that accuracy isn’t enough, you can average over a whole month. This gives you an 84% confidence that you’re not more than 0.25% too low, and a 98% confidence that you’re not more than 0.5% too low. Not bad. In all ranks, the monthly average has about an 84% chance of not being more than 0.073% too low, and a 98% chance of not being more than 0.146% too low.
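In code, pooling days just multiplies the game count. The sketch below reuses the all-ranks 7.32% pick rate because I don’t have Mercy’s Diamond-specific pick rate, so the output lands close to, but not exactly on, the figures above:

```python
import math

def pooled_margin(pick_rate: float, games_per_day: int, days: int = 1,
                  sds: float = 1.0) -> float:
    """Margin in percentage points after pooling `days` days of games."""
    n = pick_rate * 6 * games_per_day * days
    return sds * 100 / math.sqrt(1.4 * n)

# Mercy in Diamond: ~10,000 recorded games a day.
for days, label in [(1, "daily"), (7, "weekly"), (30, "monthly")]:
    one_sd = pooled_margin(0.0732, 10_000, days)
    two_sd = pooled_margin(0.0732, 10_000, days, 2)
    print(f"{label:>7}: 84% -> not more than {one_sd:.2f} pp too low, "
          f"98% -> not more than {two_sd:.2f} pp too low")
```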
In Masters and GM, you really shouldn’t pay attention to daily numbers at all - only to weekly and monthly averages, especially for characters with low pick rates. Mercy’s pick rate in GM is about 2%, which means Overbuff only records about 100 Mercy games a day there. With that little data, you only have a 68% confidence of being within 8.3% of the correct number (or an 84% chance of not being more than 8.3% too low). That’s really bad. However, if you average over a week, then you’re 84% confident you’re not more than 3% too low, and if you average over a month, then you’re 84% confident you’re not more than 1.5% too low. Similarly, in Masters, if you average over a month you’re 84% confident that you’re not more than 0.85% too low. Note: you have to be careful when averaging over a month. The days with higher pick rates should count more than the days with lower pick rates, because they represent more games. And if this level of confidence isn’t enough? Wait another month and average again. Doubling the data divides the margins attached to your 84% and 98% confidence levels by about 1.4 (that is, by sqrt(2)).
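Here’s what that pick-rate weighting looks like in practice. The GM numbers below are made up for illustration; the point is only that days with more recorded games pull the average harder:

```python
# Hypothetical daily GM numbers for one week: (pick rate, win rate in %).
week = [(0.021, 48.2), (0.018, 51.5), (0.023, 49.0), (0.019, 50.8),
        (0.022, 49.6), (0.020, 50.1), (0.017, 48.9)]

# Weight each day's win rate by its pick rate, since pick rate is a proxy
# for how many games that day contributed.
total_weight = sum(pick for pick, _ in week)
weighted_avg = sum(pick * win for pick, win in week) / total_weight
plain_avg = sum(win for _, win in week) / len(week)

print(f"plain average:    {plain_avg:.2f}%")
print(f"weighted average: {weighted_avg:.2f}%")
```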
The numbers given here are just guidelines. They’re based on estimates and approximations, but the math behind them is very real, and they make good guidelines. There are other places to go from here that can increase the confidence levels even further. For instance, if the character you’re interested in spends more days below a certain number than above it, that increases your confidence that the number is low. Or, if the character you’re interested in spends most days very close to a certain number, that increases the likelihood that this number is close to the correct average.
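For the first heuristic, one standard way to put a number on it is a sign test: if the true average really sat at the proposed value, each day would be (roughly) equally likely to land above or below it. A minimal sketch, with made-up counts:

```python
import math

def sign_test_pvalue(days_below: int, total_days: int) -> float:
    """Probability of seeing at least `days_below` days under a threshold,
    out of `total_days`, if each day were independently 50/50 to land
    above or below it."""
    favorable = sum(math.comb(total_days, k)
                    for k in range(days_below, total_days + 1))
    return favorable / 2**total_days

# E.g. 22 of 30 days below a proposed win rate: very unlikely if the true
# average were really that high, so the proposed value is probably too high.
print(f"p = {sign_test_pvalue(22, 30):.3f}")
```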
I hope this helps people interpret data in future discussions (and not only Mercy-based ones).