On Mercy and data

This post was conceived as a companion piece for evaluating my previous post (A month of Mercy nerf in numbers). Other than that, it’s not really going to be about Mercy (though I use her to work out examples). It’s about everyone’s other favorite subject - math. More specifically, I want to address the question: how much trust should we place in Overbuff statistics?

I’ve seen a lot of people claim that Overbuff’s numbers are meaningless due to private profiles (or due to the fact that even before private profiles Overbuff only received statistics from a subset of the player base). This claim is false. You absolutely can make very good claims about a large data set by sampling a large piece of it at random.

Here’s an example. A very big bag is full of blue tokens and red tokens. The bag is shaken thoroughly to mix up the tokens. Then you spill 100,000 tokens on the floor. Of these, 60,000 are blue and 40,000 are red. The remaining tokens stay in the bag (in the Overwatch context, these are the “private” tokens). What percent of the tokens in the bag are blue?

You can’t know for sure without looking at every single token. What you can do is make claims, and then give probabilities that they’re true. The bag should contain about 60% blue tokens and 40% red tokens. If you claim that the bag has exactly 60% blue tokens, then you are probably wrong. If you claim that the bag has somewhere between 59%-61% blue tokens then you are almost surely right (with a probability that can be calculated and shown to be very very close to 100%). The more tokens you draw from the bag, the closer the percentage should be to the actual percentage in the whole bag. In math, we call this the law of large numbers.
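The token experiment is easy to simulate. Here is a minimal sketch, with the true blue fraction (60%) and sample size (100,000) taken from the example above:

```python
# A minimal simulation of the token-bag example. The true blue fraction
# (60%) and sample size (100,000) are taken from the example above.
import random

random.seed(1)                # fixed seed so the run is reproducible
TRUE_BLUE = 0.60              # fraction of blue tokens in the whole bag
SAMPLE_SIZE = 100_000         # tokens spilled on the floor

# Spill tokens at random and count the blue ones.
blue = sum(random.random() < TRUE_BLUE for _ in range(SAMPLE_SIZE))
estimate = blue / SAMPLE_SIZE

print(f"sample estimate of blue fraction: {estimate:.4f}")
# With 100,000 tokens, the estimate lands inside the 59%-61% band
# essentially every time:
assert 0.59 < estimate < 0.61
```

Rerun it with a smaller sample (say 100 tokens) and the estimate wanders much further from 60% - that's the law of large numbers at work.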

You end up replacing a certain claim (exactly 60%) with a more nuanced one, together with a measure of how sure you are about it. This shouldn’t worry you. Your entire life is lived like this. For instance, your brain processes information from your eyes at a certain frame rate, and your eyes themselves have many missing pixels (so there are “private” pixels and “private” moments in time). Everything you see at all times is based on probability. That’s the way almost all of human experience works.

The situation in Overwatch is almost identical, except we’re not sampling entirely at random. For instance, we’re only sampling people with public profiles. The good news is that there’s excellent reason (supported by data analysis) to believe that private profiles aren’t correlated with, say, skill at playing Roadhog. So the conclusions still hold.

So, now on to the main question that I wanted to discuss. How much should you trust Overbuff data? Like in the bag of tokens example, it really depends on how many games Overbuff records. If you drew 2 tokens out of the bag, your conclusions would be off. If you drew a trillion, they would be really good. So, how many games does Overbuff record?

We can’t know exactly without someone from Overbuff chiming in, but we can get a very good estimate. The thing that comes into play here, and will be important for the discussion later on, is a beautiful theorem called the central limit theorem. I won’t state the theorem itself, since it’s a bit too technical for a forum post, but it deals exactly with the kind of situation we have here - repeating a random event over and over again and trying to understand its averages. The central limit theorem is the source of the bell curves that you might have seen pop up in all kinds of aspects of your life.

Suppose you record N games in which Mercy was played (I told you there would be a little bit of Mercy here). Suppose that Mercy is balanced such that she should win about b% of the time (and b% is pretty close to 50%, so the standard deviation for one Mercy game is close to 1/2), but that in those games she won about a%. How far apart can we reasonably expect a% and b% to be? One thing that the central limit theorem (and its companion, the Berry-Esseen theorem) says is that there is roughly a 68% chance for a% and b% to be within about 100/(2 sqrt(N))% of each other (i.e. within one standard deviation).

So I looked at the time period between late June (when private profiles were introduced) and early August (when the support balance patch came in) and found several characters whose pick rates were stable (Mercy was one of them). For each day, I found out how far the win rates for that day were from the average win rate during that time period. Then I found the percentage that 68% of the results fell below, and solved for N.

Now, this is only an estimate, but the upshot is that Overbuff records around 80,000 games a day. Using Blizzard’s statistics about player distribution (yes, I know that player distribution isn’t the same as the distribution of the number of games per day, but they’re close) we get roughly: 6,400 games a day in Bronze, 16,800 in Silver, 25,600 in Gold, 20,000 in Platinum, 8,000 in Diamond, 2,400 in Masters, and 800 in GM.
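Those per-tier counts are just the 80,000-games-a-day estimate multiplied by the rank distribution. The shares in this sketch are back-derived from the numbers above (e.g. 6,400 / 80,000 = 8% Bronze), so treat them as rough assumptions rather than official Blizzard figures:

```python
DAILY_GAMES = 80_000  # estimated total games Overbuff records per day

# Rank shares back-derived from the per-tier numbers in the post;
# rough assumptions, not official Blizzard figures.
rank_share = {
    "Bronze": 0.08, "Silver": 0.21, "Gold": 0.32,
    "Platinum": 0.25, "Diamond": 0.10, "Masters": 0.03, "GM": 0.01,
}

games_per_tier = {rank: int(DAILY_GAMES * share)
                  for rank, share in rank_share.items()}
print(games_per_tier)  # {'Bronze': 6400, 'Silver': 16800, ...}
```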

Now back to the central limit theorem. Say we are trying to understand Mercy’s win rates in all ranks. Mercy’s pick rate in all ranks is currently 7.32%, so she appears in about 7.32% * 6 * 80,000 = 35,136 games (a team has six slots, so a hero’s chance of appearing on a team is about six times her pick rate). The central limit theorem tells us that we have about a 68% confidence that her measured average is within one standard deviation of the actual average. This means that we have a 68% level of confidence that the average on Overbuff is at most 0.26% away from the actual average. It also gives about a 95% confidence that the average is at most 0.52% away. That’s about the level of trust we should have in a daily all-ranks win rate on Overbuff.
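As a sketch, here is that calculation spelled out (the 7.32% pick rate and 80,000 games/day are the estimates from above):

```python
import math

DAILY_GAMES = 80_000   # estimated games Overbuff records per day
PICK_RATE = 0.0732     # Mercy's all-ranks pick rate
TEAM_SLOTS = 6         # six players per team

# Games in which Mercy appears on a team.
mercy_games = PICK_RATE * TEAM_SLOTS * DAILY_GAMES

# One standard deviation of the daily win rate, in percentage points,
# assuming a per-game standard deviation of 1/2.
sigma_pct = 100 * 0.5 / math.sqrt(mercy_games)

print(round(mercy_games))     # 35136
print(round(sigma_pct, 2))    # 0.27 (the post rounds this down to 0.26)
```

Two standard deviations (about 95% two-sided confidence) is just double that band.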

In Diamond, for instance, with a pick rate of about 4%, we get a 68% chance that her daily averages are within 1.14% of the correct average and a 95% confidence that they’re within 2.28%. That’s not amazing. But we can improve it by quite a bit. The first thing we can do is limit the direction: we have an 84% confidence that the correct average is not MORE than 1.14% above the Overbuff average for that day (in this case I don’t really care about it being less) and a 98% chance that the correct average is not more than 2.28% above the Overbuff average.
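The 84% and 98% one-sided figures come straight from the normal distribution’s CDF at one and two standard deviations. A quick check using only the standard library:

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """P(Z <= z) for a standard normal random variable Z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# One-sided confidence that the true average is not more than
# k standard deviations above the measured one:
print(round(normal_cdf(1) * 100, 1))  # 84.1 -> the "84%" in the text
print(round(normal_cdf(2) * 100, 1))  # 97.7 -> the "98%" in the text
```

The one-sided bound is tighter than the two-sided one because all the uncertainty you’re willing to tolerate sits on a single side of the bell curve.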

Better, but still not great. So, what do we do? We take more games. We calculate the average win rate over a whole week (to be accurate, you should weight each day by its pick rate). This gives seven times as many games, so the accuracy improves. When you do this, you get an 84% confidence that the actual average is not more than about 0.43% above the Overbuff weekly average in Diamond, and a 98% confidence that it’s not more than 0.86% above it. And if that accuracy isn’t enough, you can average over a whole month. This gives an 84% confidence that you’re not more than 0.2% too low, and a 98% confidence that you’re not more than 0.4% too low. Not bad. In all ranks, the monthly average has about an 84% chance of not being more than 0.048% too low, and a 98% chance of not being more than 0.096% too low.
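Here is a small sketch of how averaging over more days tightens the band, using the Diamond estimates from above (~4% pick rate, ~8,000 Diamond games recorded a day):

```python
import math

def band_pct(games_per_day: float, days: int) -> float:
    """One-sigma error band (in percentage points) of a win rate averaged
    over `days` days, with `games_per_day` hero games recorded each day.
    Assumes a per-game standard deviation of 1/2."""
    return 100 * 0.5 / math.sqrt(games_per_day * days)

# ~4% pick rate * 6 team slots * ~8,000 Diamond games/day (post's estimates)
diamond_games = 0.04 * 6 * 8_000

print(round(band_pct(diamond_games, 1), 2))   # 1.14  (daily)
print(round(band_pct(diamond_games, 7), 2))   # 0.43  (weekly)
print(round(band_pct(diamond_games, 30), 2))  # 0.21  (monthly)
```

Averaging over k days multiplies the game count by k, so the band shrinks by a factor of sqrt(k) - sqrt(7) for a week, sqrt(30) for a month.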

In Masters and GM, you really shouldn’t pay attention to daily numbers at all - only to weekly and monthly averages, especially for characters with low pick rates. Mercy’s pick rate in GM is about 2%, which means Overbuff only measures about 96 Mercy games a day there. That gives only a 68% confidence of being within 5.1% of the correct number (or an 84% chance of not being more than 5.1% too low). That’s really bad. However, if you average over a week then you’re 84% confident you’re not more than 1.9% too low, and if you average over a month then you’re 84% confident you’re not more than 0.93% too low. Similarly, in Masters, if you average over a month you’re 84% confident that you’re not more than 0.53% too low. Note: you have to be careful when averaging over a month. The days with higher pick rates should count more than the days with lower pick rates, because they represent more games. And if this level of confidence isn’t enough? Wait another month and average over both. Doubling the number of games shrinks your 84% and 98% error bounds by a factor of about sqrt(2) = 1.41.

The numbers given here are just guidelines. They’re based on estimates and approximations, but the math behind them is very real, and they’re good estimates. There are other places to go from here that can increase the confidence levels even further. For instance, if the character you’re interested in spends more days below a certain number than above it, that increases your confidence that the number is low. Or, if the character you’re interested in spends most of their days very close to a certain number, that increases the likelihood that this number is close to the correct average.

I hope this helps people interpret data in future discussions (and not only Mercy based ones).

Note: Special thanks to Helel for pointing out some calculational errors. The numbers are now corrected.

That was a hefty read, but as a mathematician, and just overall someone who approves of logic and data… :+1:

no way in hell im gonna read that

sorry

It’s nice to meet a fellow mathematician here. What field are you in? I’m a low dimensional topologist/geometric group theorist.

Hey, someone who actually knows how statistics freaking works. Now there are two of us! Two of us on the whole forum.

Edit: make that three. I’m shocked. All 3 of us are here now.

Lol. Worst party ever.

Secondary Math Education, with a preference for teaching Trig and Precalc.

Oh wow, I don’t envy you - that’s really hard work. I have kids that age, and the thought of discussing math with their friends is daunting.

Though to be fair, I am leaving to go back to college for sound and audio design.

As much as I love math and teaching it… that’s about 30% of what teachers do, and I unfortunately can’t handle the other 70% of baggage. I still tutor, though.

If it makes you feel any better, some kids are different. I wasn’t, but I had a friend who had a genius level intellect. In high school, sometimes I would go over to his place for dinner. Dinner time conversations were when he would mathematically discuss wormhole theory with his father. With numbers and x’s. It boggles my mind to this day. It’s my understanding that he went on to be a scientist.

Yeah, I hear that. My brother is a teacher and my parents used to be. I can see what they have to go through, and there’s just so much non-teaching work. I think it’s been getting worse over time.

Although, isn’t it common for people with lots of Mercy hours to stay private because they’re asked too often to only play Mercy?

That’s a good point - It’s not a priori impossible for private profiles to be correlated with skill at Mercy. I was worried about something like this myself, so I checked the Overbuff data. For the first few days of private profiles, on every single character I checked, the win rates went up by a tiny bit. But then they quickly settled back down to their averages. The data gets a bit noisier, but that’s it.

My guess is that early adopters of public profiles wanted to show off. After a few days, enough people switched it on for the stats to settle back to normal.

It’s like I’m reading an exam question :disappointed_relieved:

Eh, it would be interesting to try ANOVA to test how strongly the buff/nerf introduction correlates with pick rates. I think many people just abandoned her without even trying to play her, because they were never Mercy mains - the “bandwagon effect”.

And it doesn’t only affect Mercy. I think there were a lot of Ana players with good win rates before her buff, and honestly the buff didn’t change her at all. It’s the same with Zen and Lucio.

I think people are just tired of playing Mercy, or of being forced to play her.

While this is possible, I don’t think that’s what we’re seeing. The lower win rate preceded and accompanied the drop in pick rate. This is different from what happened with her January 29th nerf: the pick rate dropped, but the win rate didn’t, and eventually the pick rate rose back up. Right now, a similar recovery seems incredibly unlikely given the low win rates.

I think the obvious explanation here is the correct one: hero gets heavily nerfed. Hero is no longer viable. Pick and win rates drop accordingly.

Still, it would be an interesting thing to check. I think there’s a lot of room for interesting statistics-based analysis of Overwatch balance.

Great work! :ok_hand:

Math major here, too

Thanks! What kind of math are you interested in?

Great work, OP. Just beautiful.

It’s true that on a bell curve about 68% of values are within one standard deviation of the mean, but how is the standard deviation of the mean of n coin flips 1/sqrt(2)?

When X is a Bernoulli-distributed random variable with success probability p = 0.5 (as in a Mercy win/loss game), its variance is p(1-p) = 0.25. Then the sample average of n trials should converge to N(p, p(1-p)/n), no?

Am I missing something?
