Why Match Quality in Overwatch is Frequently Poor

Too long, didn’t read:

  • Trying to measure individual players’ skill through solo-queue 6v6 matches is a really hard problem.
  • Through detailed simulations it is possible to quantify match quality given a specific system and population of players.
  • For realistic simulations, match quality is frequently poor.

Introduction
It is a truth that is generally acknowledged that solo-queue competitive Overwatch can be a frustrating experience, and the source of that frustration is often focused on the matchmaker. Previously (1) I’ve described many sources of difficulty with the matchmaker, but that discussion is very qualitative. I will now make a quantitative description, with detailed simulations. This allows a deeper insight into what is essentially a measurement problem: how skilled is each player, and what is the quality of matches that can be produced with these skill measurements?

These simulations are run using Microsoft TrueSkill, version 1 as it is well documented (2,3,4,5,6), designed to work with teams, and there is an open-source implementation available (7,8). Overwatch’s competitive system is likely, under the hood, very similar to TrueSkill, but Overwatch’s documentation is much worse. Overwatch certainly has to deal with many of the same problems. The simulations will go in order from simplistic and optimistic to detailed and realistic. In the process we will discover that in six-vs-six solo-queue matchmaking it is difficult to give the players a positive experience.

Parameters

Players: The more players, the more accurate the simulation (unless you are trying to simulate what happens in ranks/regions/platforms/days/times with few players), and the longer it takes to run.

Players Per Team: The more players per team, the harder it is to determine the contribution of each player.

Beta: A measure of how large a skill level is. For a beta of 10, if player A is ranked 10 above player B, player A will beat player B about 75% of the time. For Overwatch, a beta of 100 feels appropriate to me, which corresponds to a team of 3100 beating a team of 3000 about 75% of the time. A more precise number would require data from Blizzard. Beta is naturally larger for games that have more random variables (such as card games), and larger for games that have less skill factors (tic-tac-toe > connect four > chess > go (9)).

Mu: The value of persons rating, analogous to MMR in Overwatch. I start it at 2500 for each player, to keep things simple (even though for Overwatch it is really about 2350 (10)). The range is approximately 1-5000.

Sigma: The uncertainty in the measurement of mu. It starts off very large (833), because a new player has unknown ability. It shrinks with games played.

Tau: About how much an established (not new) player can change his mu in a fair match. For Overwatch, this is set to 24 (11).

Skill: While mu is the measurement of a person’s skill, skill is the true value, and is used to calculate the winner of each match in the simulation.

Inconsistency: Higher inconsistency means that a player is less reliable in their results. This parameter helps calculate the winner of each match.

Matches per Player per Iteration: The number of matches per player the simulator generates each iteration of the simulation. If this is 0.01, that means that 1% of the players (chosen randomly) play a match each round.

Smurf Rate: How often players reroll into new accounts. If this is 0.001, a player will reroll, on average, once every 1000 games. Rerolls are done randomly, so individual players can reroll more or less than this.

Estimated Match Quality: Before each prospective match, estimated match quality is calculated. Matches have higher quality if the mus of the two teams are close, and the sigmas are small compared to beta. Essentially it is statement of how confident the system is that the match will be fair. It will not be fair if the mus are very different. It may not be fair if the sigmas are high (because we don’t really know the mus well).

True Match Quality: Before each prospective match, true match quality is calculated. We can only do this because this is a simulation and we know players’ true skill and inconsistency. Matches have higher quality if the skills of the two teams are close, and the inconsistencies are small compared to beta. Essentially it is statement of how balanced the resulting match is expected to be. It will not be fair if the skills are very different. If player’s inconsistencies are high, then a match will be less likely to be balanced because if one or more inconsistent players have an especially good or a bad game it will end poorly.

Description of the Simulation

The code is available (12). It is written in Python, and all of the needed libraries, including the TrueSkill implementation, are available in pip (Python’s package management system). To keep things somewhat friendly, I will not be putting equations in this article. See the code and the references for mathematical details. The procedure of the simulation is outlined below:

  1. Generate skill and inconsistency for each player by randomly pulling from a specified distribution. Sort the players by skill and assign a player number. Player number, skill, and inconsistency remain fixed for the duration of the simulation. Each player starts with the default mu and sigma. Mu and sigma update as games are played.
  2. For each iteration:
  3. Pull players from the population
  4. Sort players by mu
  5. Divide the sorted players into groups of sufficient size to fill a match (12 for 6v6)
  6. Divide each group randomly into two teams
  7. For each match, if random[0,1) <= win probability, A wins, else B wins.
  8. Updated mus and sigmas with the TrueSkill algorithm
  9. Reset the mu and sigma on specified number of players to simulate smurfing
  10. Go to step 2.
Animated Figure Description

I’ll be showing a lot of animated figures, but they all have the same format, so I’ll describe what is going on once in detail. Click any of the movies below to see what I am talking about. On the left, each player’s mu and sigma is shown, for each game. (If you ever forget what a parameter is, such as mu and sigma, look in the parameters section.) In the beginning, the system has no idea what anyone’s rank is, so their mu is 2500, and their sigma is high (833). As time passes, better players win, and worse players lose, and the players get pushed to their appropriate rank, and sigma falls as the system has a better estimate of their ability. The S-curve in ranks at the end of the movie is a bell curve, but plotted on axes that are a bit different it normally is. During the movie, the precision in the rank measurement can be read off by seeing how thick the S-curve is.

The top right plot shows both the estimated (by the TrueSkill algorithm) and the true (as determined by knowing each players true skill and inconsistency) match qualities. The name of the TrueSkill ranking system is unfortunate, but I’ll capitalize and make it one word when I’m referring to the ranking system, as opposed to the parameters in the simulation. The solid lines are the median match quality The dotted lines are +/- 34.1% from media, which would be one sigma if the match quality distributions were normal (they are not). These qualities start off low, because match making is essentially random, rise as players’ skills are learned and players are sorted, and then level off as the algorithm maxes itself out.

The bottom right plot shows mu over time for five players, chosen at percentiles 10%, 30%, 50%, 70%, and 90%. For 12000 players (as is normal for this article) these are player numbers 1200, 3600, 6000, 8400, and 10800. They start off at the same rank (2500) and then filter out in to their correct locations, with some random wandering around their true rank as they gain and lose games.

Simulation #1: Simplified, perhaps past the point of usefulness

Players: 12000, Players Per Team: 1, Beta: 1, Tau: 1, Matches Per Player Per Iteration: 1, Skill: norm(2500.0,833.3), Inconsistency: 1.0, Smurf Rate: 0.0
The player distribution is shown below.

Skill is distributed in a bell curve, and inconsistency is the same for all players at 1 (extremely low). Beta is also 1, meaning that a player of skill 2501 will beat a player of skill 2500 75% of the time. The ranking over time is shown below. Click through to see the movie. Playing while embedded isn’t working because of a forum bug.

A few things jump out, most players are sorted extremely precisely relatively quickly (45 games), some take much longer. This is because tau is also 1, which means that for players with many games, mu only moves about 1 per game. If a player gets unlucky with his opponents in his early games, it can result in taking up to 350 games before his rank is correct. This case kinda breaks the TrueSkill system, but it provides a good reference point, as it shows that it is possible (assuming it was possible to create a game with 5000 skill levels) to rank people extremely precisely. Match quality isn’t particularly good, again because of how small the skill levels are. A huge fraction of the early games will be stomps, and stomps will still be common later, because it doesn’t take much skill differential for one player to totally dominate another.

Simulation #2: Realistic number of skill levels

Players: 12000, Players Per Team: 1, Beta: 100, Tau: 1, Matches Per Player Per Iteration: 1, Skill: norm(2500.0,833.3), Inconsistency: 1.0, Smurf Rate: 0.0

Here, I set beta to 100, to be more realistic. In this case, a player (or team) of 2600, will beat a player (or team) of 2500 75% of the time. This feels about right for Overwatch.

Rating accuracy is still quite tight (but not quite so tight as before), match quality is much better (maxing very close to 1), and there are no silly cases of people taking forever to get placed accurately. However, it takes a bit longer (about 75 games) for most people to get placed. The chief reason that it takes so long for people to get placed is that tau is so low. Once people have played a fair number of games, their rank starts moving very slowly.

Simulation #3: More rapid rank motion

Players: 12000, Players Per Team: 1, Beta: 100, Tau: 24, Matches Per Player Per Iteration: 1, Skill: norm(2500.0,833.3), Inconsistency: 1.0, Smurf Rate: 0.0

I’ve changed tau to 24, to allow more rapid motion in the ranks. This allows people to rank faster (about 40 games), but match quality and rank accuracy has taken a definite hit. Ranks are now accurate to about +/- 150, as opposed to +/- 25 before.

Simulation #4: Inconsistent players

Players: 12000, Players Per Team: 1, Beta: 100, Tau: 24, Matches Per Player Per Iteration: 1, Skill: norm(2500.0,833.3), Inconsistency: 100.0, Smurf Rate: 0.0

Up till now, inconsistency has been set to 1, which means that players’ performance only varies about +/- 3 per game. We know that for Overwatch this is not true, for many reasons. An inconsistency of 100 (for a full range variation of about +/- 300 is much more realistic).

Here is the new player distribution plot and simulation:

Increasing inconsistnecy mainly causes a significant degradation in true match quality, from around 0.95 to 0.65. This is to be expected as individual players perform significantly better or worse in given matches, resulting in more stomps.

Simulation #5: Six players per team

Players: 12000, Players Per Team: 6, Beta: 100, Tau: 24, Matches Per Player Per Iteration: 1, Skill: norm(2500.0,833.3), Inconsistency: 100.0, Smurf Rate: 0.0

Obviously, it isn’t Overwatch if there is only one player per team. So let’s fix that, and go to six players per team.

This is a pretty massive hit in every metric. Ranking precision is now about +/- 300. It takes about 150 games to rank everyone. True match quality is about 0.55. This is an underlying, and very important problem with ranking solo-queue Overwatch. The amount of data required to rank individuals in teams is much higher than individuals alone. More individuals on a team also makes it much more likely that something will go wrong, making the match quality poor.

As an aside, taking 150 games to rank everyone, and the “early season” poor match quality, is why it would be a horrible idea to have data resets. A reset would destroy match quality for months (and permanently if there was a reset every two months).

Simulation #6: Realistic number of matches per iteration

Players: 12000, Players Per Team: 6, Beta: 100, Tau: 24, Matches Per Player Per Iteration: 0.01, Skill: norm(2500.0,833.3), Inconsistency: 100.0, Smurf Rate: 0.0

So far, for each iteration of the simulation, every player has played a game. This is not realistic for a game where people can log on, play, and log off whenever. For this next simulation, only 1% of the players play a match each iteration. Those players are chosen randomly, so some players will (by random chance) go many iterations without playing a match.

This actually doesn’t change much, except a small decrease in ranking precision (from +/- 300 to +/- 375).

Simulation #7: Smurfs

Players: 12000, Players Per Team: 6, Beta: 100, Tau: 24, Matches Per Player Per Iteration: 0.01, Skill: norm(2500.0,833.3), Inconsistency: 100.0, Smurf Rate: 0.001

As we all know, the game has smurfs as well. Here, I randomly reset players’ ratings, at the rate such on average a player will reset his rating once every thousand games. It’s random though, so some players will reset more, and some less. After resetting, the player plays normally, in a way that is not bannable (13).

There are small decreases in the average experiences of players. The main difference is that there are many outliers in mu, corresponding to players that have recently reset. It takes 20 – 100 games for players to get back to their original rank. If a player has particularly bad luck, it conceivably could take longer (like it did for me that one time), but most accounts seem to recover relatively quickly. Examples of reset players are easiest to see in the rank over time plot in the bottom right.

Simulation #8: Trolls

Players: 12000, Players Per Team: 6, Beta: 100, Tau: 24, Matches Per Player Per Iteration: 0.01, Skill: norm(2500.0,833.3), Inconsistency: lognorm(0.4,0,120.0), Smurf Rate: 0.001

One last thing:

It isn’t really true that all players have the same inconsistency. At the low end, we have players that play a small number of heroes and never play when tired or tilted. In the middle ranges, we have average players who play more heroes and play in a wider range of moods and circumstances. Going higher, we have players who play when tilted, or drunk, like to switch heroes to practice, or whatever. Finally, at the highest end we have the trolls, who can vary anywhere between jumping off the map and totally stomping, depending what toxicity they want to unleash that day.
I model this range of inconsistencies as a log normal distribution, shown below. For each player, I assign a random inconsistency from the distribution.

Below is the simulation.

Things didn’t actually get that much worse, except for a tick down in true match quality (which is to be expected).

However, let’s take a moment to talk about this simulation as a whole, the most accurate one that I will generate in this article. Frankly, things are pretty bad. Less than half of matches have a quality of above 0.5, which is common cutoff for acceptable matches. Many matches have quality below 0.2, which are expected to be garbage matches. Rank accuracy is only +/- 400 for established, mid-tier accounts. If we reset everything, it would take about 150 games to get back to this not great solution.

Summary Table

The numbers in the last four columns are imprecise. I estimate them by looking at the charts.

Simulation Description Players Players Per Team Matches Per Player Iteration Beta Tau Inconsistency Smurf Rate Ranking Precision Match Quality Stabilizes Final Match Quality Value True Match Quality
1 Simplified, perhaps past the point of usefulness 12000 1 1 1 1 1 0 ± 3 45/350 0.5 0.62
2 Realistic number of skill levels 12000 1 1 100 1 1 0 ± 25 75/300 0.97 0.98
3 More rapid rank motion 12000 1 1 100 24 1 0 ± 150 40 0.8 0.95
4 Inconsistent players 12000 1 1 100 24 100 0 ± 250 40 0.8 0.65
5 Six players per team 12000 6 1 100 24 100 0 ± 300 150 0.55 0.6
6 Realistic number of matches per iteration 12000 6 0.01 100 24 100 0 ± 375 150 0.55 0.6
7 Smurfs 12000 6 0.01 100 24 100 0.001 ± 400/1000 150 0.5 0.58
8 Trolls 12000 6 0.01 100 24 lognorm(0.4,0,120.0) 0.001 ± 400/1000 150 0.5 0.48
Possible Improvements

TrueSkill version 1 does not include performance modifiers, so neither did I. If performance modifiers are assumed to work well, then all metrics would improve. However, that is a difficult assumption to make, especially in a game like Overwatch, where performance can be difficult to quantify and performance modifiers can influence behavior away from team play and objectives to stat farming.

TrueSkill version 2 does include performance modifiers, as well as some other possible improvements (6). I may model TrueSkill version 2 at some point in the future. However, there is no open source implementation of TrueSkill 2 (it is rather new), so I’d have to implement it myself.

Other possible mechanical improvements exist, such as looking at the record for the last N games, rather than the last 1 game to determine new ranks.

Looking For Group helps reduced randomness in matchmaking, which improves matchmaking quality. Further features along these lines could improve things even further.

Continuing in this theme, if Blizzard implemented a six-stack queue, where teams rather than players have SR, then the matchmaker would behave more like 1v1 than 6v6. The simulations above show how much better the system behaves in 1v1 mode. As a fall back, though, teams in the six-stack queue would may have to be matched against people in the regular queue when there aren’t enough teams in the six-stack queue.

Other possibilities include a clan system, or weekly or monthly tournaments for players that aren’t good enough for open division to be a positive experience.

Blizzard could do better with banning toxic personalities (throwers, boosters, ragers, hackers, etc.). With throwers and boosters, machine learning can help, but at the end of the day, all semi-credible reports should be investigated, and the investigator should be able to watch a replay of the game(s) from anyone’s perspective. Blizzard should hardware ban serial offenders.

Individual players can improve their matchmaking experience by using LFG or by playing with a regular group of people. They can also not playing while drunk, tilted, tired, etc.

References

(1) How Competitive Skill Rating Works (Season 13) -> Matchmaking -> If the match-maker says most games are fair, then why are there so many stomps?
(2) https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/
(3) http://www.moserware.com/2010/03/computing-your-skill.html
(4) http://www.moserware.com/assets/computing-your-skill/The%20Math%20Behind%20TrueSkill.pdf
(5) https://www.microsoft.com/en-us/research/wp-content/uploads/2007/01/NIPS2006_0688.pdf
(6) https://www.microsoft.com/en-us/research/uploads/prod/2018/03/trueskill2.pdf
(7) https://trueskill.org/
(8) https://github.com/sublee/trueskill
(9) https://senseis.xmp.net/?HermanHiddema/GoAndChessAComparison -> Ratings
(10) Initial Competitive Skill Rating, Decrypted
(11) 3000+ Skill Rating Data and Analysis (now including DCs)
(12) https://pastebin.com/raw/xVJ8vg9V
(13) Serious question: Is smurfing cheating?

3 Likes

Tl;dr because your mmr is low, play heroes you actually get value out of…

When I run LFGs I put “must have 80 hours of comp over the last 4 seasons” in the title. This is to ensure that their SR is relatively accurate.

I will now put “must have played at least 200 recent games”

thank kaawumba

I think most people want a six stack Como, I’ve got a concept here [(6 stack clan queue concept)]

Your Tldr is still not clear enough? I mean what are you looking for? Matches where your teammates carry you? Or easier wins? Enemy team completely runs yours over?

Either way you put it, at the end of the day you gotta carry your team. If you can’t win those matches then you belong there.

i’ve spent thousands on their merchandise just for stupid poor matchups, show some loyalty pls blizzard

i just had two guys in a match i played just then and i know they were pros, then i had to match up against them repeatedly, really? the game can’t monitor POTG and massive kills and how fast the ult they generate as DATA to separate the two so they are on SEPARATE teams not together all the time?? eventually i stop getting them because they probably got bored or felt sorry for us (balance skill levels better pls)

seriously stop making new characters and maps and:

1 remove “waiting for players” the less inferior version of overwatch ‘paladins’ has a better system, a game DOESN’T start until its found a game, THEN everyone has chosen their character and ready. I turned off skirmish for a reason ok. If it goes past a certain time mark fill the rest with AI bots and start the damn match

2. balance match up please, enough with enemy team full of silvers on high level even gold, and i’m paired with a bunch of bronze that haven’t got any stars yet. Yes stars colour doesn’t always dictate skill level but its got to AT LEAST determine some basic understanding of the game, i had one at level 3 or 5 bronze NO stars, the one time i WISH they were smurfs but they weren’t. We got totally obliterated

3. When some one leaves detect connection problems so unintentional leavers don’t get punished and those that deliberately leave in the middle/end of game get punished. Also the moment the people leave AI bots replace them at the appropriate skill level (which i heard PALADINS does, the game should also be able to scan profiles and use the highest level enemy team member to determine AI bot difficulty to balance the team with the leavers. I had one match with 3 leavers and instead of feeling compassion total spawn camped us

4. ad higher benefits in ‘quick play’ i personally don’t play competitive and can’t handle the stress, i just want to play for fun, and idiots throw games on purpose by hiding and emoting and have the attitude “its just QP” whocares? (idiotic reasoning, you don’t get real money/wealth/ even trivial stuff like being on leaderboard, are you no.1? if not then logic dictates competitive is just as pointless as QP