Streaks in Overwatch: Simulation, Analysis, and Data

Too long, didn’t read

If your SR over time looks like “Skill Rating vs Game, CDF Statistics” at Win/Loss Simulation and Data - Google Sheets, that does not mean that Blizzard is using broken matchmaking. Random variables, on the other hand, are not your friend.

This is a repost from the legacy forums, as they were shut down, and this post still applies. Since then, I created a much more expansive simulation, posted at https://www.reddit.com/r/OverwatchUniversity/comments/aatezy/why_match_quality_is_frequently_poor/.

Introduction

There have been a great many posts lately implying that because there are win/loss streaks, and win/loss streak reversals, there must be some sort of illicit Blizzard meddling going on. In general, these posts show a lack of understanding of how random numbers work, as links such as https://wizardofodds.com/image/ask-the-wizard/streaks.pdf show. However, that link and others are not explicitly focused on what goes on in Overwatch, so I did a more relevant simulation. The graphs are at Win/Loss Simulation and Data - Google Sheets and the text description is below.

Coin-flip statistics

First, consider a case where a player has a 50% chance of winning and a 50% chance of losing each game, and has played 1000 games (starting at 2500, +25 for a win, -25 for a loss). This is coin-flip statistics. The player’s SR over time will look like the plot “Skill Rating vs Game, Coin-flip Statistics”. Note that even though the player’s starting rating is 2500, it dives all the way down to 1200 before working its way up. If we attempt to ascribe narrative to this, we would say that there was a huge weight on the player for ~500 games, which was then somewhat lifted. However, since we can look at the code, we can verify that there was no such effect. It was only random coin-flip chance. We also know that there is really no such thing as “true rating” in this model, because each game is independent of the past. Running the simulation many times leads to random final SRs.
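For anyone who wants to try the coin-flip case without Matlab, here is a minimal Python sketch of the setup described above (start at 2500, +25 for a win, -25 for a loss, 1000 games). It is not the original win_loss_simulation.m; the function name, the parameter names, and the seeds are my own assumptions for illustration.

```python
import random

def simulate_coin_flip(n_games=1000, start_sr=2500, step=25, seed=None):
    """Return the SR before game 1 and after every game (n_games + 1 values).

    Each game is won with probability 0.5, independent of everything else.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    sr = start_sr
    history = [sr]
    for _ in range(n_games):
        sr += step if rng.random() < 0.5 else -step
        history.append(sr)
    return history

# Final SR of a few independent runs; the values differ from seed to seed.
finals = [simulate_coin_flip(seed=s)[-1] for s in range(5)]
print(finals)
```

Plotting one `history` list against the game number gives the same kind of wandering curve as the coin-flip plot, although any particular seed will not reproduce the exact run shown in the sheet.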

Not only does the overall trend line go through large swings, but there are also many streaks, as “Streak Frequency, Coin-Flip Statistics” shows. Streaks of up to 6 are common. Streaks of up to 15 occurred in this simulation.
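Counting streaks is simple enough to sketch. The snippet below is an illustration under my own assumptions (a streak is a maximal run of identical results, wins and losses are counted together, and the seed is arbitrary), not the analysis script used for the plots.

```python
import random
from collections import Counter
from itertools import groupby

def streak_lengths(results):
    """Lengths of maximal runs of identical results, in order of occurrence."""
    return [len(list(run)) for _, run in groupby(results)]

rng = random.Random(0)  # arbitrary seed, for repeatability only
results = [rng.random() < 0.5 for _ in range(1000)]  # True = win

# Map streak length -> how many streaks of that length occurred.
frequency = Counter(streak_lengths(results))
for length in sorted(frequency):
    print(length, frequency[length])
```

For a fair coin, roughly half of all streaks should have length 1, a quarter length 2, and so on, so long streaks are rare individually but expected to show up somewhere in 1000 games.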

The final plot shows the autocorrelation function, which measures how strongly the result of game 0 is correlated with the results of games in the past and the future. As determined by the model, we get the expected result: There is no influence on future or past games based on the current game. (If a win at game 0 guaranteed that you lost game 1, then there would be a spike to -1 at x=1.)
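A rough way to compute this kind of curve is the sample autocorrelation of the win/loss sequence. The sketch below assumes wins are coded as +1 and losses as -1 and uses one common estimator; the original Matlab analysis may normalize differently. Only non-negative lags are computed, since the autocorrelation is symmetric in the lag.

```python
import random

def autocorrelation(results, max_lag=10):
    """Sample autocorrelation of a win/loss sequence for lags 0..max_lag.

    Wins are coded as +1 and losses as -1. Assumes the sequence contains
    at least one win and one loss (otherwise the variance is zero).
    """
    x = [1.0 if won else -1.0 for won in results]
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    variance_sum = sum(d * d for d in dev)
    return [
        sum(dev[i] * dev[i + lag] for i in range(n - lag)) / variance_sum
        for lag in range(max_lag + 1)
    ]

rng = random.Random(0)  # arbitrary seed
results = [rng.random() < 0.5 for _ in range(1000)]
acf = autocorrelation(results)
print([round(a, 3) for a in acf])
```

Lag 0 is always 1 by construction. For independent games the other lags should scatter around zero, with noise on the order of 1/sqrt(number of games). A strictly alternating win/loss sequence gives a value close to -1 at lag 1, which is the spike described above.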

Cumulative normal distribution function statistics

Of course, it isn’t really coin-flip statistics. As a player goes up in rating, wins become harder. As he goes down in rating, wins become easier. I did a second simulation, in which win probability is determined by a cumulative normal distribution (mu = 2500, sigma = 500), as shown in “Win Probability vs Skill Rating, CDF Statistics”. Put simply, if the player’s rating is 2500, his win rate is 50/50. If the player’s rating is 5000, his win rate is effectively zero, and if the player’s rating is 0, his win rate is effectively 100%. There is a smooth s-curve between 0% and 100%. This modification fixes the problem of random SR drift, and the trend over time averages 2500. However, there still are large trends down and up. From game 425 to 575, the player gains more than 400 SR. He then falls all the way back down in about 125 games. Throughout, there are many large and small swings. However, the underlying math has never changed, and his “true rating” remains 2500. Streaks without interruptions are slightly shorter, but still frequent and long, as the plot “Streak Frequency, CDF statistics” shows.
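Here is a hedged Python sketch of the CDF model, again not the original Matlab code. It assumes the win probability is one minus the normal CDF of the current rating (mu = 2500, sigma = 500), and that the +25/-25 step from the coin-flip case carries over; the function names and the seed are made up for illustration.

```python
import math
import random

def win_probability(sr, mu=2500.0, sigma=500.0):
    """Chance of winning at rating sr: 1 minus the normal CDF of sr."""
    z = (sr - mu) / sigma
    normal_cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - normal_cdf

def simulate_cdf(n_games=1000, start_sr=2500, step=25, seed=None):
    """SR history where each game's win chance depends on the current SR."""
    rng = random.Random(seed)
    sr = start_sr
    history = [sr]
    for _ in range(n_games):
        sr += step if rng.random() < win_probability(sr) else -step
        history.append(sr)
    return history

for rating in (0, 2000, 2500, 3000, 5000):
    print(rating, win_probability(rating))

history = simulate_cdf(seed=1)  # arbitrary seed
print(min(history), max(history), history[-1])
```

The rating is now pulled back toward 2500 whenever it wanders off, so it no longer drifts without bound. How large the swings get depends on the step size and sigma, so this sketch should not be expected to reproduce the exact run in the plot.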

Here the autocorrelation looks basically the same, indicating no correlation between games. In fact, though, if I run a simulation with 100000 games instead of 1000, there is a very slight negative correlation. This corresponds to the increased difficulty in winning as rating goes up (and decreased difficulty in winning as rating goes down).

Real Game Data

Thanks to Porkypine and Des, I was able to analyze the game data at OW ELO hell - Google Sheets and compare it to my model. This is shown in the third and fourth rows of charts. All of the charts look the same as the CDF plots, within the limits given by the error bars. Win Probability vs Skill Rating is consistent with a cumulative distribution (or a straight line, for that matter). Skill Rating vs Game shows the same sort of motions. Streak frequency has the same fall-off. The win/loss autocorrelation function is one for the same game and zero (within the errors of the measurement) for other games. This means that win probability is not based on past games, and Blizzard is not taking past games (including streaks) into account when matchmaking in any measurable way, apart from wins leading to a slight difficulty increase and losses leading to a slight difficulty decrease because of the rating change.

If anyone has more data that they would like me to look at for weirdness or otherwise, I would be happy to do so.

If anyone would like to play with the code, it is at win_loss_simulation.m - Pastebin.com and win_loss_analysis.m - Pastebin.com
You will need Matlab with the Statistics Toolbox and Curve Fitting Toolbox to run it.

Very interesting. How would you factor in the precision and validity of ranks? By that I mean that rank comes from concrete situations where you win or lose. Whether you win or lose also depends a lot on your actions. When your teammates don’t play correctly, you can still usually do something that wins the game anyway. My description won’t be completely on point, but let’s try the idea. Say the player with the higher accuracy wins a 1v1 100% of the time. We have 5 Soldiers with accuracies of 20%, 25%, 30%, 35%, and 40%. When they are on a team of 6 players, they win according to their contribution and naturally end up with different SR. Let’s expand the idea. It’s not only about accuracy but about how many different situations you win. You get the idea: one player wins 5 situations overall per game, another 10, another 15, and so on. Long term, you are placed according to your overall playing quality as well as that of the 5 other players on the team. As I said, it’s rough but hopefully somewhat understandable. Overwatch itself is highly deterministic, unlike games like poker where probabilities are part of the strategy itself. The reason there are still fluctuations in SR is the human condition: how fit you are at a given moment, and the fact that you always win one specific situation and always lose another, and which situation comes up is random. In reality, we would need to be a lot more detailed about what exactly about Overwatch is random.

I definitely like the wisdom that you can’t influence everything in a match. I could’ve written something more but I’ll leave it for now. Hope you see it after all the time lol.

I think the other side did not understand how matchmaking and the actual game market work.
Apart from the field of computer security, there is no profitable reason for a game publisher to use features that use random numbers.

They profit from nondeterministic algs using prng all the time? Imagine not having the word random anywhere in your code. You can do it, but how profitable will that game be vs. one that has leveraged prng?

In terms of matchmaking: personally I’d rather take my chances as a mix of randoms vs. a mix of randoms. It’s fast, clean, and fair. Rigged games might be fun for some but finding a baseline without a reset is kind of a joke.

I think you’re getting a little confused here.
Non-deterministic algorithms and pseudo-randomness are different things to begin with.
To have something non-deterministic, you need a property from nature that changes unpredictably over time. Otherwise, the result can be determined very well in advance.

I would admit that the matchmaker has a pseudo-random component, but that this is very much limited by further specifications given by Blizzard.
This would also explain why the resulting effects only now become more apparent as the number of players decreases and the issue becomes more prominent.

I realize this, but at a working level non-deterministic algs use prng seeds, which is why I said prng, not rng.

Okay, and I’m just saying that what you call seed in a random number generator could be very precise specifications from Blizzard.

At this point you can use statistics, like the thread creator did, but in my opinion this doesn’t lead to anything. To keep it short, only Blizzard can say with certainty how matches are formed and where randomness plays a role.

We have neither the amount of data nor the possibility to do independent experiments. The matchmaking could very well be tuned to look random and only intervene when exceptions occur, and an exception could be just about anything.

I have thought a lot about why a matchmaking system cannot be completely transparent in the way it works.
And I can only think of one good reason why the way the Matchmaker works must remain secret:
:moneybag: <-- This is also one of the reasons why no game developer wants to have randomness in the game that affects the gaming experience.

Idk man. What if there is something like if you block enough earth shatters you are put into a forced win lobby next game against boosted boys.

I would prefer if this type of feature was obscured from playerbase as it detracts from objective of the game.

I don’t think they literally meant to say that there’s an actual RNG algorithm in the matchmaker’s code that’s deciding the outcome of games. OP’s point was more along the lines that even when things are properly randomized, you can still get “severe” trends that might not initially suggest things are fair, despite things being completely fine under the hood.

When talking about the matchmaker, the “rng” is in the actual players. The code itself doesn’t do anything to skew things one way or the other; it just tries to find an even match and lets the players be the ones to determine the true outcome. Point is, even if things are perfectly balanced, streaks are a very real and common occurrence, so anyone pointing to them as “evidence of things being rigged” doesn’t really know what they’re talking about.

In my opinion, such questions should not even arise in a reputable e-sports game.
When it comes to creating fair matches, there shouldn’t be anything under the hood that a player is not allowed to know about.

This is a good example! I don’t think that something like that would distract from the actual objective of the game.
The players would be much more effective in achieving the goal.

It’s not as if the secret criteria used to evaluate a player’s performance were made up out of thin air. If all players meet the secret criteria as well as possible, the quality of the games will increase.
After all, these criteria were formed by observing games.

Again, there is SO MUCH speculation here. You’re making the same unfounded, completely unprovable presumptions you made in the other thread, while at the same time being incredibly confident, for some reason, that they’re all correct. I don’t get it.

Sorry, just to be clear here, are you actually making a claim that there is no “illicit meddling” going on? You are not just making a claim that you don’t believe there’s illicit meddling, but that there is in fact, provably, incontrovertibly no illicit meddling?

A player that is 100 SR above another player is usually a better player. A player that is 500 SR above another player is almost always better (assuming sufficient games played).

The mechanics are deterministic, but (at least on ladder) the players are very random. I tend to think of players as the cards that I’ve been dealt. A good player can win with bad cards, but it is harder.

This line is very poorly thought out. If the better player won 100% of the time, the game would be very boring. This is a large part of the appeal of games like hearthstone and poker. A worse player usually has a chance.

Yes

I agree that the matchmaking and rating system should be public.

To be precise, my statement is that the behavior of in-game streaks is consistent with a fair matchmaking system.

This does not technically prohibit the possibility that the system is rigged in such a way as to leave in-game streaks intact, but mess with some other behavior that I did not test.

Gotcha. Fair enough.

Let randomness simply exist where it belongs, in the real world.
Do not introduce pseudo-randomness by a computer program to smooth out statistics. This would be unsportsmanlike.

The better the players are, the fewer random mistakes happen in the game; at the same time, these mistakes become more important.
If the players are actually matched according to their skill, then matches always remain exciting and the outcome is not 100% predictable.

Overwatch doesn’t do this. The main source of randomness is who is put on each team, which can’t be removed on ladder. It depends on whoever happens to be on-line and in queue at any given moment. Removing it would require set teams and schedule, like Overwatch League.

Players are matched according to the game’s best estimate of their skill.

How can you be so sure about that? I hope you understand what I am trying to say.

Hi, I want to thank you for posting this. Actually taking the time to read through some of it literally made me a more well-informed person, not just about Overwatch, but about the way things work in general. There is a wealth of information in the link about “Computing your skill.” It immediately put to rest a lot of my frustration regarding my thoughts/feelings on “I’m improving but I’m still not ranking up!” I now know that it’s all relative. However much I think I’m improving is irrelevant, and only my results tell me whether I’m improving. Heck, I might actually be improving but it may be that others are improving at roughly the same rate. Or I might be improving in some area (if I am actually objectively improving at all) but it’s still not enough to boost my results, or it’s not in a significant enough aspect of the game.

I now have considerably less frustration with Blizzard and the TrueSkill system does seem very elegant except for one problematic thing in my experience: climbing in my platinum account (which is a much newer account by the way) is SO MUCH easier than climbing in gold on a very historied account. I’m sure if I knew enough about the system, the answer to this would also come to make sense.

Anyhow, yes, people who are frustrated with the system or think Blizzard is holding them back should read what you’ve posted here. I’m now converted on this.

To sum it all up: it’s not Blizzard holding you back, it’s people with more skill holding you back. And how much you have to improve to climb is not up to you. It’s up to your competition. And the only way to know whether what you’re doing is useful or not is to play, unfortunately, a lot, and see what your results are.

Kaawumba, on that last point, in terms of the rate of progress up the ladder, is there any way for Blizzard to, in a fair way, improve the speed of climbing so it doesn’t take quite as long as it does? Or does accelerating people’s ascent essentially break the system in some way?

The developers have said many times that players are matched according to the game’s best estimate of their skill. That hasn’t convinced the conspiracy theorists, so I doubt that they would be convinced if the full details of the system were made public.

You are welcome.

Blizzard adds performance modifiers (below diamond) and a small, and difficult to trigger, streak bonus. These help a bit, but if they are turned up too high, they cause problems. Performance modifiers are difficult to get right and can be gameable. Excessive streak bonuses can lead to people getting thrown far from their proper rank through random chance. I don’t know a good way of ranking 6v6 solo queue players quickly.

Extremely interesting stuff. Really sad this didn’t get the attention it deserved when it was posted :confused:
