Thanks to everyone who replied! If you haven’t already, please make sure to take a look at the chart that illustrates the main finding:
https://i.postimg.cc/tgFkDKWT/overwatch2-season3-matchmaker-moving-average.png
(It’s unfortunate that the forums won’t allow embedding an image; having the main plot right in the middle of the text would have looked much better. As it is, the URL gets lost in the rest of the text.)
To the individual points:
This is one possible explanation yes, and I would not dismiss it out of hand, but it leaves two important questions unanswered:
1. Why does the distribution flip between a winning mode and a losing mode with such consistent regularity, i.e. every 25 games?
2. Why does the effect appear even though every cycle spans multiple play sessions?
These are related, but let’s consider (1) first.
It is plausible that players get into a certain behavioral or psychological “groove”, and that the concomitant intensity eventually burns out and leads to a crash. Certainly, there are many variables that might drive player behavior in ways that affect performance, and thereby outcome. The problem is that these effects ought to be local with respect to clock time, not match count. That is to say: if I burn out after an intense winning or losing streak, one would expect to be able to predict the crash after a certain number of minutes or hours, regardless of how many matches were played in that span. Attention definitely fatigues, but it fatigues with time spent, not matches played. The seasonality observed in the time series above, however, is tied to match count, not clock time. It would be extremely strange, to say the least, if I consistently became fatigued or inspired at exactly 25 matches, when the total clock time needed to complete 25 matches varies as much as it does.
Point (2) is even more serious.
Here’s a plot from the dataset, showing how many matches I played per single calendar day:
https://postimg.cc/dk6nk5tP
One thing should immediately jump out: I never played enough matches in a single day to traverse a full 25-game cycle. Most days, I played fewer than six matches; at that rate, a single 25-match cycle spans at least five separate calendar days. If the 25-match effect rests on some kind of player fatigue or intensifying emotional affect, how does that fatigue or affect carry over to a session that usually takes place about 24 hours later? What about the fact that I’ll be eating, sleeping, working, and doing things other than playing Overwatch in between? Never mind that it would usually take multiple such sessions to complete a full cycle. To be honest, it seems far-fetched to propose that I stay so tired and angry between sessions that my losing streak can pick up right where it left off – and then end right on schedule.
Well, you definitely shouldn’t believe everything you hear on the Internet, so I applaud your skepticism. However, I would invite you to return to the original post and read it more closely. The only point that I qualified with “as a professional” was a passing remark about how Blizzard could say “there is no ‘loser’ queue” and how that could be true in a very narrow technical sense. That point is not central to the analysis at all.
The point of the original post is that my complete match data for Season 3 showed cycles of winning and losing that repeated on a stunningly regular cadence, and that such a phenomenon simply should not arise if the matchmaker is working as advertised. The argument proceeds from logical premises and empirical data, and the original data is included in the original post if you’d like to run the numbers yourself. You could believe that I scrub toilets for a living and the facts would still remain.
This would be a sample of N=1 if my hypothesis were that all Overwatch players are experiencing this same periodic matchmaking behavior. Now, I admit that is an interesting question, and I would love to know the answer. I agree there is reason to be skeptical that everyone is seeing this behavior. I’m playing on a rather old account, and a lot of anecdotal evidence suggests that the matchmaker struggles in particular to calibrate MMR for accounts like that. Not everyone is playing an old account, and not everyone is playing from the same point on the MMR distribution, and both of these conditions probably matter quite a bit. While I think there probably are other players experiencing a similar phenomenon, I also think there are others who are not.
To the point: The question under consideration in the original post is, “Given this observed time series, how probable is it that a correctly functioning matchmaker would generate such markedly and regularly bimodal match outcomes?” If you’re familiar with statistical inference, you’ll recognize that this is perfectly analogous to measuring the log-likelihood of a sample with respect to a hypothesized distribution. Of course, analyzing a time series requires different tools than those applied to unordered samples, because points in a time series are generally not independent or identically distributed (iid) – but that doesn’t mean analysis is impossible. In fact, folks do it all the time.
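To make that concrete, here is a minimal Monte Carlo sketch of the null hypothesis: a correctly functioning matchmaker producing iid 50/50 outcomes. The season length, cycle period, and simulation count are my illustrative assumptions, not the exact procedure from the original post – the idea is just that you can build a reference distribution of “how much 25-match cyclicality a fair matchmaker produces by chance” and rank the real series against it.

```python
import numpy as np

rng = np.random.default_rng(42)

def cycle_strength(outcomes, period=25):
    """Periodogram-style power of a win/loss series at the given cycle length."""
    t = np.arange(len(outcomes))
    centered = outcomes - outcomes.mean()
    c = np.sum(centered * np.exp(-2j * np.pi * t / period))
    return abs(c) ** 2 / len(outcomes)

# Null distribution: 2000 simulated "seasons" of 202 fair coin-flip matches
null = np.array([cycle_strength(rng.integers(0, 2, 202).astype(float))
                 for _ in range(2000)])

# For a real, ordered series of outcomes, an empirical p-value would be:
# p = np.mean(null >= cycle_strength(real_series))
```

If the observed series lands far out in the tail of that null distribution, a fair iid matchmaker becomes a very uncomfortable explanation.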
Since you seem interested in the particulars, let me break down the analysis even further. Because the original post was rather long, I worried folks’ eyes would glaze over if I also went into the details of hypothesis testing, but I’m happy to elaborate. So here we go:
I wanted to be sure that I wasn’t imposing imagined patterns onto the series, so I followed the standard approach of de-trending the series and searching for possible seasonality. The original chart is striking – it’s what inspired me to write all this up in the first place, because I was shocked the first time I saw it – but I wanted to make sure the visual intuition was reproducible from impartial quantitative analysis. I de-trended the series by taking a first-order difference, then fit a sine function to the result using scipy.optimize.minimize, parameterized over the frequency. Sure enough, 25 plus or minus a tiny epsilon popped out as the best-fitting period. Assuming that the seasonality here is additive rather than multiplicative, I subtracted the fitted sine from the de-trended series and applied the Augmented Dickey-Fuller (ADF) test to the result, to test the hypothesis that only white noise remains after the trend and seasonality are removed from the original series – i.e. that seasonality explains most of the observed variation once the trend is removed. The resulting ADF statistic comes in at about -2.6, with a p-value of 0.08, which supports the hypothesis at hand – i.e. that the seasonality accounts for most of the non-stationarity observed in the original series.
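For anyone who wants to reproduce the shape of this procedure, here is a minimal sketch with two simplifications: a grid search over candidate periods stands in for the scipy.optimize.minimize fit, and a synthetic series (linear trend plus a known 25-match cycle) stands in for my real data. After this step, the fitted sine would be subtracted and statsmodels’ adfuller() applied to the residual.

```python
import numpy as np

def fit_seasonal_period(series, periods=range(5, 60)):
    """Find the sine period that best fits the first-differenced series."""
    diffed = np.diff(np.asarray(series, dtype=float))  # first-order difference removes trend
    t = np.arange(len(diffed))
    errors = []
    for p in periods:
        # Least-squares fit of intercept + sine + cosine at period p
        X = np.column_stack([np.ones_like(t, dtype=float),
                             np.sin(2 * np.pi * t / p),
                             np.cos(2 * np.pi * t / p)])
        coef, *_ = np.linalg.lstsq(X, diffed, rcond=None)
        errors.append(np.sum((diffed - X @ coef) ** 2))
    return list(periods)[int(np.argmin(errors))]

# Synthetic stand-in: a linear trend plus a built-in 25-match cycle
t = np.arange(202)
synthetic = 0.1 * t + np.sin(2 * np.pi * t / 25)
print(fit_seasonal_period(synthetic))  # → 25
```

On the real outcome data, the same procedure is what produced the “25 plus or minus a tiny epsilon” result described above.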
To the earlier point: this is untrue.
The original hypothesis technically hinges on the question of whether the time series is mostly white noise after the seasonality has been removed, and this in turn can be assessed by testing for stationarity. The ADF test does this, and the numbers show only about an eight percent chance of seeing a statistic this extreme if the underlying series were actually non-stationary. A back-of-the-envelope calculation shows that a sample size of N=202 ought to be enough for 90% confidence with a 5% margin of error.
The data is up there if you’d like to counter with your own analysis.
Thanks for sharing your data – it is relevant, and I would love to see more people sharing theirs, as it gives us better insight into what might be happening.
However, this data doesn’t “contradict” my data, because that’s not how data works. It might contradict a hypothesis or a claim, but by definition data is what it is. It would be interesting to compare your results, but your dataset is missing a key feature necessary to reveal the presence or absence of streaks: You have not included wins and losses in the order that they happened. As such, we can’t run the same moving average on your data as was run above, which means we can’t unambiguously compare the two.
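Here is a minimal sketch of why the ordering matters: a moving average (like the one in the chart above) looks completely different on a streaky sequence than on the same wins and losses in a different order, even though the overall win rate is identical. The window size here is an arbitrary choice for illustration, not the one used in the original chart.

```python
import numpy as np

def moving_average(outcomes, window=5):
    """Moving average of win/loss outcomes (1 = win, 0 = loss)."""
    kernel = np.ones(window) / window
    return np.convolve(outcomes, kernel, mode="valid")

streaky = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # one win streak, one loss streak
mixed   = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # same totals, alternating order

print(moving_average(streaky))  # → [1.  0.8 0.6 0.4 0.2 0. ]
print(moving_average(mixed))    # → [0.6 0.4 0.6 0.4 0.6 0.4]
```

Both sequences have exactly five wins and five losses, but only the ordered series reveals the streak structure – which is why aggregate win/loss totals can’t be compared against the chart above.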
I would also point out that your dataset is considerably smaller than mine. The Support data contains only 52 total matches, which would be only just large enough to maybe see the cycle turn once, if it is in fact present.
At any rate, you certainly can’t combine the roles into a single dataset, because the matchmaker uses different MMRs for different roles – combining them would mix unrelated distributions, and give spurious results.
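A toy illustration (not the real data, and assuming a deliberately exaggerated contrast) of why pooling roles is unsafe: interleave a role whose win probability cycles every 25 matches with a role whose win probability is flat. The pooled series still oscillates, but a least-squares sine fit now finds a 50-match period – an artifact of the mixing, matching neither role’s actual behavior.

```python
import numpy as np

def best_period(y, periods=range(5, 80)):
    """Grid-search the sine period with the lowest least-squares error."""
    t = np.arange(len(y))
    errors = []
    for p in periods:
        X = np.column_stack([np.ones_like(t, dtype=float),
                             np.sin(2 * np.pi * t / p),
                             np.cos(2 * np.pi * t / p)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        errors.append(np.sum((y - X @ coef) ** 2))
    return list(periods)[int(np.argmin(errors))]

t = np.arange(100)
tank = 0.5 + 0.4 * np.sin(2 * np.pi * t / 25)  # win probability cycling every 25 matches
support = np.full(100, 0.5)                    # flat win probability
pooled = np.empty(200)
pooled[0::2], pooled[1::2] = tank, support     # alternate roles match by match

print(best_period(tank))    # → 25
print(best_period(pooled))  # → 50 (spurious period created by mixing)
```

The pooled fit is not merely noisier – it is confidently wrong, which is exactly the kind of spurious result mixing unrelated distributions produces.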
Nonetheless, if you do have the original outcome data in the order the matches were played, I would love to see it.
I addressed this point above, but it’s worth asking again: Do you really mean to suggest that the quality of my play abruptly reverses, on schedule, every 25 games?
Thank you, and you’re right!
If you do find a way to break the streaks, that itself would be very interesting. It would be a little tricky to tease apart from other factors, so it’s probably worth forming a hypothesis in advance. Good question though!
Likewise, if you do keep this data in the upcoming season, I would love to see it. It’s likely something will change with the matchmaker in Season 4 – but it’s anyone’s guess what, and we won’t know if we don’t keep the record.
You’re welcome, and thanks for reading.
Also thanks for reading!
Strongly agree. I can even see, in principle, some reasons that one might use a hidden metric for matchmaking – but only for things like smoothing out wild oscillations in rank, e.g. for a new account. Even then, I don’t think it’s strictly necessary, and the existing system clearly performs very poorly. Regardless of whether or not it’s doing what it’s “supposed” to do, it’s clear that everyone is having a bad time and hates it, and that should be enough reason to ditch it.
Phenomena like these are really noteworthy, and point back to another related problem with the current system: groups are hidden. Blizzard argues that they don’t want people to form preconceptions of the match and thereby “give up early”. Even if that is a worthwhile goal, the fact remains that it hides a factor that would be highly relevant both in explaining how the matchmaker is giving out such strange results and in interpreting why a given match went the way it did – if one team is being hard-carried by someone far outside the MMR distribution of the rest of the lobby, that is highly relevant. Even just seeing that information at the end of the match would be worthwhile.
Could you elaborate on this a little bit? If you’re talking about looking at the profiles of other players, I admit that I’ve had poor luck with this approach, as most of the profiles I try are private.