I am utterly impressed and totally flabbergasted that after 18 years, Blizzard has finally started giving us intelligent, meaningful debriefs on what went wrong with something.
That’s not sarcasm.
Thank you, and keep up the good work.
It won’t always go well or as planned, but this sort of transparency makes it harder to be upset about things when that happens.
Good on you all, guys.
I am glad we have an explanation.
It still doesn’t explain how this has been par for the course for every launch on servers like Illidan. Literally every one. These systems were not in place then.
So where do we go from here…? Maybe consider an overflow server for the first week that gets merged back and removed after the first reset?
Fact is - this happens EVERY TIME.
At least we are getting an acknowledgement this time. Back in SL, same deal, and the response was “smoothest launch ever”. Sure, if you weren’t on Illidan, Stormrage… basically any non “full” realm.
Wouldn’t have solved the problem encountered here and might have made it worse. When things are generating hella extra, unexpected network noise, synching with more nodes doesn’t make it any better.
Every XPAC introduces new code.
Some things don’t show up until a full-load run happens (and “we’re generating more data transfers than we can handle” is one of those sorts of things).
It was an error, a single line of code, out of millions.
It took some time to identify and a bit more to sort out what to do, and then actually doing it took more time - but it was professionally handled, and this post mortem should be the way things are done in the future with regards to explaining it.
Give us some time to learn to trust you guys and much of the bad feeling of the past will likely slough away.
You sure about that? The issue wasn’t encountered across every realm. Certainly not as impactfully.
Which is precisely why scaling horizontally would help. Essentially split the load. While I get that the WoW client might not support this natively, even something as “brute force” as free transfers to a temp realm, with a free transfer back on the next reset, would do (that is what I meant by “overflow” server).
There will always be bugs. Somehow they are only encountered on high pop realms. Consistently. Every xpac.
This I can agree with. They aren’t just ignoring it. Then again, it impacted more than just the “full” realms, so I don’t think they could just pretend it didn’t happen. Certainly not when they were encouraging everyone to watch other players play on streams - giving the issue an even wider audience.
I don’t begrudge the devs for the job they are doing. I would like to see some plans for preventing this in the future.
Identifying the current issue and solving it - does nothing for the next xpac.
This is a really neat explanation.
Only thing that would be neater is if you guys, like, re-enacted that scene from the HBO Chernobyl finale, but instead of it being about a nuclear power plant, it was about your servers.
Obviously I’m not a Blizzard employee, but from the description given, adding new nodes to be synched with would only have added another exponent to the exponentially large unexpected data transfers that were the culprit in the first place…
“Splitting the load” implies that the load can be split. In this case you’d be multiplying the problem as the issue wasn’t servers, it was network load and the CPU needed to process massively too much data being requested and processed.
High pop realms are very likely to see capacity-based problems before other realms by the nature of capacity-based problems. A low-pop realm might not see them at all. This was clearly a capacity-based problem.
Because there still has to be synchronization between the different shards at some point, adding a node would only exacerbate that.
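To put rough numbers on that, here is a toy model in Python. It is only a sketch under a big assumption: that every update made on one node has to be copied to every other node. Nothing in the post-mortem says the real servers work this way, and `sync_messages` is a made-up name for illustration.

```python
def sync_messages(total_updates: int, nodes: int) -> int:
    """Count cross-node messages in a toy all-to-all replication model.

    Assumption (hypothetical, not from Blizzard's post): each update
    produced on one node must be sent to every other node.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    # Splitting players across nodes does not shrink the number of updates;
    # each update just has (nodes - 1) more places to go.
    return total_updates * (nodes - 1)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "nodes ->", sync_messages(1000, n), "sync messages")
```

In this model the same 1000 updates cost nothing to sync on one node and 3000 messages on four, so adding nodes spreads the players out but grows the traffic that was the bottleneck.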
That’s not going to change from XPAC to XPAC and it is precisely the sort of problems that are going to get through alpha and beta testing because neither of those come anywhere near testing full-load capacity on a high-population server.
My suggestion is that you maintain alts on low-pop servers if you believe it to be inevitable and use those toons to start new XPAC play.
So in short, next time maybe add integration tests, stress tests, auto scaling, or maybe more sharding to the queues?
“Good software engineering isn’t about never making mistakes; it’s about minimizing the chances of making them, finding them quickly when they happen, having the tools to get in the fixes right away”
very true. But not a good excuse to have massive production failures imo.
So Monday was definitely a nightmare but good job on the recovery
Very interesting. Thank you for taking the time to explain. And thank you for working overtime to fix it as soon as possible.
I love these posts, I think it really helps players empathize with engineering and devs when we have issues such as we did at launch and the current lag issues with Azure Span.
If possible, like when you figure out why Azure Span is still laggy, please update us as to why.
This was an absolutely fascinating read. Thank you for providing us with an opportunity to see behind the curtain.
Thanks for sharing! Appreciate the transparency
As much as I hate to harp on this… y’all need to get some automation. I don’t mean unit tests. I mean bots. I don’t care if it’s AI or not, but spin up a couple of thousand headless copies of the client in docker or k8s that have local anti-cheat disabled… then have it go through a series of nasty stress tests. Or maybe just replay the event log albeit distributed through the various clients.
Sometimes you can’t find things unless it’s at scale with the real client.
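For what it's worth, the bot idea can be sketched in a few lines of Python. This is only an illustration of the shape of such a test, not how Blizzard tests anything: `fake_server`, `bot`, and `run_load` are hypothetical names, and the "server" is just a semaphore with a fixed number of slots standing in for real capacity. A real version would drive headless game clients instead.

```python
import asyncio
import time


async def fake_server(capacity: asyncio.Semaphore, work_seconds: float) -> None:
    # Stand-in for a real service: only a limited number of requests
    # can be processed at the same time, the rest wait in line.
    async with capacity:
        await asyncio.sleep(work_seconds)


async def bot(capacity: asyncio.Semaphore, work_seconds: float, latencies: list) -> None:
    # One simulated client: send a single request and record how long it took.
    start = time.perf_counter()
    await fake_server(capacity, work_seconds)
    latencies.append(time.perf_counter() - start)


async def run_load(bots: int, server_slots: int, work_seconds: float = 0.01) -> float:
    # Launch all bots at once and return the worst latency any of them saw.
    capacity = asyncio.Semaphore(server_slots)
    latencies: list = []
    await asyncio.gather(*(bot(capacity, work_seconds, latencies) for _ in range(bots)))
    return max(latencies)


if __name__ == "__main__":
    for n in (5, 50, 200):
        worst = asyncio.run(run_load(n, server_slots=10))
        print(f"{n} bots: worst latency {worst:.3f}s")
```

With only a handful of bots the worst latency stays near the per-request work time; once the bot count passes the slot count, requests queue up and the worst case climbs, which is the kind of thing that only shows up at scale.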
Short answer: the hamster fell off the wheel.
But really thank you for this post! Always interested in how things work “under the hood”.
I love this. As players we take for granted things that blizzard does and just assume it’s mismanagement. Reading this was like a little mystery story. Y’all should make youtube videos on these, and put some scary/ominous music behind it
For real though it was cool to hear what actually happened and I hope we get more of these in the future.
Thanks for the write-up! WoW is always pushing some fantastic tech behind the scenes (which goes underappreciated a lot of the time). Sucks that the launch was slightly marred by these problems, but you’ll get them next time.
I hope there are many more posts like this in the future.
Thank you for sharing this. It is a great insight into what goes on behind the scenes.
Keep up the hard work. And I appreciate the transparency and openness of these communications and explaining why it happened.
Thanks for the update! This was a really interesting read, I love getting a peek behind the curtain and seeing how things work. Really hope we get to see more of these types of posts in the future!
Very cool to hear. I do really love how awesome the WoW team has been with the player base with these types of things. The interaction and transparency are just amazing!!!
Thanks for the insight, but I need to bring up something off-topic, because it has hit a nerve with all five of us in our WhatsApp group:
Can we stop prioritizing the Horde all the time? Because of this obvious preference, the Alliance was systematically forced into a negative feedback loop, followed by the death of the raiding scene.
I want to let you developers know that it was fun while it lasted but you really need to start honoring the Alliance and its players going forward in any regard. The game already paid its price with the end of the faction war and how everything snowballed, have at least the dignity to improve the faction to some degree where they aren’t a minority but eye to eye. It is simply a bad feeling, especially when the marketing and the priority is non-alphabetical as well.