Hate to be the bearer of bad news, but this is not purely a coding issue; it has Blizzard's modus operandi written all over it. This is my line of work, and the fact that the original servers are still running with fewer issues shows this is a combination of problems, the central one being the same failure they had with the launch of Diablo III: poor planning and execution. They underestimated volume like they always do, and they skimped on quality assurance and control, opting for a few brief beta tests rather than a prolonged test period with proper CSRs and bug testers. Pure greed led to this, nothing less. Blizzard runs servers capable of handling millions of concurrent players, yet regardless of reusing existing architecture, they failed to spend even an iota of time upgrading the physical server hardware or the baseline server code, and the result is this abysmal failure.
You can defend them all you like, but as someone who works as an electrical engineer and network/server technician, I can say with near certainty that those excuses are just that: excuses. I'll go into detail and break it down further.
Per their post:
"[…]our global database, which exists as the single source of truth for all your character information and progress. As you can imagine, that’s a big task for one database, and wouldn’t cope on its own. So to alleviate load and latency on our global database, each region–NA, EU, and Asia–has individual databases that also store your character’s information and progress, and your region’s database will periodically write to the global one. "
This is unlikely. There is no realm selection outside of the Battle.net region selector at launcher login, and switching regions produces little to no change in character load times (there should be at least a short load period while the data center establishes the transfer of your data, meaning your characters should not appear instantly). After testing the connection to each realm and the associated data transmission rates, I saw little to no difference. So either all the databases sit in a single location, which would defeat the purpose, since server data centers are normally placed regionally to minimize latency and packet loss, to subdivide the workload by area, and to keep any one site from being overloaded with concurrent connections; these should function as independent servers, communicating back to a central hub only to update the main databases. Or (and this is highly unlikely) their hardware and architecture are so powerful that geographically separate servers somehow return nearly identical transmission rates down to the millisecond. Given the experiences we've all had, I'll let you deduce which is more likely.
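If you want to reproduce the kind of check I'm describing, here's a rough sketch in Python: time a TCP handshake against each region's endpoint from one location and compare. The hostnames are placeholders I made up, not Blizzard's actual addresses; the point is simply that genuinely separate regional data centers should show clearly different round-trip times from any single vantage point.

```python
# Rough sketch of the check I'm describing. The hostnames are placeholders,
# NOT Blizzard's real endpoints; substitute whatever you actually resolve
# the regional servers to.
import socket
import time

REGION_ENDPOINTS = {
    "NA":   "us.example-placeholder.invalid",
    "EU":   "eu.example-placeholder.invalid",
    "Asia": "kr.example-placeholder.invalid",
}

def tcp_rtt_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Average time to complete a TCP handshake, in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            times.append((time.perf_counter() - start) * 1000)
    return sum(times) / len(times)

for region, host in REGION_ENDPOINTS.items():
    try:
        print(f"{region}: ~{tcp_rtt_ms(host):.1f} ms")
    except OSError as err:
        print(f"{region}: unreachable ({err})")
```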
On to their next claim.
"On Saturday morning Pacific time, we suffered a global outage due to a sudden, significant surge in traffic. This was a new threshold that our servers had not experienced at all, not even at launch. This was exacerbated by an update we had rolled out the previous day intended to enhance performance around game creation–these two factors combined overloaded our global database, causing it to time out. "
This again goes back to what I just said: if this were purely an architecture issue, a traffic surge would cause latency, not a full-on timeout. A timeout like that is generally a symptom of under-provisioned server hardware and connectivity. They had not planned (as usual) to support the sheer number of concurrent connections, the system was overloaded, and the result was instability and an eventual timeout.
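To make that distinction concrete, here's a toy queue model (my own numbers, nothing to do with their actual capacity): a surge that stays within the backend's service rate just stretches wait times and then drains, while a surge beyond capacity grows the backlog without bound until every new request blows past its deadline and times out.

```python
# Toy queue model, purely to illustrate the difference between
# "a surge adds latency" and "a surge times the whole thing out".
# None of these numbers are Blizzard's; they are arbitrary.
def simulate(arrivals_per_s, service_rate=1000.0, timeout_s=10.0):
    backlog, worst_wait, timeouts = 0.0, 0.0, 0
    for arriving in arrivals_per_s:
        backlog = max(0.0, backlog + arriving - service_rate)
        wait = backlog / service_rate      # expected wait for a new request
        worst_wait = max(worst_wait, wait)
        if wait > timeout_s:
            timeouts += 1                  # requests now die at the client
    return worst_wait, timeouts

# A ten-second spike that is still within overall capacity: latency climbs
# to a few seconds, then the queue drains. Annoying, but nothing times out.
print(simulate([1500] * 10 + [500] * 50))  # -> (5.0, 0)

# The same spike against a backend that simply can't keep up: the backlog
# grows without bound and new requests start exceeding their deadline.
print(simulate([1500] * 60))               # -> (30.0, 40)
```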
On to their later claim.
"In staying true to the original game, we kept a lot of legacy code. However, one legacy service in particular is struggling to keep up with modern player behavior.
This service, with some upgrades from the original, handles critical pieces of game functionality, namely game creation/joining, updating/reading/filtering game lists, verifying game server health, and reading characters from the database to ensure your character can participate in whatever it is you’re filtering for. Importantly, this service is a singleton, which means we can only run one instance of it in order to ensure all players are seeing the most up-to-date and correct game list at all times. We did optimize this service in many ways to conform to more modern technology, but as we previously mentioned, a lot of our issues stem from game creation."
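For anyone unfamiliar with the term, here is a minimal sketch of what a "singleton" game-list service implies, as I read their description; the names and structure are mine, not their actual code. One instance holds one lock, and every create, join, and list request in the world contends for it.

```python
# Minimal sketch of a "singleton" game-list service, as I read their
# description; names and structure are my own illustration, not their code.
import threading

class GameListService:
    """One global instance; every operation funnels through the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._games: dict[str, str] = {}   # game name -> creator

    def create_game(self, name: str, creator: str) -> bool:
        with self._lock:                   # creation serializes with every read
            if name in self._games:
                return False               # duplicate name
            self._games[name] = creator
            return True

    def list_games(self, name_filter: str = "") -> list[str]:
        with self._lock:                   # even a read waits behind writers
            return [g for g in self._games if name_filter in g]

# Only one instance exists, so every player's create/join/list request
# worldwide contends for this single lock; that is the bottleneck they
# describe.
GAME_LIST = GameListService()
```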
Let's start by stating that this singleton excuse is absolute bollocks. I've played Diablo II since its original launch; you could make hundreds of games in quick succession, and even at its peak the game rarely experienced issues. You'd have the occasional disconnect, but that was usually down to your own connection, not Battle.net, and back then they were definitely using regional data centers, as the lack of cross-region functionality made evident. The fact that they have instituted a one-minute lockout after any ATTEMPT at game creation (even if the creation fails due to a duplicate name, a timeout, etc.) tells me they fundamentally botched something in the code. There is absolutely no reason, in this day and age, that an instance-driven game like this should have timed lockouts of this kind, nor queues that last for hours.
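For contrast, a per-account cooldown that only charges successful creations is trivial to write. This is purely my own illustration of what a sane throttle would look like (the 60-second figure just mirrors the lockout players are hitting), not a claim about how their code actually works.

```python
# Hedged sketch only: a per-account cooldown that charges successful
# creations, not failed attempts. The 60-second figure mirrors the lockout
# players are seeing; the rest is my own illustration, not their code.
import time

class CreationCooldown:
    def __init__(self, cooldown_s: float = 60.0):
        self.cooldown_s = cooldown_s
        self._last_success: dict[str, float] = {}

    def can_create(self, account: str) -> bool:
        last = self._last_success.get(account)
        return last is None or time.monotonic() - last >= self.cooldown_s

    def record_success(self, account: str) -> None:
        # Only a game that actually got created starts the timer; a duplicate
        # name or a timeout shouldn't cost the player their next attempt.
        self._last_success[account] = time.monotonic()
```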
In summation, there is no way in Hell you could convince me that anything but poor planning, poor execution, and a lack of proper testing procedure led to the current state of affairs. None.