The queue issues seem pretty consistent with a problem that's causing the game servers themselves to crash. For those who don't recall, each realm is served by a cluster of individual game servers. This is the basis for Anni charm hunting: only some of those servers were configured at any given time to allow Uber Diablo to spawn at all, and the individual server IP addresses were used to track whether you were in a "good" game or not.
With that said, I was playing both before and during the beginning of the queue increase on Friday at around 11 PM EST. At first the queue numbers were low and increased over time, but more important was the speed at which the queue emptied. Early on, the queue would clear a large number of games (100+) per update, suggesting that games were still being created quite quickly. As time went on, though, the rate at which the queue cleared dropped off sharply, and the queue length grew. By 3 AM EST, the queue was clearing about 30-50 games per update and had grown to around 2,000. Fast forward to now: the queue is 6k+ and clearing less than one game per 5-10 updates, which means games are being created very slowly.
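To make the back-of-the-envelope math concrete, here's a minimal sketch of why a falling drain rate balloons the queue. The arrival rate and update counts are made-up numbers for illustration, not measurements; only the rough drain figures come from what I observed:

```python
def queue_length(initial, arrivals_per_update, drain_per_update, updates):
    """Net queue growth when players arrive faster than games are created.

    All inputs are hypothetical; this just shows the compounding effect of
    a gap between arrival rate and drain rate.
    """
    return max(0, initial + (arrivals_per_update - drain_per_update) * updates)

# Early Friday: draining ~100 games/update keeps pace with arrivals,
# so the queue stays near zero.
print(queue_length(0, 100, 100, 60))  # 0

# Later: drain drops to ~40/update while arrivals hold steady,
# and the backlog builds quickly over the same number of updates.
print(queue_length(0, 100, 40, 60))   # 3600
```

With arrivals holding steady, even a modest drop in drain rate compounds on every update, which matches how the backlog went from near-zero to thousands overnight.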
Further still, while I was playing during the early part of the queue increase, no fewer than three games outright disconnected me mid-play and left me stuck until the database caught up, realized my character was no longer in a game, and reverted my character to the last snapshot of its status that the game server had sent back to the database (this is one form of the "failed to join" problem people can experience). This happens when a server shuts down with active games open. While it does occasionally happen, it's typically fairly rare, and the servers usually come back up shortly afterward. For it to happen three times in 1-2 hours is very unlikely unless something is genuinely wrong.
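For anyone unfamiliar with that rollback behavior, here's a hypothetical sketch of what "revert to the last snapshot" means in practice. The class and field names are mine for illustration, not anything from Blizzard's actual backend:

```python
class CharacterStore:
    """Toy model of the database side of the snapshot mechanism described
    above: the game server periodically pushes character state to the DB,
    so a server crash loses everything since the last snapshot."""

    def __init__(self):
        self.saved = {}       # last snapshot per character, as the DB sees it
        self.in_game = set()  # characters the DB still believes are in a game

    def snapshot(self, name, state):
        """Game server periodically reports character state back to the DB."""
        self.saved[name] = state
        self.in_game.add(name)

    def server_crashed(self, name):
        """The DB eventually notices the game is gone and frees the
        character, but only the last snapshot survives the crash."""
        self.in_game.discard(name)
        return self.saved.get(name)


db = CharacterStore()
db.snapshot("Sorc", {"level": 80, "gold": 5000})
# ...server crashes mid-play; any progress since the snapshot is lost...
print(db.server_crashed("Sorc"))  # {'level': 80, 'gold': 5000}
```

Until that `server_crashed` cleanup runs, the database still thinks the character is in a game, which is why you sit stuck and unable to join anything in the meantime.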
With this in mind, a pretty reasonable hypothesis (IMO) is that some problem affecting all US East servers is causing individual servers to crash and then either hang or fail to reboot automatically. I can't speak to what the bug is (memory leak, DDoS, a game-creation exploit that crashes the servers, who knows), but it's severe enough to disable game servers one by one. This would explain why the problem starts slowly and worsens over time as servers drop off, until it reaches the point we're at now, where we've likely got only one or two servers working at all on US East and zero games appearing for anyone in any game mode.
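To illustrate the cascade hypothesis, here's a toy model of servers dropping off one by one overnight. Every number in it (server count, hourly failure rate, per-server capacity) is an assumption picked for illustration, not anything known about the real cluster:

```python
def surviving_capacity(servers=20, per_server_games_hr=200,
                       fail_rate=0.3, hours=8):
    """Toy model of the crash-cascade hypothesis: each hour roughly 30% of
    the still-running servers crash and stay down (no one in the office to
    reboot them), and cluster-wide game-creation capacity is simply
    (live servers) x (games each can host per hour). All numbers are
    made up."""
    capacity = []
    alive = servers
    for _ in range(hours):
        alive = round(alive * (1 - fail_rate))
        capacity.append(alive * per_server_games_hr)
    return alive, capacity


alive, cap = surviving_capacity()
print(alive)             # 1 with these assumed numbers
print(cap[0], cap[-1])   # 2800 200
```

The point isn't the specific numbers; it's the shape: capacity decays steadily hour over hour, which matches the queue draining slower and slower until almost nothing clears, rather than everything failing at once.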
When people are in the office, the problem can be temporarily remedied by manually rebooting the servers. But this problem started at roughly 7:30 PM PST, shortly after most people had left the building for the weekend, leaving no one to put out the fires as they start. My guess is that until an actual fix is found (I really don't understand why the code base for US East can't be compared against US West to find the issue, or simply rolled back to a previous revision), we'll keep experiencing these issues on a regular basis, particularly on weekends.
Ultimately, something needs to be done about the root cause. To me, it seems clear that band-aids are being applied to a persistent problem. A tourniquet can stop the immediate bleeding, but it's not a long-term solution, and it creates more hassle for players and more work for employees.
Either that or buy some new hamsters.