Queues are ridiculous

If you guys are really interested, I work in distributed cloud computing daily.

Most companies don’t manage their own servers; they rent server capacity from a provider like Google Cloud Platform or Amazon Web Services.
They can scale up the machine type of the server they rent, meaning add more memory to it, but that only goes so far, because of a common principle in modern computing: scale out > scale up. It’s cheaper and more efficient to distribute server load across many inexpensive machines than to have a single very high-memory machine handle all the server operations.
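
To make that concrete, here’s a toy back-of-the-envelope in Python. The prices are invented for illustration, not real GCP/AWS rates:

```python
# Illustrative only: prices are made up, not actual cloud pricing.
big_machine = {"mem_gb": 208, "cost_per_hr": 2.50}   # one "scale up" box
small_machine = {"mem_gb": 26, "cost_per_hr": 0.25}  # one "scale out" box

# Memory you get per dollar-hour under each model
scale_up_gb_per_dollar = big_machine["mem_gb"] / big_machine["cost_per_hr"]
scale_out_gb_per_dollar = small_machine["mem_gb"] / small_machine["cost_per_hr"]

print(f"scale up:  {scale_up_gb_per_dollar:.0f} GB per $/hr")
print(f"scale out: {scale_out_gb_per_dollar:.0f} GB per $/hr")
# Eight small boxes match the big box's memory for less money, and you can
# always add a ninth when load spikes -- the big box is a hard ceiling.
```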

Now, nobody but Blizzard devs is actually privy to how their server operations are written, but here’s what makes the most sense to me given the server features they currently support, specifically layers:

In a distributed system there is (often) a single coordinating machine, called the namenode, which distributes work to the other machines, called worker nodes. The worker nodes process that data according to a set of instructions (code) and send the results back to the namenode, which tracks the state and information across all the worker nodes.
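
Here’s a minimal sketch of that coordinator/worker pattern in Python. The names `NameNode` and `process_chunk` are mine, purely for illustration; a real cluster does this over a network rather than with local processes:

```python
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    """Stand-in for whatever work a worker node does on its slice of data."""
    return sum(chunk)  # toy "processing"

class NameNode:
    """Toy coordinator: splits work across workers and tracks their results."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.worker_states = {}  # namenode's view of each worker's result

    def run(self, data):
        # Split the data into one chunk per worker node
        chunks = [data[i::self.num_workers] for i in range(self.num_workers)]
        with ProcessPoolExecutor(max_workers=self.num_workers) as pool:
            for worker_id, result in enumerate(pool.map(process_chunk, chunks)):
                # Workers send processed info back; the namenode records it
                self.worker_states[worker_id] = result
        return sum(self.worker_states.values())

if __name__ == "__main__":
    print(NameNode(num_workers=4).run(list(range(1000))))
```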

Worker node machine types are typically either standard or highmem, with 4, 8, 16, or (very rarely) 32 virtual cores; at the 32-core size that works out to roughly 120 GB of memory for standard and 208 GB for highmem.

I expect that for a given server, layers are distributed across worker nodes individually, and the number of worker nodes is dynamically scaled with server load, but it’s CAPPED at a certain number of layers, and for good reason: certain server operations are not and cannot be layered. Those operations are the Auction House and Mailbox.
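
Purely speculative, but a capped layer autoscaler could look something like this; the constants `MAX_LAYERS` and `PLAYERS_PER_LAYER` are invented, since Blizzard’s real numbers aren’t public:

```python
MAX_LAYERS = 8            # invented cap; the real number is Blizzard's secret
PLAYERS_PER_LAYER = 3000  # invented comfortable load per layer/worker

def layers_needed(online_players):
    """Scale the worker/layer count with load, but never past the cap."""
    wanted = -(-online_players // PLAYERS_PER_LAYER)  # ceiling division
    return min(max(wanted, 1), MAX_LAYERS)

# Past the cap, extra players queue instead of spawning another layer:
for players in (2500, 9000, 40000):
    print(players, "players ->", layers_needed(players), "layer(s)")
```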

The Auction House and Mailbox need to be kept consistent across layers at all times, and players on all layers are interacting with the same Auction House and Mailbox simultaneously. Remember that the namenode needs to track the state and info of all the worker nodes in real time. This means servers are bottlenecked by the operational load on that one machine, and as I mentioned above, the most memory you can provision for a single machine is 208 GB with 32 vCPUs.
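
To see why that single shared state is a bottleneck, think of the Auction House as one lock every layer has to contend on. A toy illustration (the data model is made up):

```python
import threading

auction_house_lock = threading.Lock()
auction_house = {}  # one shared order book for ALL layers

def post_auction(layer_id, item, price):
    # Every layer serializes through the same lock so all players see one
    # consistent Auction House. This is the bottleneck: adding more layers
    # just adds more contention on this single shared resource.
    with auction_house_lock:
        auction_house.setdefault(item, []).append((layer_id, price))

threads = [
    threading.Thread(target=post_auction, args=(layer, "Copper Ore", 5 + layer))
    for layer in range(4)  # four layers hammering one shared structure
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(auction_house)
```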

This probably isn’t the full picture, but I hope it’s a more thorough explanation of why “throw more money/memory at it” is not always, or even often, a solution to server load.