Layering isn't the problem. It's a symptom.

TLDR: The following explains in technical terms how and why Blizzard has resorted to layering instead of fixing the actual problem. If you’re interested, read on. And believe it or not - this IS the short version.

Blizzard? Are you listening?

Here’s the actual reason there are queues and layering: Lack of foresight and planning on the part of both management and engineering.

Now, I’m not privy to the details of how Blizzard actually has these servers set up, as that’s all NDA type stuff and no-one outside the company will (or should) have that. However, as a 30+ year veteran in the networking and IT field, I can say this without hesitation.

It doesn’t matter either way if the servers are physical or virtual. It starts with capacity. Not player caps. I’m talking CPU processing, memory and the speed of storage solutions for each core unit that represents what the players see as a “server”.

Layering was originally designed less as a tool for correcting and adjusting server capacity than as a convenience for the client (or player, if you will). At the dawn of WoW, hardware-accelerated 3D rendering was new, at least at the consumer level. Having over 300 player and NPC models together with the terrain and object meshes was brutal on old 3D cards (or worse… SOFTWARE rendering!). Layering fixed this by reducing the number of players on screen in densely populated areas, and thus reduced the poly count the client hardware had to render.

Let’s say that we have a “server” that can support a maximum of 1000 players, for example. The common practice in this case is to over-subscribe that server and give it a theoretical maximum of 1500, hopefully having done enough research to know that even at peak times, only about 950 of those 1500 players are going to be connected and playing.

This isn’t the problem either. Or, at least, it shouldn’t be if there was enough foresight to build in the ability to virtually shift some of this hardware capacity to where it’s needed. Like I said, it’s common practice, and if it’s done right, no-one ever notices. This is especially true if the servers are hosted on virtual infrastructure, e.g. AWS or Azure, which should be able to dynamically expand its capacity based on demand.
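The over-subscription arithmetic above can be sketched in a few lines. These numbers mirror the hypothetical 1000/1500/~950 example, not any actual Blizzard figures, and the 65% peak-concurrency rate is an assumed value:

```python
# Hypothetical capacity-planning sketch for an over-subscribed realm.
# All numbers are illustrative, taken from the example in the text.

HARD_CAPACITY = 1000          # players the hardware can actually serve
SUBSCRIBED = 1500             # players allowed to call this realm "home"
PEAK_CONCURRENCY_RATE = 0.65  # assumed fraction online at the busiest hour

expected_peak = int(SUBSCRIBED * PEAK_CONCURRENCY_RATE)
headroom = HARD_CAPACITY - expected_peak

print(f"expected peak: {expected_peak} players")  # 975
print(f"headroom:      {headroom} players")       # 25
```

The whole bet rides on that concurrency rate: if research puts it at 65% and reality delivers 80%, the realm is suddenly 200 players over its hard capacity, and you get queues.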

This problem becomes contractual and is limited by budget - or - how much is Blizzard willing to pay these virtual services for the amount of capacity they need, when they need it. The questions then become:

1: Has Blizzard reached their contracted capacities and if so, why are they not negotiating for higher capacities?

2: Has Blizzard reached the virtual capacity limit set by the hosts without the option of additional capacity? If so, then this is a serious oversight by the design team. Either player caps need to be reduced per server instance, OR special contract negotiations need to take place with the hosts in order to find more capacity - if available.

But what if the hardware isn’t virtual?

Typical set-ups for this use-case indicate that each “server” isn’t a single server, but a collection of satellite servers coordinated by a central director, or “root” server.

In this case, additional hardware may not be needed - IF - the satellite servers are configured in a dynamic fashion where their capacities can be freed up from one server instance and transferred to another with minimal effort.

Example:

Each server instance has 1 Root server and 6 satellites (blades, shards, leaves, or whatever vernacular your company subscribes to). Each instance has a capacity of 1500 players (over-subscribed).

Server A is a PVE server with 992 players active. Near technical capacity.

Server B is a PVP server with 425 players active. 43% technical capacity.

The Fix: Gracefully halt shard 6 of server B, transferring any active players to shards 1-5. Reconfigure shard 6 to be shard 7 of the server A cluster. No additional hardware needed and can be easily reversed. A good set of engineers would have foreseen this scenario and planned accordingly.
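The fix above can be modeled as a toy sketch. The cluster/shard layout, player counts, and function names here are purely illustrative and don't reflect Blizzard's actual architecture:

```python
# Toy model of the shard-reassignment fix: drain the last shard of an
# under-utilised cluster into its siblings, then re-home the empty shard
# onto a cluster that is near capacity. Purely illustrative.

from dataclasses import dataclass, field


@dataclass
class Shard:
    shard_id: int
    players: list = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    shards: list

    def total_players(self) -> int:
        return sum(len(s.players) for s in self.shards)


def drain_and_transfer(src: Cluster, dst: Cluster) -> None:
    """Gracefully halt src's last shard, migrate its players to the
    remaining shards, and attach the freed shard to dst."""
    shard = src.shards.pop()                    # take the shard offline
    siblings = src.shards
    for i, player in enumerate(shard.players):  # migrate active players
        siblings[i % len(siblings)].players.append(player)
    shard.players = []
    dst.shards.append(shard)                    # becomes dst's shard 7


# Server B (PVP, lightly loaded) donates a shard to Server A (PVE, near cap).
b = Cluster("B", [Shard(i, [f"p{i}_{j}" for j in range(70)]) for i in range(1, 7)])
a = Cluster("A", [Shard(i) for i in range(1, 7)])

before = b.total_players()
drain_and_transfer(b, a)
assert b.total_players() == before  # no players lost in the migration
assert len(a.shards) == 7           # Server A gained capacity
```

The operation is symmetric, so reversing it when Server B's population recovers is the same call with the arguments swapped.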

It’s not about whether or not each server instance was designed with enough capacity to begin with. It’s about whether or not each server instance was designed to be able to have ADDITIONAL capacity if the need arose.

I can’t even get into the server software side of things without running on for nine more paragraphs. Building in the fluid dynamics of adjustable capacity and memory management starts on the whiteboard before the first character on the keyboard is struck. If this is botched or forgotten, a complete re-write of the server engine is often the answer, and in that case, everyone loses until the code monkeys hash it all out. (That’s code humor).

Always overestimate your repair estimates to both management and your customers and you can walk away with your reputation as a miracle worker. Anyone who recognizes this will be familiar with the philosophy of one Montgomery Scott of the USS Enterprise. Corny, but no less effective as an engineer’s mantra.

And to the executives who are worried about their bottom line? To you, I say this: If you are going to continue to beat the living daylights out of this dead horse, you should at least invest enough time and money to make it look like it’s kicking every now and again. It’s not hard. It’s probably not all that expensive, provided you have good, experienced people who know what they’re doing.

I’m sure there’s more to it than this. There always is. My point is that very little is impossible with today’s technology. You don’t even have to be particularly innovative or revolutionary; most of that has already been done in one form or fashion. You just have to be willing to make the investment, whether that be time, money, or both, and employ the right people with the right experience. More often than not, clients cry out for what they think will fix it, when in fact the problem is unrelated. Clearly the problem we face is design. Blizzard’s engineers should feel confident, and be backed by management, when they tell players to get stuffed while they work on an actual fix, rather than implementing a series of band-aids while the whole project continues to hemorrhage its lifeblood all over the floor.

So, go ahead Blizzard. Gimme a call. I’ll fix your stuff for ya.

9 Likes

Bruh. You’re oozing nerd.
(No, seriously, thank you, Code Monkey)

2 Likes

One of the issues with Classic (vs Retail, for example) is that it was not designed to be able to scale up, in the same way.

Layering (unlike sharding, used in Retail) was designed to implement layers of the entire Classic world. This is due to things like NPCs wandering through zone boundaries, etc., and it’s likely that given more time

… there would have been more foresight and planning. And perhaps even further iterations on top of layering.

It seems to be one of those problems of, “We need this solution for this problem we might have.” What level of resources should you throw at it without significant data showing that it is indeed an issue?

And then, when it is an issue (at launch for example), and all the data shows that it’s quickly becoming a non-issue… i.e. a massive spike in users at launch, followed by declining activity over time, what level of resources should you throw at going back to revisit that temporary solution and its implementation?

Then suddenly, BAM! A global pandemic! And, “Oh hey, remember that solution that worked? We could use that.”

“Oh yeah, but those side effects.”

“Umm right, but we kind of need to do something, and that works.”

“Yup.”

2 Likes

Classic WoW was obviously designed to maximize profits on as bare-minimum hardware as they could get away with. I wouldn’t be surprised if 100% of the hardware was salvaged from something else and reused to save money.

Blizzard sadly has been infected by Activision beyond the point of redemption.

They know people will pay to play Classic regardless of how crap the servers or the experience are. Blizzard has gotten it into their head that quality doesn’t matter anymore. Screw testing, patch it later.

8 Likes

Yeah, they’ve already come out & said it’s all being run remotely. Most times these days it’s hosted by other companies. It’s not always ideal.

1 Like

You raise good points. Sadly, I wasn’t able to address these without writing an entire book, lol!

So, in my experience in the industry, all of this can be completely avoided, and with a minimum of up-front expense, when you plan accordingly.

Another example to explain this:

Server cluster A runs the “Blowhole” server for wow classic. If you planned accordingly, Blade 5 can be pulled offline and repurposed for server cluster M running a Starcraft II server named “Hardnose”.

This is a one-two punch approach combining hardware and software that’s flexible enough to be used anywhere with a minimum of reconfiguration.

With virtual servers, this matters less. The nature of AWS and Azure is to address this very problem, among others.

If you start with the correct design philosophy, you save untold amounts of time and money, not to mention headaches for you and your clients. Design your hardware to be as general-purpose as possible. Design the software so that it’s flexible enough to deal with increased / decreased capacity from the get-go. These ensure that you never need to worry about the capacity you’ll need for the next pandemic, provided you invested wisely at the start. You can’t anticipate everything. That’s impossible. But you can plan to be flexible.

1 Like

That’s well and good, but the sad reality of the gaming industry is profits over everything.

Sure you can address the issues and fix them, or you can use hardware you were going to get rid of as it’s obsolete stuff that used to make up Heroes of the Storm servers or whatever else and do next to nothing with. Who cares? Why waste money on hardware and people when you can scrap together some junk and hire 3 people to manage the entirety of classic wow. People will still play it, they’ll complain but still sub and play it.

The gaming industry is in a death spiral following Hollywood and movies. The quality has gone so far down the toilet it’s embarrassing.

5 Likes

Very well said

Yeah, and sadly that’s really on the nose of it. Don’t even get me started on the woeful state of software development these days. Most coders these days are self-taught or taught by institutions that have no idea what the core principles of coding entail.

Server software should be super lightweight and nearly microscopic in comparison to client software. Running on older hardware shouldn’t be a huge problem. Code like you only have the hardware capacity of a Commodore 64 and refine it from there. Don’t rely on the fact that you have a 4 GHz processor and 16 gigs of RAM to work with. That’s just lazy.

That goes for the good folks making the programming libraries too, well-intentioned as they are. They fall into this trap and write something in 1000 lines of code where 200 would have done the job.

Sometimes it’s to meet unrealistic deadlines set by corporate nincompoops who shouldn’t be managing a product that they don’t understand.

Hire. From. Within.

1 Like

I work IT in a totally different industry, the defense sector, and it’s night and day compared to the crud the gaming industry produces.

I think the core issue is most people that work in the gaming industry are fresh out of college and have a “passion” for it. Which basically means they value working in the gaming industry over actually getting better at their job and learning to actually be a decent coder.

2 Likes

Sure. But, you’re also almost definitely going to miss things, and have issues. Especially if you’re dealing with distributed systems, and massive scale.

Indeed, the issues we’ve had with WoW Classic are relatively minor. I don’t claim to know the limitations the dev team may have faced with regards to resources, time, and retrofitting their server-side codebase from 2006 (and earlier) onto their modern infrastructure, but I am relatively confident that any data they had to suggest player behavior would have come from historical data (from Vanilla), Retail data, and possibly private-server data (though I’d be skeptical of this).

I’d be curious as to how modern infrastructure design vs software development practices from 15 to 20 years ago defined the constraints.

All of the issues we are running into in WoW Classic have been resolved in Retail. Implementing the same resolutions though would change Classic, likely fundamentally… and this imposes technical constraints.

Indeed, this makes me suspect that part of the reason for the changes to Azeroth in the Cataclysm expansion was to roll forward, rather than rework, the design decisions of the past that were tied to the hardware constraints of the past…

Imagining the scale and complexity of player interaction in World of Warcraft makes me suspect that it would make for some very interesting PhD-level research, and I doubt there are a lot of comparable examples to contrast it with.

Having said that, I think that a lot of the time when designing a system, you can plan for situations you’ve experienced (and sometimes that leads to over-engineering), and you can design to predicted scale, but it’s incredibly difficult to account for the unknown unknowns, and we see this in all types of systems across all domains and industries.

As a bit of an aside, I recently came across some of Mel Conway’s presentation slides for a talk where he brings up Fred Brooks’ Mythical Man Month, among other things, when diving into his experience with, and understanding of developing systems…

http://melconway.com/keynote/Presentation.pdf

Something that I liked from this was Brooks’s Second System Effect:

  1. Plan to throw one away; you will anyway
  2. First system.
  3. Second System.
  4. Finally ready.

I’ve always been fascinated by WoW, and how it has continuously been rolling forward; no EoL for past content, but just continuous iteration on current software and hardware (fixing the plane, while flying), and now with Classic, going back to the beginning, to do it all over again, but retrofitting old software design (server-side) on top of new hardware!


EDIT:

idk about the self-taught comment, as I feel like the most competent coders historically are autodidacts anyway, but I appreciate the point you’re making.

I feel like Go was (at least in part) meant to address this, and very much does, as a sort of passing of the torch from the original C, and UNIX developers to the Cloud generation.

I also feel like it’s worth mentioning that rather than the number of lines of code in a library, count the number of dependencies that it pulls in, for potentially a single function call that merely pads a string with some spaces, for example :wink:

1 Like

Part of the problem here stems from poor software management. Classic was retrofitted on top of the Retail engine. The Retail engine already has plenty of issues on its own and wasn’t really designed from the outset to be backward compatible with the Classic structure. This is natural. More issues will arise through the inherent nature of trying to fit a round peg into a square hole. The better course of action may have been to entirely strip down the Retail engine to the bare metal ( I have NO idea if this was done or not, but I’m guessing not, since there were likely unrealistic time constraints placed on the coding staff ) and completely rework it - BUT - also keep it compatible with current hardware configurations.

Also, a note on hardware: Obfuscation.

The actual nuts and bolts of hardware haven’t changed much since the inception of the integrated circuit. The silicon has gotten smaller and faster. A finite number of complex or repetitive maths have been given dedicated hardware instead of all having to be done by the slower, more general-purpose ALU. We’ve made our data paths bigger and our transfer rates faster, but at the end of the day, most of this is obfuscated by the operating system you’re programming for. Unless you’re programming in Assembly, which is unrealistic at scale. For the most part, we’re still using the same silicon logic combinations, made with transistors and resistors, that we were using 50 years ago.

What have evolved are Protocols. Even these are mostly obfuscated by the OS.

If you programmed a game with VESA graphics for Windows 3.1 on an old 25 MHz 486, then (theoretically) installed Windows 3.1 on a modern PC (a supercomputer in comparison), your program might run too fast if it wasn’t given proper timing restrictions. But it wouldn’t care one iota about what hardware it was running on.

Again, I could write a book, dude. I’m happy to see a fellow nerd getting their geek on. I used to get beat up for this back when nerds weren’t cool. Keep up the good work and never stop learning!

So much this ^

1 Like

Shots fired! :wink:

I’d be interested in an example of good software management you could point to as the bar by which we can measure and determine this to be deemed “poor.”


I mean, I assume this is true, as I would with any complex software, but it isn’t immediately obvious (to me) what issues would make this worth pointing out.

Oh. Yes, there are going to be issues with complex software, this is natural. :slight_smile:

Probably, if by unrealistic you mean, “you don’t have 10 years to figure this out, you have 1.”

Multi-core processing. Concurrent routines. CPU vs GPU. These are non-trivial differences, in that it would be incredibly difficult to program with any sort of anticipation of how to take advantage of how these things would change over time.


Certainly, we are still using von Neumann architecture, which builds on much of the work of Claude Shannon, and naturally uses languages that use the logic gates we are all familiar with.

Such a time never existed. Nerds have always been the epitome of cool. :wink:

1 Like

It’s not really a hardware limitation at this point. Hell if you look at the specs of the original WoW servers they would be a joke today and I’m sure the specs of whatever they have hosting it now are much much higher and bandwidth is more available now as well. And while I can’t speak for others, I’m having a much smoother classic experience than I did in vanilla.

The issue is that the game world is designed with a specific number of players in mind, and just throwing hardware or software performance tuning at that doesn’t help anything. Are there other design options they could take? Sure, however those have some significant potential impacts as well.

They could simply make the world bigger, but then we wouldn’t really be playing the same game, as the maps would be different (not to mention the impact on things like travel time). It also can’t just be turned off when temporary issues like what’s going on now go away. They could turn on dynamic spawn rates and multi-tag mobs, but those have significant gameplay impact constantly, as opposed to layering.

Are there solutions other than layering? Potentially, sure. Are they more or less impactful? Well, that’s very subjective, especially for a situation caused by an extreme, completely non-game-related crisis.

TLDR: I take this game way too seriously.

Subjective… and different for different people, if existent at all.


If you’ve spent ~40 hours/week playing it for 10+ years, it could be argued that you don’t take it anywhere nearly close to seriously enough.

1 Like

I wouldn’t wait for that call, you are wrong. I’ll just leave a quote of my own from another thread on the subject:

Even in a fantasy setting where Blizzard can just keep throwing more and more server horsepower at the problem, it still doesn’t work. The client still has to run on our end, and there are limits to how many players can appear around us before it starts degrading the quality of game play. Even with unlimited server power we would still be bottlenecked by the client systems.

Let’s go a step further. Let’s imagine that both the server and the client had unlimited power. The gameplay itself would still have limits. We already see complaints every day about the excessive competition for tradeskill materials and quest mobs. There are only so many players the game world can support and a lot of people would argue we are already past that point now.
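The client-side bottleneck argument above can be illustrated with a small sketch: no matter how many players the server tracks, a client typically only draws the nearest handful to keep frame rates up. The draw cap and positions here are made-up values, not WoW's actual numbers:

```python
# Sketch of the client-side bottleneck: with 300 players stacked in one
# spot, a client with a draw cap of 40 renders only the 40 nearest.
# The cap and coordinates are illustrative, not WoW's real values.

import math

MAX_RENDERED = 40  # hypothetical client-side draw cap


def visible_players(me, others, cap=MAX_RENDERED):
    """Return the `cap` players nearest to `me` (2D positions)."""
    def dist(p):
        return math.hypot(p[0] - me[0], p[1] - me[1])
    return sorted(others, key=dist)[:cap]


# 300 players packed into one bank square; the client draws only 40.
crowd = [(i % 20, i // 20) for i in range(300)]
drawn = visible_players((0, 0), crowd)
assert len(drawn) == MAX_RENDERED
```

So even "unlimited" server power just moves the ceiling: the remaining 260 players either pop in and out of existence on screen or tank the frame rate.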

Yes, I get that there were people who would rather have just kept queues. I also think those people grossly exaggerate the actual impact of layering.

It’s interesting that you say that, as in my particular case, I would rather have just kept queues, but I am not really objecting to the usage of layering to alleviate the queues for those who were unable to manage their login times.

Yes, it’s almost like there’s a massive worldwide event that is creating extreme issues with login times… and a temporary solution is appropriate.

1 Like