Successful PTR Stress Test -- Thank you!

Remove the 30 instance lock cap, thank you.

3 Likes

Use me as a "please remove the 30 instance daily cap" button!

2 Likes

If you could go ahead and remove the ridiculously stupid 30/day instance lockout, that’d be great.

2 Likes

Thank you, definitely excited for AQ!

I remember being at the original AQ gate opening and just lagging out and crashing over and over all night, so hopefully the stress test helped.

Well it’s good to know the AQ event is going to go off without a hitch or any issues because of this.

Hey Everybody!

I want to personally thank everyone who came out to help with our stress test yesterday. We had the whole WoW Classic team in there, and we really enjoyed our interactions with all of you.

Here are a few more details about the test.

What we saw

A lot of people asked during the test if performance is going to be that bad in the live game, while some joked that they thought it was ready, and we should ship it. Or maybe they weren’t joking? After all, the experience in our stress test yesterday was pretty similar to the original AQ gate opening in 2006. We had a ton of lag, some server crashes, and when players gave up and the population dwindled, the event finally completed. We are planning to do better than that, but we won’t be able to eliminate the lag entirely.

I especially want to thank all of the players who were stuck at the end of flight paths, because we found and fixed the issue with that. As with many issues, once we found the root cause it was easy to fix and turned out to be contributing to other problems, too. So to all of you who saw a geometry salad at the end of your flight: thank you; you made this better for everybody.

If you hung around until about 5:00 p.m. PDT, our test conditions changed to a point where the lag you were seeing was close to what we expect to see in the live game. We have no choice but to make a trade-off between server lag and population density. The more people, the more lag, and eventually, with too many players in the same area, the lag gets so bad that the server thinks it’s deadlocked (a fancy computer science word for “stuck and can’t recover”), and it restarts.

A restart because the server thinks it’s deadlocked is a crash, but it presents a special challenge. Other kinds of crashes happen when a program is trying to do something really bad, so we find the bad thing, and fix it, and that’s it. Deadlocks are more challenging because there’s no single problem, just a lot of jobs getting further and further behind. There are still improvements that we can make to address this.
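To give a rough picture of how that kind of detection can work (this is a generic sketch, not our actual server code; the names and the 30-second limit are invented), a watchdog thread can force a restart when the simulation's heartbeat stops advancing:

```cpp
// Minimal watchdog sketch: if the world-update thread stops making
// progress for too long, assume it is deadlocked and restart the process.
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>

using Clock = std::chrono::steady_clock;

// Heartbeat written by the world-update thread; read by the watchdog.
std::atomic<Clock::rep> g_lastTick{Clock::now().time_since_epoch().count()};
constexpr auto WATCHDOG_LIMIT = std::chrono::seconds(30); // hypothetical threshold

void worldUpdateLoop() {
    while (true) {
        // ... process movement, spells, and auras for every player ...
        g_lastTick = Clock::now().time_since_epoch().count(); // heartbeat
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}

void watchdogLoop() {
    while (true) {
        Clock::time_point last{Clock::duration{g_lastTick.load()}};
        if (Clock::now() - last > WATCHDOG_LIMIT) {
            // The simulation has fallen too far behind to recover: restart.
            std::exit(EXIT_FAILURE); // a supervisor would relaunch the server
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

int main() {
    std::thread world(worldUpdateLoop);
    std::thread watchdog(watchdogLoop);
    world.join();
    watchdog.join();
}
```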

Population density

Our first problem is one of quadratic scaling: the work grows with the square of the number of players. Imagine a Blizzard that hits 10 players, applying an aura to each of them that slows their movement (you talented Improved Blizzard for the slow, right?). For each player that gets the slow aura, we also have to send a message to all nearby players to notify them that the aura was applied. That means a total of 100 messages: 10 affected players, each generating 10 messages (one to the person who cast Blizzard, and nine to the other players affected by the Blizzard).

If there are 20 players present for the Blizzard to hit, that’s four times as many messages. If it hits 40 players, that’s 1600 messages. Doubling the players multiplies the work by four. Going from 10 players in an area to 100 players in an area takes us from 100 messages to 10,000 messages for that one spell. We already have powerful hardware in place, so this is a matter of understanding how many players we can support without deadlocking, which was a big goal for the stress test, and we got some very good data because of how many of you joined us.
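To make the math concrete, here's a tiny sketch of the message count (not server code, just the arithmetic from the example above):

```cpp
#include <cstdio>

// One aura-update message per (affected player, nearby player) pair.
long long auraMessages(long long affectedPlayers, long long nearbyPlayers) {
    return affectedPlayers * nearbyPlayers;
}

int main() {
    // In this example every affected player is also "nearby", so both factors are n.
    for (long long n : {10, 20, 40, 100})
        std::printf("%3lld players -> %6lld messages\n", n, auraMessages(n, n));
    // Prints: 10 -> 100, 20 -> 400, 40 -> 1600, 100 -> 10000
}
```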

Optimizing code

This is something we’ve been doing since 2004, and over the last few months we’ve been optimizing code with this specific event in mind. Here are some recent examples.

First, let’s consider the slow aura. What if we didn’t send all the aura update messages immediately? If you hit 100 people with Blizzard, do you really need to know at that exact second that each one of them has the slowing aura applied to them? The server knows right away, of course, so the aura has its effect and is slowing their movement, but if you didn’t see the aura on them for a second or two, would that be okay? If it means the server doesn’t crash, our answer is yes, so we allowed the aura messages to be delayed. Delaying has an added benefit: if another aura update occurs while the first is still waiting to be sent, we can combine those updates and send fewer messages overall. That results in fewer packets on the wire and less work for the server.
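Here's a rough sketch of that batching idea, assuming a per-target buffer that coalesces updates to the same aura and flushes on a timer (the class and function names are invented for illustration, not our actual implementation):

```cpp
// Sketch: instead of sending an aura-update packet immediately, queue it.
// If another update for the same aura arrives before the queue is flushed,
// the two are merged, so fewer packets are sent overall.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct AuraUpdate {
    uint32_t auraId;
    int32_t  remainingMs;
    bool     applied;   // applied or removed
};

class PendingAuraUpdates {
public:
    // Later updates for the same aura replace earlier ones (coalescing).
    void queue(uint64_t targetGuid, const AuraUpdate& update) {
        pending_[targetGuid][update.auraId] = update;
    }

    // Called on a timer (e.g. every second or two under heavy load)
    // instead of after every single aura change.
    void flush() {
        for (auto& [targetGuid, updates] : pending_) {
            std::vector<AuraUpdate> packetBody;
            packetBody.reserve(updates.size());
            for (auto& [auraId, update] : updates)
                packetBody.push_back(update);
            sendCombinedPacket(targetGuid, packetBody); // one packet per target
        }
        pending_.clear();
    }

private:
    void sendCombinedPacket(uint64_t /*targetGuid*/,
                            const std::vector<AuraUpdate>& /*updates*/) {
        // Placeholder for the real network send.
    }

    std::unordered_map<uint64_t, std::unordered_map<uint32_t, AuraUpdate>> pending_;
};
```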

Another code optimization we tested yesterday had to do with facing: the piece of information about which direction each player is pointed. What if we slow down or stop sending facing updates once the population reaches a certain threshold? It turns out that the cost is small: players appear to pop around a bit when moving. When an area is overcrowded, players already pop around a bit while moving, so this can be a huge performance win with no visible effect. In fact, I think players popped around less severely with this optimization than they would have without it.
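As a sketch of how such a throttle could be wired up (the threshold and rate below are made-up numbers, not the values we use):

```cpp
// Sketch: rate-limit facing-only updates once the local population
// crosses a threshold; below the threshold, behave as normal.
#include <cstddef>

constexpr std::size_t FACING_THROTTLE_POPULATION = 200; // hypothetical threshold
constexpr int MAX_FACING_UPDATES_PER_SECOND_CROWDED = 1; // hypothetical rate

bool shouldSendFacingUpdate(std::size_t playersInArea, int updatesSentThisSecond) {
    if (playersInArea < FACING_THROTTLE_POPULATION)
        return true;                                     // normal behaviour
    return updatesSentThisSecond < MAX_FACING_UPDATES_PER_SECOND_CROWDED;
}
```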

We also improved the performance of deciding who to send messages to. When thousands of players gather in an area, merely deciding who needs to know about your aura updates and which direction you’re moving is a lot of work, so we improved that as well.
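One common way to make that decision cheaper is a spatial grid, so finding nearby players only scans a handful of cells rather than everyone on the map. A minimal sketch of the idea (cell size and names are invented, and this isn't necessarily how our server does it):

```cpp
// Sketch: bucket players into grid cells so "who is near me?" only scans
// the 3x3 block of cells around a position instead of every player on the map.
#include <cmath>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

constexpr float CELL_SIZE = 66.0f; // hypothetical, roughly the visibility range

struct CellKey {
    int x, y;
    bool operator==(const CellKey& o) const { return x == o.x && y == o.y; }
};
struct CellKeyHash {
    std::size_t operator()(const CellKey& k) const {
        return std::hash<int64_t>()((int64_t(k.x) << 32) ^ uint32_t(k.y));
    }
};

class PlayerGrid {
public:
    void insert(uint64_t guid, float x, float y) {
        cells_[keyFor(x, y)].push_back(guid);
    }

    // Collect everyone in the 9 cells surrounding (x, y).
    std::vector<uint64_t> nearby(float x, float y) const {
        std::vector<uint64_t> out;
        CellKey c = keyFor(x, y);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy) {
                auto it = cells_.find({c.x + dx, c.y + dy});
                if (it != cells_.end())
                    out.insert(out.end(), it->second.begin(), it->second.end());
            }
        return out;
    }

private:
    static CellKey keyFor(float x, float y) {
        return {int(std::floor(x / CELL_SIZE)), int(std::floor(y / CELL_SIZE))};
    }
    std::unordered_map<CellKey, std::vector<uint64_t>, CellKeyHash> cells_;
};
```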

Moving players

Once we’ve addressed the previous issues, we have to consider this. When AQ first opened in 2006, we had GMs manually teleporting people out of the zone to allow the event to progress. Later, designers built automatic systems to teleport players out. Today, we have automatic teleports that perform very well, and we use them to control the zone so that it caps at a number of players that we think is still playable. Silithus will definitely be laggy, but we’d rather teleport players out than have it crash.
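As an illustration of the idea only (the cap, destination, and names below are invented, not our actual values), a zone cap with overflow teleports can be as simple as:

```cpp
// Sketch: keep Silithus at or below a population cap by teleporting
// overflow players to a fallback location outside the zone.
#include <cstdint>
#include <vector>

constexpr std::size_t SILITHUS_PLAYER_CAP = 1500; // hypothetical cap

struct Position { uint32_t mapId; float x, y, z; };
const Position FALLBACK_LOCATION = {1, 0.0f, 0.0f, 0.0f}; // illustrative placeholder

void teleportTo(uint64_t /*guid*/, const Position& /*where*/) { /* ... */ }

void enforceZoneCap(std::vector<uint64_t>& playersInZone) {
    while (playersInZone.size() > SILITHUS_PLAYER_CAP) {
        teleportTo(playersInZone.back(), FALLBACK_LOCATION); // move the overflow out
        playersInZone.pop_back();
    }
}
```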

This event spans a lot of southern Kalimdor, so being unable to get into Silithus doesn’t actually mean you missed the event. There are Anubisath and Silithid to kill in Tanaris, Thousand Needles, Feralas, and The Barrens for the entire 10 hours following the ringing of the gong.

Once more

Yesterday’s test gave us a pretty good idea of what these limits should be, and how we recover from a crash, but we’d like to know more, so we’re going to set up for another test on Thursday, June 25, at the same time (3:00 p.m. PDT). We’ll try to complete it more quickly this time, since there should hopefully be less investigation, and fewer disruptions.

I hope to see you all there.


P.S. Lethon says he really enjoyed your spirits with a side of mushrooms.

16 Likes

Isn’t it Emeriss that does the mushrooms?

Yes. Emeriss was there too. Lethon consumes your spirits.

That’s what happens when you don’t upgrade your servers in a decade and a half.

Hey Paz, thanks for conducting a stress test to see how well AQ will launch! Cheers :slight_smile:

And for the stress test put on players hitting arbitrary instance caps :wink:

Just messin’ with ya! But for real, more insight into what it’s meant to target and an explanation of how it’s designed to help mitigate in-game issues would be sincerely appreciated.

Thanks for all you and your team do!

Why not disable spell casting in the zone during the event or in an area around the gong?

Will this have any impact on people trying to ring the gong during the time window? (It won’t be me, but it would be sad if someone farmed all those bugs only to not be able to enter the zone.) I suppose more than likely people will crash or log out during that period, so everyone should have the opportunity to get in, I would think.

Wouldn’t that nuke PvP or PvE if something pulls?

It was an amazing test to experience, looking forward to the next one for sure.

Why would they want to communicate with their paying customers that are upset at a lack of communication? Gee, beats me.

If you’re going to quote and respond to me, do the whole thing.

Hopefully for the next test you talk with the Horde as well as the Alliance. I didn’t see anything from them, meanwhile I hear the Alliance kept getting non-stop info.

Wonder if they are able to put more power into that zone for the servers, since almost EVERYONE will be in that zone and the others will be mostly empty, not needing as much power to keep stable.

So please explain how private servers like the linked one below can handle thousands in an area with massive AoE going on and Blizzard servers can’t? It’s 2020; if you can’t handle a few thousand people in one zone with little lag, you’re doing it wrong. Maybe update your hardware a little. It does cost money (which Blizzard doesn’t seem to want to spend), but if you can’t pull off a smooth AQ opening event 15 years later with the technology we have today, that’s quite sad.

These guys had a decent server it seems. Try contacting them for advice.

Massive PvP on private server. Blizzard could learn from this

Don’t mess this up Blizzard you had 15 years to figure this out

You’re comparing apples with bananas.

8.2.2.1 Priority between players

On Nostalrius, not all players and maps were updated with the same priority. Battleground maps and then raid maps had top priority, so players did not experience any significant delay there. Within a single map, actions from fighting players were handled with higher priority than actions from non-fighting or even idle players.
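A rough sketch of what that prioritization could look like in code (the types and ordering helpers are invented; this is not actual Nostalrius source):

```cpp
// Sketch: update battleground maps first, then raids, then the open world;
// within a map, handle players in combat before idle players.
#include <algorithm>
#include <vector>

enum class MapKind { Battleground = 0, Raid = 1, World = 2 };

struct MapRef    { MapKind kind; int id; };
struct PlayerRef { bool inCombat; int guid; };

void sortMapsByPriority(std::vector<MapRef>& maps) {
    std::stable_sort(maps.begin(), maps.end(),
        [](const MapRef& a, const MapRef& b) { return a.kind < b.kind; });
}

void sortPlayersByPriority(std::vector<PlayerRef>& players) {
    // Fighting players' actions are processed before those of idle players.
    std::stable_sort(players.begin(), players.end(),
        [](const PlayerRef& a, const PlayerRef& b) {
            return a.inCombat && !b.inCombat;
        });
}
```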

8.2.2.2 Priority between actions

A priority has also been established between player actions, in the following order:

  • Top priority: Movements and spells.
  • Map-related actions (pet commands, loot, etc.).
  • Mail, auction house, etc.

This was achieved by handling packets a different number of times depending on their category.
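A minimal sketch of that scheme, assuming one queue per category drained at different rates per map update (the specific rates are invented):

```cpp
// Sketch: three packet queues drained at different rates. High-priority
// packets (movement, spells) are handled on every map update; lower
// priorities are only drained on some updates.
#include <cstdint>
#include <deque>

struct Packet { uint32_t opcode; /* payload omitted */ };

struct PacketQueues {
    std::deque<Packet> movementAndSpells; // top priority
    std::deque<Packet> mapActions;        // pet commands, loot, ...
    std::deque<Packet> mailAndAuction;    // lowest priority
};

void handle(const Packet& /*p*/) { /* dispatch to the real handler */ }

void drain(std::deque<Packet>& q) {
    while (!q.empty()) { handle(q.front()); q.pop_front(); }
}

void processPackets(PacketQueues& queues, uint64_t updateCounter) {
    drain(queues.movementAndSpells);          // every map update
    if (updateCounter % 2 == 0)               // every other update (illustrative)
        drain(queues.mapActions);
    if (updateCounter % 4 == 0)               // every fourth update (illustrative)
        drain(queues.mailAndAuction);
}
```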

8.2.2.3 Visibility distance reduction

The last option to increase performance was ultimately to reduce the visibility distance. This is something that has a really high impact on the gameplay, and should be avoided if possible.

The visibility distance for NPCs, game objects, and players was reduced on Nostalrius when a single map update took more than 400 ms (meaning more than 200 ms of delay for spells, as they are updated twice per map update).

In practice, this would only happen on continents with a really high number of online players. With the continent instantiation system, only specific overcrowded continent areas would be affected by this reduction.

However, Nostalrius limited the visibility distance reduction to 60 yards, as the game would no longer be playable at all below this limit.
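A sketch of that adjustment, using the 400 ms threshold and the 60-yard floor from the report (the default distance and the step size are invented):

```cpp
// Sketch: shrink the visibility distance when a map update runs long, and
// restore it when the map recovers, never going below the 60-yard floor.
#include <algorithm>
#include <chrono>

constexpr float DEFAULT_VISIBILITY_YARDS = 100.0f; // illustrative default
constexpr float MIN_VISIBILITY_YARDS     = 60.0f;  // floor from the report
constexpr auto  SLOW_UPDATE_THRESHOLD    = std::chrono::milliseconds(400);

float adjustVisibility(float current, std::chrono::milliseconds lastUpdateDuration) {
    if (lastUpdateDuration > SLOW_UPDATE_THRESHOLD)
        return std::max(MIN_VISIBILITY_YARDS, current - 10.0f); // step down (step invented)
    return std::min(DEFAULT_VISIBILITY_YARDS, current + 10.0f); // recover gradually
}
```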

8.2.3 Players in the same area

In some very special situations, the previous optimizations are no longer sufficient to reduce the delay. For example, when thousands of players meet in the same area.

“It might not have been the right idea to have everyone on our realms at the exact same place at the same time.” - Rob Pardo at BlizzCon 2013.

In these situations, every single player’s public action (movement, mana/health modification, spell cast, etc.) has to be broadcast to every other player in the area. For 100 players in the same area, that means 10,000 packets per second if every player performs one action per second on average.

As this situation was anticipated for capital city raids and special world events (world boss releases), a benchmark was created to figure out how well our emulator could handle it.

When we saw the results, we decided to work to allow these very special Vanilla events to happen on our realm without crashing the server. We identified the main bottlenecks in these situations:

  • SMSG_(COMPRESSED_)OBJECT_UPDATE: This packet is prepared and sent (compressed) for every player in the area individually, whenever one value of a player changes (health/mana regeneration, for example); see the sketch after this list.
  • When a player moves, the server has to send them all of the objects now visible from their new position.
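For the first bottleneck, one plausible mitigation (a sketch of the general technique, not necessarily what Nostalrius actually implemented) is to serialize and compress the payload once and reuse the same buffer for every recipient:

```cpp
// Sketch: build and compress the value-change payload one time, then send
// the identical buffer to every player in the area, instead of rebuilding
// and recompressing it per recipient.
#include <cstdint>
#include <cstring>
#include <vector>

struct ValueChange { uint64_t objectGuid; uint16_t field; uint32_t newValue; };

std::vector<uint8_t> serialize(const ValueChange& change) {
    std::vector<uint8_t> buf(sizeof(change));
    std::memcpy(buf.data(), &change, sizeof(change));
    return buf;
}

std::vector<uint8_t> compress(const std::vector<uint8_t>& raw) {
    return raw; // placeholder for the real deflate step
}

void sendRaw(uint64_t /*playerGuid*/, const std::vector<uint8_t>& /*packet*/) { /* ... */ }

void broadcastValueChange(const ValueChange& change,
                          const std::vector<uint64_t>& playersInArea) {
    const std::vector<uint8_t> packet = compress(serialize(change)); // built once
    for (uint64_t guid : playersInArea)
        sendRaw(guid, packet);                                       // reused per recipient
}
```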

The map update workflow was also changed to parallelize these computations whenever a specific area is overloaded, using a “Map-Reduce” paradigm. With this novel algorithm, Nostalrius is able to use all of the computational power available to deal with insanely populated areas.
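A minimal sketch of that map/reduce split, assuming the per-chunk work is a pure function so chunks can be processed on worker threads and merged afterwards (all names are invented):

```cpp
// Sketch: split the players of an overloaded area into chunks, build each
// chunk's outgoing updates on a worker thread (map), then merge the
// per-chunk results into one send list (reduce).
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>

struct OutgoingPacket { uint64_t targetGuid; std::vector<uint8_t> bytes; };

// Build the updates one chunk of players needs to receive. Kept free of
// shared mutable state so chunks can run in parallel without locking.
std::vector<OutgoingPacket> buildUpdatesForChunk(const std::vector<uint64_t>& chunk) {
    std::vector<OutgoingPacket> out;
    for (uint64_t guid : chunk)
        out.push_back({guid, {}}); // placeholder payload
    return out;
}

std::vector<OutgoingPacket> updateOverloadedArea(const std::vector<uint64_t>& players,
                                                 std::size_t chunkSize) {
    // Map phase: one async task per chunk of players.
    std::vector<std::future<std::vector<OutgoingPacket>>> tasks;
    for (std::size_t i = 0; i < players.size(); i += chunkSize) {
        std::vector<uint64_t> chunk(players.begin() + i,
                                    players.begin() + std::min(i + chunkSize, players.size()));
        tasks.push_back(std::async(std::launch::async, buildUpdatesForChunk, std::move(chunk)));
    }

    // Reduce phase: merge all chunk results into one send list.
    std::vector<OutgoingPacket> all;
    for (auto& t : tasks) {
        auto part = t.get();
        all.insert(all.end(), part.begin(), part.end());
    }
    return all;
}
```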

Blizzard likely has the best hardware that money can buy. The issue of the exponential increase in client/server calculations is a problem that no one has solved. It’s been explained by people on this forum and by Blizzard as well.