Successful PTR Stress Test -- Thank you!

Pazorax-1880 · June 20, 2020, 12:58am

Hey Everybody!

I want to personally thank everyone who came out to help with our stress test yesterday. We had the whole WoW Classic team in there, and we really enjoyed our interactions with all of you.

Here are a few more details about the test.

What we saw

A lot of people asked during the test if performance is going to be that bad in the live game, while some joked that they thought it was ready, and we should ship it. Or maybe they weren’t joking? After all, the experience in our stress test yesterday was pretty similar to the original AQ gate opening in 2006. We had a ton of lag, some server crashes, and when players gave up and the population dwindled, the event finally completed. We are planning to do better than that, but we won’t be able to eliminate the lag entirely.

I especially want to thank all of the players who were stuck at the end of flight paths, because we found and fixed the issue with that. As with many issues, once we found the root cause it was easy to fix and turned out to be contributing to other problems, too. So to all of you who saw a geometry salad at the end of your flight: thank you; you made this better for everybody.

If you hung around until about 5:00 p.m. PDT, our test conditions changed to a point where the lag you were seeing was close to what we expect to see in the live game. We have no choice but to make a trade-off between server lag and population density. The more people, the more lag, and eventually, with too many players in the same area, the lag gets so bad that the server thinks it’s deadlocked (a fancy computer science word for “stuck and can’t recover”), and it restarts.

A restart because the server thinks it’s deadlocked is a crash, but it presents a special challenge. Other kinds of crashes happen when a program is trying to do something really bad, so we find the bad thing, and fix it, and that’s it. Deadlocks are more challenging because there’s no single problem, just a lot of jobs getting further and further behind. There are still improvements that we can make to address this.

Population density

Our first problem is one of exponential scaling. Imagine a Blizzard that hits 10 players, which applies an aura to each of them, slowing their movement (you talented Improved Blizzard for the slow, right?). For each player that gets the slow aura, we also have to send a message to all nearby players to notify them that the aura was applied. That means a total of 100 messages. 10 affected players sending 10 messages each (one to the person who cast Blizzard, and nine to the other players affected by the Blizzard.

If there are 20 players present for the Blizzard to hit, that’s four times as many messages. If it hits 40 players, that’s 1600 messages. Doubling the players multiplies the work by four. Going from 10 players in an area to 100 players in an area takes us from 100 messages to 10,000 messages for that one spell. We already have powerful hardware in place, so this is a matter of understanding how many players we can support without deadlocking, which was a big goal for the stress test, and we got some very good data because of how many of you joined us.

Optimizing code

This is something we’ve been doing since 2004, and over the last few months we’ve been optimizing code with this specific event in mind. Here are some recent examples.

First, let’s consider the slow aura. What if we didn’t send all the aura update messages immediately? If you hit 100 people with Blizzard, do you really need to know at that exact second that each one of them has the slowing aura applied to them? The server knows right away, of course, so the aura has its effect, and is slowing their movement, but if you didn’t see the aura on them for a second or two, would that be okay? If it means the server doesn’t crash, our answer is yes, so we allowed the aura messages to be delayed. This also has an additional benefit, because if another aura update occurs while this first is waiting to be sent, we can combine those messages and send fewer messages overall. That results in fewer packets on the wire, and less work for the server.

Another code optimization we tested yesterday had to do with facing. That’s a piece of information about which direction each player is pointed. What if we slow or stop updating facing messages once the population reaches a certain threshold? It turns out that the cost is small: players appear to pop around a bit when moving. When an area is overcrowded, players already pop around a bit while moving, so this can be a huge performance win that has no visible effect. In fact, I think I saw players popping around less severely with this optimization than I would have seen if it wasn’t present.

We also improved the performance of deciding who to send messages to. When thousands of players gather in an area, merely deciding who needs to know about your aura updates and which direction you’re moving is a lot of work, so we improved that as well.

Moving players

Once we’ve addressed the previous issues, we have to consider this. When AQ first opened in 2006, we had GMs manually teleporting people out of the zone to allow the event to progress. Later, designers built automatic systems to teleport players out. Today, we have automatic teleports that perform very well, and we use them to control the zone so that it caps at a number of players that we think is still playable. Silithus will definitely be laggy, but we’d rather teleport players out than have it crash.

This event spans a lot of southern Kalimdor, so being unable to get into Silithus doesn’t actually mean you missed the event. There are Anubisath and Silithid to kill in Tanaris, Thousand Needles, Ferelas, and The Barrens for the entire 10 hours following the ringing of the gong.

Once more

Yesterday’s test gave us a pretty good idea of what these limits should be, and how we recover from a crash, but we’d like to know more, so we’re going to set up for another test on Thursday, June 25, at the same time (3:00 p.m. PDT). We’ll try to complete it more quickly this time, since there should hopefully be less investigation, and fewer disruptions.

I hope to see you all there.

P.S. Lethon says he really enjoyed your spirits with a side of mushrooms.