Here’s my translation of all this
Rasputin:
With dueling narratives from statistically inept dullards squawking across social media, I thought I’d debunk a common misconception: small samples of games are not large enough to support either conclusion with statistical significance.
We can calculate this quite simply. Many factors can affect the outcome of a match, ranging from skill to luck to how the players are feeling on that particular day. Any given match can involve any combination of these factors, but we REALLY want to measure balance, so how do we show a result was caused by imbalance and not by these other factors? It’s very simple: for a sample of games, there must be a low probability that the outcome could occur if we assume the game is balanced. The cutoff for “low” is called the significance level, and here it is set to 1% (5% is also a common choice). If there is a <=1% chance that the outcome could occur under the balanced-game assumption, then we reject that assumption.
To calculate this we use a binomial probability calculator:
https://stattrek.com/online-calculator/binomial.aspx
We punch in 0.5 for the probability of success, meaning each player should win 50% of games (this is our balanced-game assumption). We punch in 5 trials, since Dark vs Trap was 4-1, and 4 successes for the games Dark won. Then we click “calculate”:
https://i.imgur.com/Wv2PMz3.png
We now look at the “Cumulative probability: P(X >= x)” result, which tells us the odds that a result at least this lopsided could occur. Those odds are 18.75%. So this outcome is quite likely under a balanced-game scenario and nowhere near our 1% significance threshold, so we can definitively say these results are meaningless.
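The same tail probability can be checked without the online calculator. A minimal sketch using only the standard library (the function name `binom_tail` is mine, not from the quoted post):

```python
from math import comb

def binom_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k wins in n games."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Dark vs Trap: at least 4 wins out of 5 games under the balanced-game assumption
print(binom_tail(5, 4))  # 0.1875, i.e. 18.75%
```

This matches the calculator’s 18.75%: of the 32 equally likely 5-game outcomes, 6 have Dark winning 4 or more.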
Just for argument’s sake, what would happen if we bumped it up to 100 games but kept the win rate the same? The probability drops below 0.0001%, which is well below our significance threshold. Hopefully we can now appreciate the importance of sample size: to reach statistical significance, you need a large sample.
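The 100-game version of the same calculation, again with just the standard library:

```python
from math import comb

# P(X >= 80) over 100 games under p = 0.5: same 80% win rate, much larger sample.
# With p = 0.5 every game sequence is equally likely, so we can just count them.
p_tail = sum(comb(100, i) for i in range(80, 101)) / 2**100
print(p_tail < 1e-6)  # True: below 0.0001%, far under the 1% threshold
```

The same 80% win rate goes from “easily explained by chance” at 5 games to “essentially impossible under balance” at 100 games, which is the whole point about sample size.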
This first half means absolutely nothing
Rasputin:
Well, I happen to have a large sample analyzed with far more sophisticated techniques. The algorithm measures players’ performance in symmetrical (mirror) matchups as a baseline for their skill, since mirror matchups are unaffected by balance and correlate highly with a player’s true skill level. This baseline can then be used to measure trends across all players of a race. Cumulatively this covers all recorded pro games, i.e. tens of thousands of games, and the statistical significance is exceedingly high:
https://i.imgur.com/q5uzclg.png
The slopes of the lines are what matter. They measure how a race does in a matchup relative to its symmetrical-matchup performance:
TvP: 0.883
TvZ: 0.905
PvT: 0.907
PvZ: 0.902
ZvP: 0.887
ZvT: 0.888
As you can see, Protoss does amazingly well in both PvZ and PvT, performing 2.4% better than Terran in PvT and 1.5% better than Zerg in PvZ. Interestingly, Terran does 1.7% better than Zerg in ZvT. It’s important to note that these are percentages of elo ranking, so the imbalance scales with skill. What is the total expected win rate? Well, we have to integrate the elo win-rate function over all skill levels:
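For what it’s worth, that final step (integrating a win-rate function over skill levels) could look roughly like this. Everything here is invented for illustration, since the quoted post never specifies its win-rate function or its distribution of player skill; I assume the standard Elo expected-score formula and a normal skill distribution:

```python
from math import exp, sqrt, pi

def winrate(elo_diff: float) -> float:
    """Standard Elo expected score for a rating advantage of elo_diff."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def expected_winrate(advantage_frac: float, mu=2000.0, sigma=400.0, step=10.0) -> float:
    """Numerically average the win rate over a hypothetical skill distribution,
    with a race advantage that scales with elo (advantage_frac * elo rating),
    mirroring the post's claim that the imbalance scales with skill."""
    total, weight, elo = 0.0, 0.0, mu - 4 * sigma
    while elo <= mu + 4 * sigma:
        w = normal_pdf(elo, mu, sigma) * step
        total += w * winrate(advantage_frac * elo)
        weight += w
        elo += step
    return total / weight

print(expected_winrate(0.0))    # 0.5 with no advantage, as a sanity check
print(expected_winrate(0.024))  # above 0.5 for the claimed 2.4% advantage
```

Again, this only shows the mechanical shape of such an integration; without the actual methodology there is no way to reproduce the quoted numbers.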
This section means very little. I think I’ve seen it in some past post; the methodology is unclear, the data set is obscure, and the model looks like nonsense strung together. It also has absolutely nothing to do with what’s going on in the first half.
The real conclusion is that even if we drop all our biases about the state of balance in the game, this is a very poorly written “paper.” The fact that you strung two unrelated things together and acted like you’d formed some sort of legitimate conclusion made me burst out laughing, then quiet down real quick, because I know you’re actually serious about this.