Okay, so part of the disconnect here is I am coming at it from an angle you don’t understand. I’ll try to give the short version as best as possible, but it’s not an exhaustive explanation.
What you seem to be missing is this, and forgive the comparison, but I hope it helps make the point: in this discussion, where you have asserted that you are showing a quantification of “skill,” you are in the same position as the person who comes here and tells us the game is rigged. As such, you have the burden of showing that your analysis meets the common assumptions about data, uses of data, normality, etc.
Saying, “well, maybe, but I don’t think so” is insufficient because it’s handwaving away alternative explanations, and that’s what you keep doing in this discussion, which you would never accept from someone who says “trust me, bro, it’s rigged.” Never.
So hold that thought and let’s go back to the numbers. The two main issues are that your observations aren’t independent and that you can’t describe outliers.
When you do any sort of statistical analysis you typically “clean” the data, a process that involves checking assumptions and describing your sample. My largest complaint, one that you can neither answer nor rule out from your data, is a question of influence. More on that in a second.
First, you are not considering the fact that you don’t have 366 independent observations. You have repeated measures of a finite pool of players that you are treating as independent observations for the sake of a mean (this is a no-no for the type of analysis you’re doing, btw). You don’t have 366 different players versus enrage warrior; you have 20-100+ individuals playing repeated games against the deck.
We agree that skill does in fact play a role, and it would stand to reason that some of those players are better with the deck than others. In addition, some play more games with the deck than others, and when those two facts are put together, the average (winrate) is affected.
BUT, you can’t quantify this in your data set, and you can’t rule out that outliers in your sample are causing the difference you see. (Curiously, this works in both directions - people playing deck X exceptionally poorly in bronze can make a deck look awful when it’s actually T2 or T3 at higher ranks.)
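To make that concrete, here is a rough sketch with made-up numbers. The player counts, game counts, and win rates are all assumptions I picked so the total comes to 366 games; none of this is your actual data. It just shows how a couple of high-volume, high-win-rate players can move a pooled win rate that treats every game as independent.

```python
# Hypothetical (games played, games won) per player. Not real data.
typical = [(10, 5)] * 30        # 30 players, 10 games each, 50% win rate
heavy_hitters = [(33, 25)] * 2  # 2 players, 33 games each, roughly 76% win rate
players = typical + heavy_hitters  # 366 games in total


def pooled_winrate(records):
    """Win rate when every game is treated as an independent observation."""
    games = sum(g for g, _ in records)
    wins = sum(w for _, w in records)
    return wins / games


def mean_player_winrate(records):
    """Unweighted average of each player's own win rate."""
    return sum(w / g for g, w in records) / len(records)


pooled_all = pooled_winrate(players)
pooled_typical = pooled_winrate(typical)
per_player = mean_player_winrate(players)

print(f"pooled, all games:          {pooled_all:.3f}")
print(f"pooled, typical players:    {pooled_typical:.3f}")
print(f"average of per-player rates: {per_player:.3f}")
```

In this toy setup, two players out of 32 are enough to separate the pooled number from what the typical player is doing, and the aggregate alone gives you no way to see that.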
So, influence. If you plot the distribution, it looks like a cluster, right? Each individual in your group has an average win rate and they are all different. There’s a median point and there are dots above and below that line. At D4-1, most of the dots are going to be concentrated near that median, because in sweaty tryhard land off the rank floor you don’t expect someone to have a 40% win rate and play 100 games. BUT there will be dots that are very, very far from that center of the cluster.
How far they are from that line in either direction, in standard units, dictates their influence on the results. As an example, in multivariate stats we would use a statistic to model this in a multidimensional distribution before any analysis is done, and cases that are outliers are typically dropped prior to any regression or other multivariate analysis. (Mahalanobis distance, for example, is a common one, and Wikipedia gives an adequate treatment of the concept if you’re curious for more depth on what I’m describing here. There should be some pictures if that helps you understand it more, too. I don’t mean this in any condescending or flippant way - pictures can help with the concept for people who are visual learners.)
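If it helps, here is a minimal sketch of that kind of screening in two dimensions, games played and win rate. The player values are invented for illustration, and the function name `mahalanobis_2d` is just mine. It computes each player's Mahalanobis distance from the centre of the cluster using the sample covariance, written out by hand for the 2x2 case so it needs nothing beyond the standard library.

```python
import math

# Hypothetical (games played, win rate) per player. Not real data.
players = [
    (12, 0.50), (15, 0.53), (10, 0.48), (18, 0.55),
    (14, 0.51), (11, 0.47), (16, 0.52), (13, 0.49),
    (60, 0.80),  # one high-volume, high-win-rate player
]


def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) times the inverse covariance times (dx, dy), 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


distances = mahalanobis_2d(players)
for (games, wr), d in zip(players, distances):
    print(f"games={games:3d} winrate={wr:.2f} distance={d:.2f}")
```

Where you draw the cutoff is a judgment call (a chi-square reference for the squared distance is a common choice), but the point stands: you need per-player rows to do any of this, and the aggregate doesn't have them.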
You can’t look at any of this in the aggregate data, so it can’t conclusively tell us whether there were five or six or maybe even 50 players with phenomenal win rates who pulled up the average at D4-1 and then stopped playing the deck. This data isn’t even collected, and it’s really important to establishing confidence in what you’re trying to do.
We don’t prove the hypothesis; we reject the alternatives. And you can’t reject the alternatives, because this data doesn’t measure skill.
If you could take the individual players instead of the individual games, you could actually measure skill by looking at how far each player’s win rate sits from the population mean, in standard deviation units. But your analysis doesn’t actually measure anything, because the data is wrong for what you are trying to prove.
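For what it’s worth, the per-player version would look something like this. Again, the names and win rates are hypothetical, and this is only a sketch of the idea, not a finished method.

```python
from statistics import mean, stdev

# Hypothetical per-player win rates. Not real data.
winrates = {
    "A": 0.50, "B": 0.52, "C": 0.48,
    "D": 0.51, "E": 0.49, "F": 0.70,
}


def player_z_scores(rates):
    """Each player's distance from the population mean, in standard units."""
    mu = mean(rates.values())
    sigma = stdev(rates.values())  # sample standard deviation
    return {name: (wr - mu) / sigma for name, wr in rates.items()}


z = player_z_scores(winrates)
for name, score in sorted(z.items(), key=lambda kv: kv[1], reverse=True):
    print(f"player {name}: z = {score:+.2f}")
```

A real version would also have to account for how many games each player logged, since a win rate over ten games is a lot noisier than one over a hundred.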