
3. Exploratory Data Analysis

Tara Nguyen edited this page Dec 24, 2020 · 21 revisions

Previous: Data Wrangling

Numbers of Teams That Finished in the Top 4

From Season 2015/16 to Season 2019/20, a total of 104 teams played across four leagues:

  • 24 in the Bundesliga
  • 27 in La Liga
  • 24 in Major League Soccer (MLS)
  • 29 in the Premier League (EPL)

The following graph shows the number and percentage of teams that finished in the top 4 across the five seasons in each league. A small subset of teams has dominated each of the three European leagues.

  • In the Bundesliga, only 7 teams finished in the top 4. Bayern Munich won the league in all five seasons.
  • In La Liga, only 6 teams finished in the top 4. Barcelona and Real Madrid were the only 2 teams that won the league.
  • In Major League Soccer, 12 teams finished in the top 4, four of which won the league from Season 2015/16 to Season 2019/20.
  • In the Premier League, only 7 teams finished in the top 4, four of which won the league in the five seasons.

numteams.png
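The counts above can be reproduced from season-end tables with a few lines of Python. This is a minimal sketch, not the code used for the graph: the record layout `(league, season, position, team)`, the function name `top4_summary`, and the placeholder team names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical season-end records: (league, season, position, team).
# The names are placeholders, not real standings.
records = [
    ("League A", "2015/16", 1, "Team 1"),
    ("League A", "2015/16", 2, "Team 2"),
    ("League A", "2015/16", 5, "Team 3"),
    ("League A", "2015/16", 7, "Team 4"),
    ("League A", "2016/17", 1, "Team 1"),
    ("League A", "2016/17", 3, "Team 3"),
    ("League A", "2016/17", 6, "Team 2"),
    ("League A", "2016/17", 8, "Team 4"),
]


def top4_summary(records, cutoff=4):
    """Return {league: (teams that ever finished in the top `cutoff`,
    all teams that played, share of the former among the latter)}."""
    all_teams = defaultdict(set)
    top_teams = defaultdict(set)
    for league, _season, position, team in records:
        all_teams[league].add(team)
        if position <= cutoff:
            top_teams[league].add(team)
    return {
        league: (
            len(top_teams[league]),
            len(teams),
            len(top_teams[league]) / len(teams),
        )
        for league, teams in all_teams.items()
    }


summary = top4_summary(records)
print(summary["League A"])  # (3, 4, 0.75)
```

A team is counted once per league no matter how many seasons it spent in the top 4, which matches the way the bullet points above count teams.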

Before diving deeper into the statistics for the top 4, let's look at the statistics for all teams in each league. I focused on three statistics: points per game (computed by dividing the number of total points at the end of a season by the number of matches played that season), win proportion (computed by dividing the number of wins in a season by the number of matches played that season), and points after each week in a season.
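As a quick illustration of the first two statistics, here is a minimal Python sketch. The season record is hypothetical, and the 3-1-0 scoring (3 points for a win, 1 for a draw) is the standard league rule rather than something stated on this page.

```python
def points_per_game(total_points, matches_played):
    """Total points at the end of a season divided by matches played."""
    return total_points / matches_played


def win_proportion(wins, matches_played):
    """Number of wins in a season divided by matches played."""
    return wins / matches_played


# Hypothetical season: 20 wins, 8 draws, 6 losses.
wins, draws, losses = 20, 8, 6
matches = wins + draws + losses   # 34
points = 3 * wins + draws         # 68, assuming 3-1-0 scoring

ppg = points_per_game(points, matches)   # 2.0
winprop = win_proportion(wins, matches)  # about 0.588
```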

Points Per Game

Here is a graph of points per game (PPG) as a function of season-end position, averaged across seasons for each league. A more gradually declining line indicates that teams' performances are closer to one another, which in turn suggests that the league is more balanced. The MLS appears to be the most balanced, and the Bundesliga the least balanced. La Liga and the EPL are quite similar to each other in terms of competitive balance. PPG in the MLS is lower than that in the other leagues for the first 5 positions, but it starts to surpass PPG in the Bundesliga at Position #6 and surpasses PPG in all the other leagues from Position #8 onward. This suggests that it is more difficult for the top teams in the MLS to earn points than it is for the top teams in the European leagues, which in turn offers another piece of evidence for a higher level of competitive balance in the MLS.

ppg-position.png

To have a clearer picture of how close teams' performances were to one another, I computed the difference in PPG between each team and the team immediately below them in the season-end table. For example, for the team in the first place, this measure was computed by subtracting from their PPG the PPG of the team in the second place in the same season of the same league. The following graph shows the distribution of this measure for each league. A lower mean value indicates that teams' performances are closer to one another, which in turn suggests that the league is more balanced. Again, the MLS is the most balanced of all the leagues. This is true even if all the outliers (the dots in the graph) are removed.

diffppg.png
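The adjacent-position difference described above can be sketched as follows. The function name and the PPG values are hypothetical. The input is assumed to be one season of one league, ordered from first place down.

```python
def adjacent_ppg_gaps(ppg_by_position):
    """Given PPG values ordered from 1st place down, return each team's
    PPG minus the PPG of the team immediately below it.

    The last-placed team has no team below it, so the result is one
    element shorter than the input.
    """
    return [
        upper - lower
        for upper, lower in zip(ppg_by_position, ppg_by_position[1:])
    ]


# Hypothetical four-team table.
table = [2.5, 2.0, 1.75, 1.5]
gaps = adjacent_ppg_gaps(table)
print(gaps)  # [0.5, 0.25, 0.25]
```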

Next we have the differences in PPG plotted against season-end position, with lower values suggesting more competitive balance. Most values are under .15, which over a full season of 34 to 38 matches amounts to about 5 to 6 points (either 1 win and 2 draws or 2 wins). In general, the values are lower for the MLS than for the other leagues, with only one value in the MLS above .15. This confirms what we have learned so far from the analyses: the MLS has the highest competitive balance. Also note that, for all leagues, the values for the top 2 teams are higher than those for the teams in the middle positions, suggesting that it is easier to tell the top 2 teams apart from the rest than to tell the middle teams apart. This is more obvious for the European leagues (especially the Bundesliga) than for the MLS, which corroborates the claim that the MLS is the most balanced.

diffppg-position.png
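To see how a PPG gap translates into points over a season, multiply it by the number of matches played. The sketch below assumes season lengths of 34 and 38 matches, the usual lengths for 18-team and 20-team double round-robin leagues. The page itself does not state the number of matches per season, so treat these as assumptions.

```python
def ppg_gap_to_points(ppg_gap, matches_per_season):
    """Convert a difference in points per game into a difference in
    total points over a season of the given length."""
    return ppg_gap * matches_per_season


# Assumed season lengths (not stated on this page).
for matches in (34, 38):
    print(matches, round(ppg_gap_to_points(0.15, matches), 1))
# prints:
# 34 5.1
# 38 5.7
```

With 3 points for a win and 1 for a draw, 5 points corresponds to 1 win and 2 draws and 6 points to 2 wins.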

Win Proportions

The results for win proportions were quite similar to those for PPG. The following graph shows win proportions as a function of season-end position, averaged across seasons for each league. Similar to what we saw in the graph of PPG, here the line for the MLS has the shallowest slope, indicating that the performances of teams in this league, compared to those of teams in the other leagues, are the closest to one another. This in turn suggests that the MLS is the most balanced. Additionally, the win proportions in the MLS are lower than those in the other leagues for the first six positions, but are higher for the remaining positions. This suggests that the top teams in the MLS are less likely to win than are the top teams in the other leagues, which in turn is another piece of evidence for stronger competitive balance in the MLS.

winprop-position.png

Just like for PPG, for win proportions I computed the difference between every two teams that were adjacent in season-end positions, and then plotted the differences averaged across seasons for each league as a function of position (graph below). The graph tells a story similar to what we have seen from the other graphs (though it might be less clear here than it was from the graph of the differences in PPG): the MLS has the highest competitive balance. Most values are under .065, which over a full season of 34 to 38 matches amounts to a little more than 2 wins. Again, the values for the top 2 teams are higher than those for the teams in the middle positions, suggesting that it is easier to separate the top 2 teams from the rest than to separate the middle teams from one another. This is especially true for the Bundesliga and the Premier League, indicating a lack of balance between the top 2 teams and the rest in each of these leagues.

diffwinprop-position.png
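Averaging the adjacent differences across seasons, position by position, might look like the sketch below. It assumes every season table has the same number of positions and is ordered from first place down. The function name and the win proportions are hypothetical.

```python
from statistics import mean


def mean_gap_by_position(season_tables):
    """season_tables: one list of win proportions per season, each
    ordered from 1st place down and all of the same length.

    Returns, for every position except the last, the gap to the team
    immediately below, averaged across seasons.
    """
    per_season_gaps = [
        [upper - lower for upper, lower in zip(table, table[1:])]
        for table in season_tables
    ]
    # zip(*...) groups the gaps for the same position across seasons.
    return [mean(gaps) for gaps in zip(*per_season_gaps)]


# Two hypothetical seasons of a three-team league.
seasons = [
    [0.75, 0.5, 0.25],
    [0.5, 0.5, 0.25],
]
avg_gaps = mean_gap_by_position(seasons)
print(avg_gaps)  # [0.125, 0.25]
```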

Points After Each Week in a Season

In the following four graphs, the number of points after each week in a season is plotted for each season-end position, averaged across all five seasons of each league. The closer the lines representing the positions are to one another, the higher the competitive balance. We can easily see that the MLS is the most balanced. In the European leagues, there is more balance among teams in the middle positions than there is between the top 2-3 teams and the rest or between the bottom 3 teams and the rest. Furthermore, if we consider just the teams in the middle positions, the Bundesliga is the most unbalanced league.

weekpts-bundesliga.png

weekpts-laliga.png

weekpts-mls.png

weekpts-epl.png

Also in each graph are two vertical lines:

  • The brown line denotes the earliest time at which it would be possible to tell who the top 2 teams of the season would be.
  • The blueish line denotes the earliest time at which it would be possible to tell who would win the league.

The later either of those two lines occurs (i.e., the further right on the graph it is), the smaller the gap between the top 2 teams and the rest, which in turn suggests higher competitive balance. We can see that the Bundesliga and the EPL are quite similar to each other. For La Liga, the brown vertical line occurs at Week #16, only one week earlier than it does for the EPL. However, it only takes 3 weeks for the blueish line to occur for La Liga, indicating that the gap between the league champions and the other teams in the league usually manifests itself very early in a season. Notably, both of the vertical lines for the MLS occur later than those for the European leagues (though only by 1-2 weeks). Once again this suggests that the MLS is the most balanced.
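The page does not spell out how the two vertical lines were computed. One plausible reading, sketched here purely as an assumption, is the earliest week from which the eventual top k positions stay strictly ahead of every other position in cumulative points for the rest of the season. The function name and the points data are hypothetical.

```python
def earliest_separation_week(weekly_points, top_k):
    """weekly_points[p][w] is the cumulative points of season-end
    position p + 1 after week w + 1.

    Returns the earliest 1-based week from which the top `top_k`
    positions all have strictly more points than every other position
    in that week and in all later weeks, or None if they are not
    separated even after the final week.

    This rule is an assumption, not the definition used for the graphs.
    """
    n_weeks = len(weekly_points[0])
    earliest = None
    # Walk backward from the final week until the separation breaks.
    for week in range(n_weeks - 1, -1, -1):
        top_min = min(row[week] for row in weekly_points[:top_k])
        rest_max = max(row[week] for row in weekly_points[top_k:])
        if top_min > rest_max:
            earliest = week + 1
        else:
            break
    return earliest


# Hypothetical cumulative points for three positions over five weeks.
pts = [
    [3, 4, 7, 10, 13],  # 1st place
    [3, 6, 7, 8, 11],   # 2nd place
    [3, 4, 5, 6, 7],    # 3rd place
]
top2_week = earliest_separation_week(pts, top_k=2)      # 3
champion_week = earliest_separation_week(pts, top_k=1)  # 4
```

Under this rule, a later week means the leading positions took longer to pull away from the rest, which is the sense in which a later line suggests more balance.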

Points Per Game and Win Proportions for Teams That Finished in the Top 4

Let's now zoom in and look at only the top 4 positions in each season. Finishing in the top 4 is especially important for teams in the European leagues, because doing so will earn them a spot in the next season of the highly prestigious Champions League. The next two graphs show the PPG and win proportions for these positions in each league, averaged across all five seasons. A higher bar indicates that it is more demanding to finish in that position. For a particular league, if the bars have relatively similar heights, that suggests higher competitive balance among the top 4.

top4-ppg.png

top4-winprop.png

Overall, finishing in the top 4 is the most difficult in the EPL, followed by La Liga, the Bundesliga, and finally the MLS, where it is the easiest. The differences among the leagues are not big, though. To get a sense of the minimum PPG and minimum win proportion that a team must exceed to finish in the top 4, let's look at the statistics for the fifth position in each league (averaged across all five seasons):

| League | PPG | Win proportion |
|---|---|---|
| Bundesliga | 1.62 | .45 |
| La Liga | 1.63 | .46 |
| MLS | 1.60 | .45 |
| EPL | 1.81 | .54 |

Thus, the race for the top 4 is indeed more demanding in the EPL than in the other leagues.

Another thing that the two top-4 graphs show us is that, as suggested by the similar heights of the bars, the competitive balance among the top 4 is the highest in the MLS. In La Liga it is also rather balanced, though not as much as it is in the MLS. In the Bundesliga and the EPL, there is quite an obvious gap between the top 2 teams and the other two, and an even bigger gap between the league champions and the runners-up. This is consistent with what is shown in the graph of the differences in PPG and in that of the differences in win proportion.

Comparing the Top 4 With Teams Outside the Top 4

Finally, we have graphs comparing teams that finished in the top 4 with those that did not in terms of PPG and win proportion. In both graphs, the difference between top-4 and non-top-4 teams is the smallest in the MLS, again suggesting that this league is the most balanced. The other leagues are quite similar to one another.

top4vsnontop4-ppg.png

top4vsnontop4-winprop.png
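A comparison of this kind can be sketched by splitting a season-end table at the fourth position and comparing group means. The function name and the PPG values below are hypothetical, and the use of a simple mean per group is an assumption about how the groups were summarized.

```python
from statistics import mean


def top4_vs_rest(ppg_by_position, cutoff=4):
    """Given PPG values ordered from 1st place down, return
    (mean PPG of the top `cutoff`, mean PPG of the rest, difference)."""
    top = ppg_by_position[:cutoff]
    rest = ppg_by_position[cutoff:]
    top_mean, rest_mean = mean(top), mean(rest)
    return top_mean, rest_mean, top_mean - rest_mean


# Hypothetical eight-team table.
table = [2.5, 2.0, 2.0, 1.5, 1.5, 1.0, 1.0, 0.5]
result = top4_vs_rest(table)
print(result)  # (2.0, 1.0, 1.0)
```

A smaller difference means the top 4 are closer to the rest of the league, which is how the graphs above are read. The same function works for win proportions.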


Next: K-Means Clustering of Teams Based on Performances