2. Data Wrangling
Previous: Introduction
Data for the project came from https://www.transfermarkt.us/. For each season from 2015/16 to 2019/20 in each of the four leagues (the Bundesliga, La Liga, the Premier League [EPL], and Major League Soccer [MLS]), the following statistics were obtained:
- Each team's season-end position
- Number of matches played in the season
- Number of wins each team got
- Number of draws each team got
- Number of losses each team got
- Number of goals each team scored
- Number of goals each team conceded
- Number of total points each team got at the end of the season
- Match results: whether each team won ("W"), drew ("D"), or lost ("L") each match in the season
All of the above statistics except the match results (the ninth item) were saved in one data set covering all four leagues. The match results were collectively called a form table; for each league, there was one data set containing the form tables for all five seasons (see the files whose names start with "form-" in the main branch of the repo).
Because team names in the original data were too long and contained characters unfamiliar to R (e.g., "á" and "é" in Spanish team names, or "ö" in German team names), they were modified before the data sets were imported into R.
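The exact renaming rules are not stated here, and the project itself works in R. As a rough illustration only, the Python sketch below shows one common way to replace accented letters with their plain ASCII counterparts and to shorten long names. The `SHORT_NAMES` mapping and both function names are hypothetical and are not taken from the repo.

```python
import unicodedata

# Hypothetical mapping from long names to short labels (illustration only).
SHORT_NAMES = {"Borussia Monchengladbach": "Gladbach"}


def to_ascii(name: str) -> str:
    """Drop diacritics: decompose each character, then discard combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_team_name(name: str) -> str:
    """Remove diacritics, trim whitespace, and apply a short label if one exists."""
    ascii_name = to_ascii(name).strip()
    return SHORT_NAMES.get(ascii_name, ascii_name)
```

For example, `to_ascii("Atlético Madrid")` gives `"Atletico Madrid"`. Letters without a Unicode decomposition are left unchanged by this approach, so any remaining special characters would need an explicit mapping.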
To create the final data set, the match results in the form tables were first converted into the cumulative number of points after each week of a season, with 3 points awarded for a win, 1 point for a draw, and 0 points for a loss. The new form tables were then merged with the league table data set and saved as a CSV file (see the R script in the repo).
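The conversion described above can be sketched as follows. This is a minimal Python illustration, not the repo's R script. The function names, the dictionary-based row format, and the choice to pad shorter seasons with `None` up to 38 weeks are assumptions made for the example.

```python
from itertools import accumulate

# Points awarded per result, as stated in the text.
POINTS = {"W": 3, "D": 1, "L": 0}


def cumulative_points(results):
    """Turn a sequence of "W"/"D"/"L" results into running point totals."""
    return list(accumulate(POINTS[r] for r in results))


def pad_weeks(points, n_weeks=38):
    """Pad a season shorter than n_weeks with None (no value for later weeks)."""
    return points + [None] * (n_weeks - len(points))


def build_row(league_row, results):
    """Combine one team's league-table entry with its week1..week38 columns."""
    weekly = pad_weeks(cumulative_points(results))
    row = dict(league_row)  # copy so the input is not modified
    row.update({f"week{i}": p for i, p in enumerate(weekly, start=1)})
    return row
```

For instance, `cumulative_points(["W", "D", "L", "W"])` returns `[3, 4, 4, 7]`. In a 34-match season, `build_row` leaves `week35` to `week38` as `None`.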
The final data set has 399 rows (observations) and 45 columns (variables). Each row describes one team in one season of one of the four leagues, from 2015/16 to 2019/20, and contains the following information:
Index | Column/Variable name | Information |
---|---|---|
1 | league | The league in which the team played |
2 | season | The season in which the team played |
3 | position | Season-end position; lower values mean higher positions |
4 | team | Team name |
5 | matches | Number of matches played: 38 for La Liga and the EPL, or 34 for the Bundesliga and the MLS |
6 | wins | Number of wins |
7 | points | Total number of points at the end of the season |
8 | week1 | Number of points after the first week of the season; possible values are 0, 1, or 3 |
9 | week2 | Number of points after the second week of the season |
. | ... | |
41 | week34 | Number of points after the 34th week of the season. This is the final week in the Bundesliga and the MLS, so these two leagues have no values from the next column on. |
. | ... | |
45 | week38 | Number of points after the 38th week of the season (final week in La Liga and the EPL) |
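
The column layout in the table above (7 descriptive columns plus `week1` to `week38`, giving 45 in total) lends itself to a simple consistency check. The Python sketch below is illustrative only. The `check_row` helper and the dictionary row format are hypothetical, and the final comparison assumes that `points` equals the cumulative total after the last week, which would not hold if a team had points deducted.

```python
FIXED_COLUMNS = ["league", "season", "position", "team", "matches", "wins", "points"]
WEEK_COLUMNS = [f"week{i}" for i in range(1, 39)]
ALL_COLUMNS = FIXED_COLUMNS + WEEK_COLUMNS  # 7 + 38 = 45 columns


def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    n = row["matches"]  # 34 or 38, depending on the league
    weekly = [row.get(col) for col in WEEK_COLUMNS]
    if any(v is None for v in weekly[:n]):
        problems.append("missing weekly value within the season")
    if any(v is not None for v in weekly[n:]):
        problems.append("weekly value after the final week")
    # Assumption: no points deductions, so the last weekly total equals points.
    if weekly[n - 1] is not None and weekly[n - 1] != row["points"]:
        problems.append("final weekly total differs from points")
    return problems
```

A row for a 34-match season passes when `week1` to `week34` are filled, `week35` to `week38` are empty, and `week34` matches `points`.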