2. Data Wrangling

Tara Nguyen edited this page Dec 18, 2020 · 6 revisions

Previous: Introduction

Data Sets Before Import Into R

Data for the project came from https://www.transfermarkt.us/. For each season (from 2015/16 to 2019/20) of each of the four leagues (the Bundesliga, La Liga, the Premier League [EPL], and Major League Soccer [MLS]), the following statistics were obtained:

  1. Each team's season-end position
  2. Number of matches played in the season
  3. Number of wins each team got
  4. Number of draws each team got
  5. Number of losses each team got
  6. Number of goals each team scored
  7. Number of goals each team conceded
  8. Number of total points each team got at the end of the season
  9. Match results: whether each team won ("W"), drew ("D"), or lost ("L") each match in the season

All of the above statistics except #9 for all four leagues were saved in one data set. The statistics in #9 were collectively called a form table; for each league, there was one data set containing form tables for all five seasons (see the files whose names start with "form-" in the main branch of the repo).

Because team names in the original data were too long and contained characters unfamiliar to R (e.g., "á" and "é" in Spanish team names, or "ö" in German team names), they were modified before the data sets were imported into R.
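
The page does not say how the team names were changed, so the following is only a sketch of one possible way to strip accented characters programmatically. It is written in Python for illustration and is not the project's R workflow; the function name `to_ascii` and the sample names are assumptions, and the sketch does not shorten names.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Decompose accented characters and drop the non-ASCII combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Illustrative names only; the actual renaming used in the project is not documented here.
examples = ["Málaga CF", "Deportivo Alavés", "1. FC Köln"]
cleaned = [to_ascii(n) for n in examples]
print(cleaned)
```

Characters without an ASCII decomposition (for example "ß") would simply be dropped by this approach, so a manual check of the result would still be needed.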

The Final Data Set

To create the final data set, the match results in the form tables were first converted into the cumulative number of points after each week in a season, with 3 points awarded for a win, 1 point for a draw, and 0 points for a loss. The new form tables were then merged with the league table data set and saved as a CSV file (see the R script in the repo).
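
The conversion from match results to cumulative points can be sketched as follows. This is a Python illustration of the arithmetic only, not the R script from the repo; the names `POINTS` and `cumulative_points` and the sample form are assumptions.

```python
from itertools import accumulate

# Points per result, as described above: win = 3, draw = 1, loss = 0.
POINTS = {"W": 3, "D": 1, "L": 0}


def cumulative_points(results):
    """Return the running points total after each week for one team's season."""
    return list(accumulate(POINTS[r] for r in results))


# Made-up form for the first four weeks of a season.
form = ["W", "D", "L", "W"]
print(cumulative_points(form))  # [3, 4, 4, 7]
```

The last value of the running total should equal the team's season-end points, which gives a simple consistency check after merging with the league table data.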

The final data set has 399 rows (observations) and 45 columns (variables). Each row describes one team in one season of one of the four leagues, from 2015/16 to 2019/20, and contains the following information:

| Index | Column/Variable name | Information |
|-------|----------------------|-------------|
| 1 | league | The league in which the team played |
| 2 | season | The season in which the team played |
| 3 | position | Season-end position; lower values mean a higher position |
| 4 | team | Team name |
| 5 | matches | Number of matches played; 38 for La Liga and the EPL, or 34 for the Bundesliga and the MLS |
| 6 | wins | Number of wins |
| 7 | points | Number of total points at the end of the season |
| 8 | week1 | Number of points after the first week in the season; values 0, 1, or 3 |
| 9 | week2 | Number of points after the second week in the season |
| ... | ... | ... |
| 41 | week34 | Number of points after the 34th week in the season. This is the final week in the Bundesliga and the MLS; therefore there is no value for these two leagues from the next column on. |
| ... | ... | ... |
| 45 | week38 | Number of points after the 38th week in the season (final week in La Liga and the EPL) |
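
The column layout above (7 descriptive columns plus 38 weekly columns, with nothing recorded after week 34 for the 34-match leagues) can be sketched as below. This is a Python illustration under stated assumptions, not the repo's R code: `build_row`, the `"2015/16"` season string, the team name, and the use of `None` for "no value" are all hypothetical choices.

```python
MAX_WEEKS = 38
ID_COLUMNS = ["league", "season", "position", "team", "matches", "wins", "points"]
WEEK_COLUMNS = [f"week{i}" for i in range(1, MAX_WEEKS + 1)]


def build_row(info, weekly_points):
    """Combine league-table fields with weekly totals, padding short seasons."""
    # Seasons with fewer than 38 weeks get None (assumed stand-in for "no value").
    padded = list(weekly_points) + [None] * (MAX_WEEKS - len(weekly_points))
    row = {col: info[col] for col in ID_COLUMNS}
    row.update(zip(WEEK_COLUMNS, padded))
    return row


# Made-up 34-week season: 20 wins followed by 14 losses, ending on 60 points.
info = {"league": "Bundesliga", "season": "2015/16", "position": 1,
        "team": "Team A", "matches": 34, "wins": 20, "points": 60}
weekly = [3 * min(week, 20) for week in range(1, 35)]

row = build_row(info, weekly)
print(len(row))                      # 45
print(row["week34"], row["week35"])  # 60 None
```

Padding every row to 38 weekly columns keeps all four leagues in one rectangular table, at the cost of empty cells for the Bundesliga and the MLS.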

Next: Exploratory Data Analysis