4 Gather Data

Perhaps the most important part of any data analysis is the collection of the data. If you don’t have the data necessary to support the model you want to estimate, or the conclusions you want to draw, no amount of tinkering with the model or results can save you.

The data for this project was collected using web scraping via the rvest (Wickham, 2016c) and dplyr (Wickham & Francois, 2016) packages. The scores for each game in a season for a given team can be found on ESPN’s website. For example, ESPN lists all of Barcelona’s games on the club’s page. Given that results are available for any team covered by ESPN, the question becomes which teams to include in the analysis. I chose teams that participated in specific domestic leagues or in certain tournaments or competitions. The rationale for each selection is below.

4.1 Domestic league inclusion

Because one goal of the project was to predict the five major European domestic leagues, all teams from the English Premier League, German Bundesliga, French Ligue 1, Spanish La Liga, and Italian Serie A were included. I also included the second-tier leagues from these countries where they were available: the English League Championship, German 2. Bundesliga, French Ligue 2, and Italian Serie B. The other major goal was to predict the UEFA Champions League (UCL). Many of the UCL teams come from the major domestic leagues, but many do not. Thus, I also included teams from the Belgian Jupiler League, Danish SAS-Ligaen, Dutch Eredivisie, Portuguese Liga, Russian Premier League, Scottish Premiership, Swiss Super League, and Turkish Super Lig. These leagues were selected because they all had teams reach the knockout stage of the UCL, and thus would have sufficient crossover with the major leagues.

To get the URLs for each team, I’ll write a function called scrape_league() that uses the rvest package to pull hyperlinks off of the league pages. This function takes in a URL for a given league, and returns a data frame with the club names, the total goals scored by and against each club in league play, the points each club has accumulated, and the club’s URL. See Appendix B.1 for the full function. For a full description of how the rvest package works, see this webinar (Grolemund, 2016).
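As a rough illustration of the link-pulling idea, the sketch below parses a small inline HTML table and extracts each club’s name and hyperlink. It is not the scrape_league() function from Appendix B.1: the helper name extract_club_links(), the HTML snippet, the club names, and the "td a" selector are all assumptions made for the example, and ESPN’s actual markup is likely to differ.

```r
library(rvest)

# Hypothetical stand-in for a league table page; the real ESPN markup differs.
page <- read_html('
  <table>
    <tr><td><a href="/club/alpha/1/index">Alpha FC</a></td></tr>
    <tr><td><a href="/club/beta/2/index">Beta United</a></td></tr>
  </table>')

# Hypothetical helper (not the scrape_league() from Appendix B.1):
# find every link inside a table cell and return its text and target.
extract_club_links <- function(page) {
  links <- html_nodes(page, "td a")
  data.frame(
    club = html_text(links),
    club_url = html_attr(links, "href"),
    stringsAsFactors = FALSE
  )
}

club_links <- extract_club_links(page)
```

The same pattern (select nodes, then pull text and attributes) extends to other columns of a table, such as goals or points.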

4.2 Non-domestic competition inclusion

Although many of the teams participating in the UCL come from the leagues outlined above, not all do. A few come from smaller leagues that are not covered by ESPN. Thus, all teams that participated in the group stage of the UCL were included. Additionally, teams that didn’t qualify for the UCL, and some teams that qualified but didn’t make it out of the group stage, play in the second-tier Europa League tournament. Because this tournament includes more overlap between the leagues, all teams participating in the Europa League were also included. Finally, the International Champions Cup is a relatively new event that pairs top teams from Europe against each other around the world. All participants in this event were also included.

The webpages for these competitions are formatted slightly differently than those for the domestic leagues, so I will need a slightly different function to scrape these team URLs, scrape_major_cup() (see Appendix B.2). Unlike the league scraper, this function returns only the club name and the club’s URL.

4.3 Domestic competition inclusion

Finally, in addition to domestic leagues and international club competitions, European teams also compete in domestic tournaments. These competitions were also included to increase the number of data points for the top teams, as well as to increase the overlap between the first and second tier leagues of the top countries. All participants were included from Spain’s Copa del Rey and Spanish Super Cup, Italy’s Coppa Italia, France’s Coupe de France and Coupe de la Ligue, Germany’s DFB Pokal, and England’s FA Cup and League Cup.

These webpages are also formatted differently, so I have one final function to get the team URLs, scrape_dom_cup() (see Appendix B.3). As with the scraper for the international competitions, this scraper also returns the club name and club URL.

4.4 Collect club websites

Now that the criteria for which teams will be included have been defined, the purrr package (Wickham, 2016b) can be used to map each function to the corresponding league or competition URL. For more information on the purrr package see R for Data Science (Wickham & Grolemund, 2016). First I’ll create a list of URLs for each type of scraper.

leagues <- list(
  belgium = "http://www.espnfc.us/belgian-jupiler-league/6/table",
  denmark = "http://www.espnfc.us/danish-sas-ligaen/7/table",
  england = "http://www.espnfc.us/english-premier-league/23/table",
  england2 = "http://www.espnfc.us/english-league-championship/24/table",
  france = "http://www.espnfc.us/french-ligue-1/9/table",
  france2 = "http://www.espnfc.us/french-ligue-2/96/table",
  germany = "http://www.espnfc.us/german-bundesliga/10/table",
  germany2 = "http://www.espnfc.us/german-2-bundesliga/97/table",
  italy = "http://www.espnfc.us/italian-serie-a/12/table",
  italy2 = "http://www.espnfc.us/italian-serie-b/99/table",
  netherlands = "http://www.espnfc.us/dutch-eredivisie/11/table",
  portugal = "http://www.espnfc.us/portuguese-liga/14/table",
  russia = "http://www.espnfc.us/russian-premier-league/106/table",
  scotland = "http://www.espnfc.us/scottish-premiership/45/table",
  spain = "http://www.espnfc.us/spanish-primera-division/15/table",
  switzerland = "http://www.espnfc.us/swiss-super-league/17/table",
  turkey = "http://www.espnfc.us/turkish-super-lig/18/table"
)

major_cups <- list(
  champions_league = "http://www.espnfc.us/uefa-champions-league/2/table",
  europa_league = "http://www.espnfc.us/uefa-europa-league/2310/table",
  icc = "http://www.espnfc.us/international-champions-cup/2326/table"
)

domestic_cups <- list(
  copa_del_rey = "http://www.espnfc.us/spanish-copa-del-rey/80/statistics/fairplay",
  coppa_italia = "http://www.espnfc.us/italian-coppa-italia/2192/statistics/fairplay",
  coupe_de_france = "http://www.espnfc.us/french-coupe-de-france/182/statistics/fairplay",
  coupe_de_la_ligue = "http://www.espnfc.us/french-coupe-de-la-ligue/159/statistics/fairplay",
  dfb_pokal = "http://www.espnfc.us/german-dfb-pokal/2061/statistics/fairplay",
  efl_cup = "http://www.espnfc.us/efl-cup/41/statistics/fairplay",
  fa_cup = "http://www.espnfc.us/english-fa-cup/40/statistics/fairplay",
  spanish_super_cup = "http://www.espnfc.us/spanish-super-cup/431/statistics/fairplay"
)

I can then map the functions to the defined URLs. Because there are many teams to scrape, it’s possible to exceed ESPN’s server request limit. Therefore, I’ve defined a safe version of the read_html() function that makes it possible to check whether each page was read correctly.

library(rvest)
library(purrr)
library(dplyr)

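# Wrap read_html() so that a failed request returns an error object
# instead of stopping the whole scrape
safe_read_html <- safely(read_html)

# Apply each scraper to its list of URLs and row-bind the results
league_urls <- map_df(.x = leagues, .f = scrape_league)
major_cup_urls <- map_df(.x = major_cups, .f = scrape_major_cup)
domestic_cup_urls <- map_df(.x = domestic_cups, .f = scrape_dom_cup)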
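To show what a safely() wrapper returns, here is a minimal offline sketch. The function flaky_read() and the example.com URLs are hypothetical stand-ins for read_html() and the ESPN pages, used so the example runs without network access.

```r
library(purrr)

# Hypothetical reader that fails for one input, standing in for read_html().
flaky_read <- function(url) {
  if (grepl("bad", url)) stop("request failed: ", url)
  paste("contents of", url)
}

safe_read <- safely(flaky_read)

urls <- c(ok = "http://example.com/ok", bad = "http://example.com/bad")
results <- map(urls, safe_read)

# Each element is a list with `result` and `error`; one of the two is NULL.
failed <- map_lgl(results, ~ !is.null(.x$error))

# Keep only the pages that were read successfully.
pages <- map(results[!failed], "result")
```

Checking the error element after the fact makes it possible to retry only the URLs that failed, rather than restarting the entire scrape.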

The next step is to create a data frame of all the clubs and their URLs. Many teams participate in multiple competitions that were included, so only one instance will be kept.

url_lookup <- list(
  select(league_urls, club, club_url),
  major_cup_urls,
  domestic_cup_urls
) %>%
  bind_rows() %>%
  filter(!is.na(club_url)) %>%
  unique()

full_urls <- league_urls %>%
  select(-club_url) %>%
  full_join(url_lookup, by = "club") %>%
  filter(!is.na(club_url))

DT::datatable(full_urls, options = list(pageLength = 5, scrollX = TRUE),
  rownames = FALSE)
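The effect of this de-duplication and join can be seen on a toy example. The data frames below are invented for illustration (hypothetical clubs and URLs); only the sequence of dplyr verbs mirrors the code above.

```r
library(dplyr)

# Hypothetical league table: one club also appears in a cup competition.
league_toy <- tibble(
  club = c("Alpha FC", "Beta United"),
  points = c(10, 7),
  club_url = c("/club/alpha", "/club/beta")
)
cup_toy <- tibble(
  club = c("Alpha FC", "Gamma Town"),
  club_url = c("/club/alpha", "/club/gamma")
)

# Stack the club/URL pairs and keep one row per distinct pair.
lookup_toy <- bind_rows(select(league_toy, club, club_url), cup_toy) %>%
  filter(!is.na(club_url)) %>%
  unique()

# Clubs that appear only in a cup get NA for the league statistics.
full_toy <- league_toy %>%
  select(-club_url) %>%
  full_join(lookup_toy, by = "club")
```

Here Alpha FC appears once in the lookup despite playing in both competitions, and Gamma Town is kept with a missing points value because it has no league table entry.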

4.5 Scrape game data

Now that all of the team URLs have been collected, I can scrape the game data from each team’s page on ESPN. To do this, I’ll define a new scraper function, scrape_team() (see Appendix B.4). This function is then mapped to all of the club URLs, using the same process that was used for the league URLs. Note that I also use the lubridate package (Grolemund, Spinu, & Wickham, 2016) to format dates within the scrape_team() function.

library(lubridate)

full_data <- map2_df(.x = full_urls$club_url, .y = full_urls$club,
  .f = scrape_team)
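For readers unfamiliar with map2_df(), the sketch below shows the pattern on invented data. The function describe_team() is a hypothetical stand-in for scrape_team(): it does no scraping and simply returns a one-row data frame with a nested team_data column, which is an assumption about the general shape of the real output.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for scrape_team(): takes a URL and a club name and
# returns one row per club, with that club's games nested in a list column.
describe_team <- function(url, club) {
  tibble(
    club = club,
    club_url = url,
    team_data = list(tibble(game = 1:2, team = club))
  )
}

toy_urls <- tibble(
  club = c("Alpha FC", "Beta United"),
  club_url = c("/club/alpha", "/club/beta")
)

# .x and .y are walked in parallel; the one-row results are row-bound.
toy_data <- map2_df(.x = toy_urls$club_url, .y = toy_urls$club,
  .f = describe_team)

# The nested game tables can then be stacked into one data frame.
toy_games <- bind_rows(toy_data$team_data)
```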

Finally, there is just a little cleaning to be done. I will remove duplicate games, and replace the abbreviations that ESPN uses in their scores with the full club name.

# One row per club, without the nested game data
team_lookup <- select(full_data, -team_data)

full_data <- bind_rows(full_data$team_data) %>%
  unique() %>%  # remove duplicate games
  arrange(date, home) %>%
  # look up the full club name for the home and away abbreviations
  left_join(team_lookup, by = c("home" = "abbrev")) %>%
  rename(home_club = club) %>%
  left_join(team_lookup, by = c("away" = "abbrev")) %>%
  rename(away_club = club) %>%
  # keep the original value when no club name was matched
  mutate(
    real_home = ifelse(is.na(home_club), home, home_club),
    real_away = ifelse(is.na(away_club), away, away_club),
    home = real_home,
    away = real_away
  ) %>%
  select(-(home_club:real_away)) %>%
  # home_game is 0 for International Champions Cup games and 1 otherwise
  mutate(home_game = ifelse(competition %in% c("Champions Cup"), 0, 1)) %>%
  # drop games dated before today that have no recorded score
  filter(!(date < Sys.Date() & is.na(home_goals))) %>%
  filter(date > ymd("2016-03-01")) %>%
  rename(h_goals = home_goals, a_goals = away_goals)

DT::datatable(full_data, options = list(pageLength = 5, scrollX = TRUE))
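The abbreviation replacement can be illustrated on a small invented example. The games and lookup tables below are hypothetical, and coalesce() is used here in place of the ifelse(is.na(...)) construction above as a more compact way of expressing the same fallback.

```r
library(dplyr)

# Hypothetical games recorded with abbreviations; "XYZ" has no lookup entry.
games <- tibble(
  home = c("ALP", "BET", "XYZ"),
  away = c("BET", "ALP", "ALP")
)
lookup <- tibble(
  abbrev = c("ALP", "BET"),
  club = c("Alpha FC", "Beta United")
)

games_named <- games %>%
  left_join(lookup, by = c("home" = "abbrev")) %>%
  rename(home_club = club) %>%
  left_join(lookup, by = c("away" = "abbrev")) %>%
  rename(away_club = club) %>%
  # use the full club name when available, otherwise keep the abbreviation
  mutate(
    home = coalesce(home_club, home),
    away = coalesce(away_club, away)
  ) %>%
  select(-home_club, -away_club)
```

The unmatched abbreviation is left untouched rather than becoming missing, so games against clubs outside the lookup are still retained.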