Background

This work was partly motivated by a post I read on another interesting data science blog; its combination of network graphs and football seemed both accessible and visually appealing. With the proliferation of social media and advances in technology, graph/network-based approaches are becoming more common. Graph theory has been employed to study disease propagation, elephant rest sites, relationships in The Simpsons and even MMA finishes, so I wanted to try it out for myself.

I watch a lot of English Premier League (EPL) football, so I’m acutely aware of its reputation as the Most Competitive League in the World (formerly, the Best League in the World). I’m not aiming to compare the quality of each league (UEFA coefficients solve that problem), but rather to determine whether the leagues themselves are becoming less competitive. This decade has seen the rise of foreign-owned, super-rich clubs across Europe (Man City, PSG) and the domination of domestic championships by a small elite (Bayern Munich in Germany, Juventus in Italy). Then again, just last year, relegation favourites Leicester City won the EPL, so maybe the EPL has become more competitive than ever.

To answer that, we need to quantify the competitiveness of a league. We’ll use two approaches: one based on graph theory and another, more conventional, statistical approach. I’m not particularly expecting the former to beat the latter; I just wanted an excuse to build a network graph populated with football teams.

Gathering the Data

There are numerous free sources of football data (well, at least for the major European leagues; you might struggle with the Slovakian Third Division or the Irish Premier Division). There’s a good summary here. And if you’re interested in R API wrappers, there’s the footballR package. As we want to look at historical trends within leagues, we’ll go down the csv route (APIs generally go back only a few years). The data will be sourced from this site. There’s no need to download the files; we can import the data directly into R using the appropriate URL. Let’s start with the last year of Alex Ferguson’s reign as Man United manager (the 2012-13 EPL season).

# loading the packages we'll need
# (library() stops with an error if a package is missing; require() only warns)
library(RCurl) # import csv from URL
library(dplyr) # data manipulation/filtering
library(visNetwork) # producing interactive graphs
library(igraph) # to calculate graph properties
library(ggplot2) # vanilla graphs
library(purrr) # map functions over lists

options(stringsAsFactors = FALSE)
epl_1213 <- read.csv(text=getURL("http://www.football-data.co.uk/mmz4281/1213/E0.csv"), 
                     stringsAsFactors = FALSE)
head(epl_1213[,1:10])
##   Div     Date  HomeTeam   AwayTeam FTHG FTAG FTR HTHG HTAG HTR
## 1  E0 18/08/12   Arsenal Sunderland    0    0   D    0    0   D
## 2  E0 18/08/12    Fulham    Norwich    5    0   H    2    0   H
## 3  E0 18/08/12 Newcastle  Tottenham    2    1   H    0    0   D
## 4  E0 18/08/12       QPR    Swansea    0    5   A    0    1   A
## 5  E0 18/08/12   Reading      Stoke    1    1   D    0    1   A
## 6  E0 18/08/12 West Brom  Liverpool    3    0   H    1    0   H

For each match in a given season, the data frame includes the score and various other columns we can ignore (mostly betting odds). First, we must think about our network. Networks are composed of nodes and edges, where an edge connecting two nodes indicates a relationship. In its simplest form, think of a network of people, where two nodes are joined by an edge if they’re friends. Networks can be either undirected or directed. The latter means that there’s a direction to the relationship (e.g. following someone on Twitter does not imply that they follow you, which contrasts with Facebook friends, where the relationship is mutual). We’ll keep things simple and opt for an undirected graph.
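
To make the undirected/directed distinction concrete, here is a minimal sketch using igraph. The three people and their relationships are made up purely for illustration (they are not part of the football data), and the same edge list is simply interpreted both ways.

```r
library(igraph)

# Hypothetical toy edge list: each row is one relationship between two people
friends <- data.frame(from = c("Ann", "Ann", "Bob"),
                      to   = c("Bob", "Cat", "Cat"))

# Undirected: the edge Ann-Bob is the same as Bob-Ann (Facebook-style friendship)
g_undirected <- graph_from_data_frame(friends, directed = FALSE)

# Directed: the edge Ann->Bob says nothing about Bob->Ann (Twitter-style follow)
g_directed <- graph_from_data_frame(friends, directed = TRUE)

is_directed(g_undirected) # FALSE
is_directed(g_directed)   # TRUE

# Same nodes and edges in both graphs; only the interpretation differs
vcount(g_undirected) # number of nodes
ecount(g_undirected) # number of edges

# In the directed graph we can count outgoing edges ("follows") separately
degree(g_directed, mode = "out")
```

In the undirected version every person simply has two connections, whereas the directed version distinguishes between Ann (who follows two people) and Cat (who follows nobody). That extra information is what we give up by keeping things simple.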