Friday, 25 January 2013

Estimating Overachievement in the Premier League

In this post, I'll show how we can use a statistical model to make predictions of Premier League finishing position from measures of performance. We might then interpret deviations from these predictions as an indication of a team's overachievement or underachievement. I'll present the results for all teams participating in the 2012/13 Premier League season, and then focus a little bit on Newcastle United. Newcastle's unlikely 5th place finish in the 2011/12 season has already been analysed by various authors, for example in this examination of "luck" by Mark Taylor, but it is hoped that this post will add further insight.

Defining "performance"

Our first task is to define performance in a way that is measurable. A single measure of performance of course does not exist, but it might be possible to approximate it using a variety of measures that are available. To start with, perhaps we can think of possession as being a good indicator of performance; in principle, if a team is able to hold on to the ball for 100% of a match then it will maximise its chances of scoring and minimise those of its opponent. Chart 1 illustrates that possession is indeed correlated with Premier League finishing position. However, possession doesn't necessarily mean goals; keeping the ball for long periods of a game will be futile if the team isn't able to turn possession into goal scoring chances. Equally, if the team rarely relinquishes possession, but defends poorly when it does, then it is likely to concede a relatively large number of goal scoring chances. Therefore, we might also want to include 'average number of shots taken per game' and 'average number of shots conceded per game' in our definition of performance. However, not all teams create - and concede - shots on goal of the same quality. We can define shot quality in terms of both distance from goal and the position from which the shot was taken. Teams that are typically able to create chances in the penalty area, and in the centre of the pitch, can expect to score a relatively high number of goals, whilst those that typically concede such chances can expect to concede a relatively high number of goals.

Chart 1: Possession vs league position (last four seasons) 

So, now we've got our definition of "performance":
  1. average percentage possession
  2. average number of shots on goal (both taken and conceded)
  3. proportion of shots in the penalty area (both taken and conceded)
  4. proportion of shots in the centre of the pitch (both taken and conceded)
This is obviously a simplification of a complex reality, but it is hoped that these factors capture some of the most important elements of a team's performance (at least to a typical football fan - like me - who would generally prefer their team to play "attractive football", rather than try to "win ugly").

A model to predict league position from performance level

The next step is to incorporate these factors into a statistical model in order to make predictions of a team's league position, given their typical level of performance over the course of a season. The cumulative logit model (otherwise known as the proportional odds or ordered logistic regression model) is appropriate for this job; it is designed for situations where the variable of interest (league position in this case) has a natural order. Such a model has been used previously to predict outcomes in football; for example, see this post by Zach Slaton on predicting results for individual matches. Unfortunately, we are not able to make predictions of individual league positions; we only have four seasons' worth of data (2009/10 to 2012/13), so we only observe performance levels - and corresponding league positions - for four first placed teams, four second placed teams, four third placed teams, and so on, which is insufficient for producing reliable estimates. To get around this problem, we can group the positions into 1-4, 5-8, 9-12, 13-16 and 17-20.
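
To make the mechanics concrete, here is a minimal Python sketch of how a cumulative logit model turns a single performance score into a probability for each of the five groups. It is an illustration only, not the code used for this post: the function names, the cutpoint values and the sign convention (a higher score pushes a team towards the lower groups) are all assumptions, and in practice the score and cutpoints would be estimated from the four seasons of data.

```python
import math

# Position groups used in the post, best to worst.
GROUPS = ["1-4", "5-8", "9-12", "13-16", "17-20"]


def position_group(position):
    """Map a league position (1-20) to its group label."""
    if not 1 <= position <= 20:
        raise ValueError("position must be between 1 and 20")
    return GROUPS[(position - 1) // 4]


def group_probabilities(score, cutpoints):
    """Cumulative logit probabilities for each ordered group.

    Assumed convention: P(group <= j) = 1 / (1 + exp(-(cutpoint_j - score))),
    where `score` is the weighted sum of the performance measures.
    Four increasing cutpoints give five group probabilities.
    """
    if list(cutpoints) != sorted(cutpoints):
        raise ValueError("cutpoints must be in increasing order")
    cumulative = [1.0 / (1.0 + math.exp(-(c - score))) for c in cutpoints]
    cumulative.append(1.0)  # the last group absorbs the remaining probability
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical cutpoints and score, chosen only to show the shape of the output.
example = group_probabilities(0.0, [-2.0, -1.0, 1.0, 2.0])
most_likely = GROUPS[example.index(max(example))]
```

The five probabilities always sum to one, and the group with the largest probability is the one treated below as the team's most likely finishing group.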

For any team in any Premier League season between 2009/10 and 2012/13, the cumulative logit model gives us the predicted probability of the team finishing in the 1-4 group, the 5-8 group, and so on. The predictions are based on typical performance levels - and corresponding league positions - of all teams over the four-year period. We can take the group with the highest probability as being the team's most likely finishing group, given their typical performance level for the season. If the team actually finishes in the group with the highest predicted probability, then they have finished "where they deserve" according to their typical level of performance, as we have defined it. However, if they actually finish above where the model predicts, they have overachieved. Conversely, if they actually finish below where the model predicts, they have underachieved. So, for example, a team that finished 10th will have overachieved by between 3 and 6 places if they were predicted to have finished in the 13-16 group. On the other hand, the same team will have underachieved by between 6 and 9 places if they were predicted to have finished in the 1-4 group. Any estimated overachievement or underachievement by a team will be due to a combination of two factors:
  1. elements of the team's performance that are not well captured by our model
  2. factors other than performance, including random events that tend to cancel out over the course of a season - this is typically described by the footballing community as "luck" (for example, benefiting from favourable refereeing decisions or staying injury-free for most of the season) and is largely not under the team's control
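
The overachievement ranges quoted in this post follow from simple arithmetic on the actual position and the bounds of the predicted group. A minimal sketch in Python, where the function name and the convention of returning negative numbers for underachievement are my own choices for illustration:

```python
def overachievement_range(actual_position, predicted_group):
    """Return (low, high) places of overachievement.

    `predicted_group` is a (best, worst) pair such as (13, 16).
    Negative values indicate underachievement; (0, 0) means the team
    finished inside its predicted group.
    """
    best, worst = predicted_group
    if best <= actual_position <= worst:
        return (0, 0)
    return (best - actual_position, worst - actual_position)


# The two worked examples from the text: a team finishing 10th.
over = overachievement_range(10, (13, 16))   # predicted 13-16
under = overachievement_range(10, (1, 4))    # predicted 1-4
```

Here `over` is `(3, 6)`, matching the "between 3 and 6 places" above, and `under` is `(-9, -6)`, that is, underachievement of between 6 and 9 places.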

Some results from the model

Table 1 shows overachievement levels for each of the 20 teams participating in the 2012/13 Premier League season (after 23 games). Note that negative overachievement indicates underachievement. Stoke City are overachieving the most, by between 7 and 10 places, as they are currently in 10th place but their predicted probability is maximised in the 17-20 group. Wigan Athletic are underachieving the most, by between 7 and 10 places, as they are currently in 19th place but their predicted probability is maximised in the 9-12 group. The shading in table 1 is determined by quintile, so that the smallest 20 of the 100 predicted probabilities (20 teams by 5 groups) are shaded lightest and the largest 20 are shaded darkest.
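
For anyone wanting to reproduce the shading, one plausible reading of "determined by quintile" is a rank-based split of all the probabilities in the table into five equal-sized bins. The sketch below assumes exactly that; the function name is hypothetical, and ties are broken by input order, which may differ from how the original table was shaded.

```python
def quintile_shades(probabilities):
    """Assign each probability a shade from 0 (lightest) to 4 (darkest).

    Values are ranked from smallest to largest and split into five
    equal-sized bins, so 100 probabilities give 20 per shade.
    """
    n = len(probabilities)
    order = sorted(range(n), key=lambda i: probabilities[i])
    shades = [0] * n
    for rank, index in enumerate(order):
        shades[index] = rank * 5 // n
    return shades


# Made-up probabilities purely to show the binning, not the model's output.
demo = [((37 * i) % 100) / 100 for i in range(100)]
demo_shades = quintile_shades(demo)
```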

Table 1: Overachievement by team (2012/13)
 

Let's take a more detailed look at the results for Newcastle United, which are shown for the last three Premier League seasons (the team played in the Championship in the 2009/10 season) in chart 2 and table 2 - where the shading again reflects quintiles, but this time over the whole four-season dataset. Newcastle underachieved in 2010/11, by between 4 and 7 places; the team finished 12th but their predicted probability was maximised in the 5-8 group, closely followed by the 9-12 group. The profile of predicted probabilities in 2011/12 was very similar to that in 2012/13, with the predicted probability maximised in the 13-16 group for both seasons. Newcastle are therefore neither overachieving nor underachieving so far this season, as they currently lie in 16th place. However, the team overachieved by between 8 and 11 places last season, eventually finishing in 5th place. This is the joint largest overachievement by any Premier League team over the past four seasons, matched only by Birmingham City's 9th place finish in 2009/10, as illustrated by table 3. These results suggest that Newcastle's typical performance levels (defined by possession, and numbers and quality of shots taken/conceded) didn't change dramatically between the 2011/12 and 2012/13 seasons, but other factors contributed to the team's 5th place finish in 2011/12. These "other factors" include elements of performance that are not well captured by our model, and factors other than performance that are not under the team's control.

Table 2: NUFC's overachievement by season

Chart 2: NUFC's predicted probabilities by position

Table 3: Top three overachievers (last four seasons)
