Literature Review

Lock and Nettleton (2014)
Lock and Nettleton use a Random Forest to predict game outcomes.They use many standard features to describe the game state—score differential, down, distance to first down, yard line, timeouts remaining for each team—but they added a clever feature, “adjusted score”, where

\(adjustedScore = \sqrt{\frac{scoreDifferential}{seconds+1}}\)

This forces the model to gain a sense of the interaction between score and time remaining. Additionally, the ability of a Random Forest to find interactions between variables as well as nonlinear relationships between predictors and the response makes it a good choice for modeling win probability in NFL games. Many predictors seem to have nonlinear effects and situational interactions that are difficult to generalize. The utilization of ensemble based learning methods to help tackle this problem is one that we also used.

Paine et al. (2019)
PFR models change in score differential from the current game state until the end of the game as:

\(final \: score \sim \mathcal{N}\)(\(\mu,\,\sigma^{2}\)) where
\(\mu = score \: differential + EPA + point \: spread \times (seconds \: remaining/3600)\) and
\(\sigma^{2} = 13.45 \times (seconds \: remaining/3600)\)

This allows PFR to leverage the fact that any draw from this distribution that is greater than zero corresponds to a win for the team possessing the ball and any draw that is less than zero corresponds to a loss. PFR is vague about how EPA is modeled, mentioning only that it is meant to capture the “expected average scoring consequences” associated with the down, distance to first down, yard line and time left of the game state. Though PFR’s model is relatively barebones and the page describing it does not really outline how PFR handles game states with five minutes or fewer remaining—the point at which they presume the assumption that change in score follows a normal distribution breaks down—the trick PFR uses to apply a normal distribution to the problem of win probability prediction is one that we will borrow. We also utilize a scaled point spread to help inform the \(\mu\) parameter of our normal distribution.

Horowitz, Ventura and Yurko (2018)
Win probability is modeled through a two step process in this paper. First, a multinomial logistic regression model is fitted to specify a probability distribution for the type of scoring play that will occur next given the variables: down, seconds remaining in the half, yard line, a log transform of yards to go for a first down, and indicator variables for whether the current set of downs is “goal-to-go”, meaning the end zone is the first down marker, and whether two minutes or fewer remain in the half. The seven possible outcomes include a touchdown, field goal, or safety by the current offense, a touchdown, field goal, or safety by the current defense, or no score within the same half. The multinomial regression is computed by running a logistic regression that compares the probability of each type of scoring play to the baseline of no scoring plays for the remainder of the half and normalizes the combined results of the paired logistic regressions to find a valid probability distribution. Next, the expected value of the resulting probability distribution is added to the current score differential to produce an expected score differential. The expected score differential is plugged into a GAM, along with the timeouts remaining for each team, seconds remaining in the half, a ratio of expected score to seconds remaining, and the indicators for goal-to-go and whether two minutes or fewer remain. This GAM predicts the probability that the team possessing the ball will win the game. The decision to first predict the next scoring play before attempting to predict the overall win probability is one we intend to mirror. We will also make the choice to model the expected points added by field position as the expected value of a multinomial distribution instead of as the expected value of a continuous distribution. One quirk of the model is that it does not consider team strength when outputting predictions—each game begins with both teams possessing a 50% chance of winning per the model. The authors actually designed this model merely to have a coherent method of calculating the expected points added or subtracted by the players on the field4 and adding in team strength would “artificially inflate (deflate) the contributions made by players on bad (good) teams” (Horowitz, Ventura and Yurko, p. 15). As such the model is purposefully leaving a little bit on the table.

Friedman (2001 & 2002)
Friedman’s Gradient Boosting Machine works by additively fitting a succession of decision stumps to create a forest. Gradient descent is performed using a specified loss function and each successive tree is fit to minimize the loss function. The ability to specify a loss function to use in the gradient descent process allows gradient boosting to be extended to the multinomial distribution, making it a good fit for some of the multinomial classification problems that pop up in this paper. Boosting models offer many of the same advantages as the random forests used by Lock and Nettleton (2014).

Schatz et al. (2019)
Football Outsiders created a proprietary metric called Defensive-adjusted Value Over Average (DVOA) that they use to measure offensive, defensive and special teams success on a play by play basis. DVOA adjusts for context, meaning a five yard pick-up on third and four is worth more than a nine yard pick-up on third and fifteen, and uses a system of “success points” to evaluate plays. Teams with an offensive DVOA of ten percent are ten percent better than league average. Defensive DVOA works the opposite way—a team with a defensive DVOA of ten percent is ten percent worse than league average. DVOA is one of the most commonly used measurements of offensive, defensive and special teams caliber, hence its use in this paper.

Eager et al. (2019)
Pro Football Focus has a team of analysts that watches every play of the NFL season and assign grades to players for each play. The grades are based on a rubric before being adjusted for the difficulty level of each play (i.e. was a quarterback being pressured when he threw or did he have a clean pocket). These grades may be subject to human error.

Burke and Katz (2017)
ESPN’s Total QBR works by analyzing each play a quarterback is involved in and classifying the play as either a success or a failure. Burke and Katz then turn this classification into a score by factoring in the competitiveness of the game situation5, the quarterback’s level of responsibility for the play outcome, and the level of difficulty of the play.


  1. The model is just one section of their paper!

  2. Competitve situations are weighted more heavily.