Scoring rule

In decision theory, a scoring rule and a scoring function each provide an ex post summary measure for evaluating the quality of a prediction or forecast. They assign a numeric score to a single prediction given the actual outcome; depending on the sign convention, this score can be interpreted as a loss or a reward for the forecaster. Scoring rules assess probabilistic predictions or forecasts, i.e. predictions of the whole probability distribution of the outcome. Scoring functions, on the other hand, assess point predictions, i.e. predictions of a property or functional of the probability distribution of the outcome, such as its expectation or its median.
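The distinction can be sketched in a few lines of Python. Here the logarithmic score stands in for a scoring rule (it consumes the whole predicted density) and squared error stands in for a scoring function (it consumes only a point prediction); the helper names are our own illustration, not standard library functions:

```python
import math

def log_score(pdf, y):
    """Logarithmic scoring rule: negative log density of the observed
    outcome y under the predicted distribution (lower is better)."""
    return -math.log(pdf(y))

def squared_error(point, y):
    """Scoring function for the expectation: squared error of a
    point prediction (lower is better)."""
    return (point - y) ** 2

# Predicted distribution: a standard normal density.
pdf = lambda y: math.exp(-y ** 2 / 2) / math.sqrt(2 * math.pi)

print(log_score(pdf, 0.5))      # scores the whole predicted distribution
print(squared_error(0.0, 0.5))  # scores only the point prediction 0.0
```

The scoring rule needs the entire predicted distribution, while the scoring function only ever sees a single number.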

Scoring rules answer the question "how good is a predicted probability distribution given the observation of the actual outcome?" A scoring rule is (strictly) proper if its expected score is (uniquely) minimized when the predicted distribution equals the true underlying distribution of the target variable. The score of an individual observation may still favor an incorrect prediction, but on average, predicting the "correct" distribution minimizes the expected score.
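Propriety can be checked numerically for the Brier score of a binary event. In this sketch we assume a hypothetical true event probability of 0.7 and search a grid of forecasts; the expected score is minimized exactly at the honest forecast:

```python
def brier(q, outcome):
    """Brier score for a binary forecast q given an outcome in {0, 1}."""
    return (q - outcome) ** 2

def expected_brier(q, p):
    """Expected Brier score of forecast q when the true probability is p."""
    return p * brier(q, 1) + (1 - p) * brier(q, 0)

p = 0.7  # assumed true event probability (illustrative choice)
scores = {q / 100: expected_brier(q / 100, p) for q in range(101)}
best_q = min(scores, key=scores.get)
print(best_q)  # 0.7 -- the honest forecast minimizes the expected score
```

Repeating the search for any other value of `p` yields `best_q == p`, which is exactly what strict propriety asserts.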

In the same way, scoring functions answer the question "how good is a point prediction given the observation of the actual outcome?" A scoring function is (strictly) consistent for a given functional (such as the mean or a quantile) if its expected score is (uniquely) minimized when the point prediction equals (or is among) the true values of that functional of the underlying distribution of the target variable.
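Two classic instances: squared error is consistent for the mean, and absolute error is consistent for the median. A minimal sketch, using a small skewed sample as a stand-in for the true distribution and a grid search over candidate point predictions:

```python
import statistics

sample = [1, 2, 2, 3, 10]            # skewed sample standing in for the true distribution
grid = [i / 10 for i in range(121)]  # candidate point predictions 0.0 .. 12.0

def mean_score(score, point):
    """Average score of a point prediction over the sample."""
    return sum(score(point, y) for y in sample) / len(sample)

squared = lambda x, y: (x - y) ** 2  # consistent for the mean
absolute = lambda x, y: abs(x - y)   # consistent for the median

best_sq = min(grid, key=lambda x: mean_score(squared, x))
best_abs = min(grid, key=lambda x: mean_score(absolute, x))

print(best_sq, statistics.mean(sample))     # 3.6 -- the sample mean
print(best_abs, statistics.median(sample))  # 2.0 -- the sample median
```

Note that the two scoring functions pick out different point predictions from the same sample: choosing a scoring function implicitly chooses which functional of the distribution is being elicited.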

Scoring rules and scoring functions are often used as "cost functions" or "loss functions" of forecasting models. Given a sample of forecasts and corresponding observations of the outcome, the forecasts can be evaluated by the empirical mean of their scores over the sample, often itself called the "score". The scores of different models or forecasters can then be compared to conclude which model or forecaster is best.
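Such a comparison can be sketched as follows. The forecasts and outcomes below are invented for illustration; a sharp, well-calibrated forecaster is compared against an uninformative one using the mean Brier score:

```python
def brier(q, outcome):
    """Brier score for a binary forecast q given an outcome in {0, 1}."""
    return (q - outcome) ** 2

outcomes   = [1, 0, 1, 1, 0]             # observed outcomes (illustrative)
forecast_a = [0.9, 0.2, 0.8, 0.7, 0.1]   # sharp, well-calibrated forecaster
forecast_b = [0.5, 0.5, 0.5, 0.5, 0.5]   # uninformative forecaster

score_a = sum(brier(q, y) for q, y in zip(forecast_a, outcomes)) / len(outcomes)
score_b = sum(brier(q, y) for q, y in zip(forecast_b, outcomes)) / len(outcomes)

print(score_a, score_b)  # the lower mean score identifies the better forecaster
```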

For example, consider a probabilistic model that predicts, based on an input x, a Gaussian distribution with mean μ(x) and standard deviation σ(x). A common interpretation of probabilistic models is that they aim to quantify their own predictive uncertainty. An observed target variable y is then compared to the predicted distribution and assigned a score S(y, μ(x), σ(x)). Training a probabilistic model on a (proper) scoring rule should "teach" the model to predict a small σ(x) when its uncertainty is low and a large σ(x) when it is high, resulting in calibrated predictions while minimizing the predictive uncertainty.
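With the logarithmic score (the Gaussian negative log-likelihood, a common training loss for such models), this trade-off is visible directly. A minimal sketch, with illustrative numbers of our own choosing:

```python
import math

def gaussian_nll(mu, sigma, y):
    """Logarithmic score (negative log-likelihood) of observation y
    under a predicted normal distribution N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

y = 1.0  # observed target value
# A sharp, accurate prediction scores best (lowest). An overconfident miss
# is heavily penalized; an underconfident hedge is penalized for vagueness.
print(gaussian_nll(1.0, 0.5, y))  # accurate and sharp
print(gaussian_nll(2.0, 0.5, y))  # overconfident miss
print(gaussian_nll(1.0, 5.0, y))  # underconfident hedge
```

The score thus rewards a model only for reporting small σ when its mean prediction is actually close to the outcome, which is what drives calibration during training.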

Although this example concerns the probabilistic forecasting of a real-valued target variable, a variety of scoring rules have been designed with different target variables in mind. Scoring rules exist for binary and categorical probabilistic classification, as well as for univariate and multivariate probabilistic regression.