# Validation of the backtesting process under the targeted review of internal models: practical recommendations for probability of default models

## Lukasz Prorokowski

#### Need to know

• This paper complements the validation function. The purpose of this study is to provide practical guidance on validating the backtesting process.
• For the introductory validation of the backtesting process, the validation team should assess the data quality standards, data directory and databases used for creating the backtesting dataset.
• For the assessment of the backtesting frequency, the validation team should confirm the appropriateness of the size of the observation window in order to ensure that the poor performance of the model is not masked.
• For the validation of the calibration backtest, the validation team should assess the homogeneity of the rating groups, differences between the PD estimates and realised observations and the appropriateness of the hypothesis underpinning the analysed statistical tests.
• For the validation of the discriminatory power, the validation team should at least interpret the curves with the associated statistics and assess the number of defaults.
• For the validation of the stability backtest, the validation team should assess the power of the statistical test based on its assumptions and justify any PD deviations using macroeconomic factors.

#### Abstract

This paper provides practical recommendations for the validation of the backtesting process under the targeted review of internal models (TRIM). It advises on the introductory steps for validating the backtesting process and reviews the available statistical tests for calibration, discrimination and stability backtesting. The TRIM regulatory exercise is an international supervisory initiative that inspects the internal models and related internal risk and governance policies of eurozone banks that are permitted to use the advanced internal risk-based (AIRB) approach. Under the TRIM guidelines, the designated banks should have specific policies and internal guidelines for the validation of the backtesting process. Further, the affected banks are required to validate the entire backtesting process. Addressing these needs, this paper serves as a basis for producing such policies and utilizing appropriate statistical tools for validating the backtesting process. The paper focuses on probability of default models. To date, no academic study has discussed the validation of the backtesting process with reference to the TRIM rules.

## 1 Introduction

### 1.1 Purpose of the study

As part of the credit risk model management framework of every eurozone bank that uses its own internal credit risk models, validation guidelines for the backtesting process are required for complete credit risk governance under the targeted review of internal models (TRIM), enforced by eurozone regulators (European Central Bank 2017). Currently, there are no academic studies that address the validation of the backtesting process with reference to the TRIM rules.

TRIM is an international supervisory initiative that sends European Central Bank (ECB) inspectors to review internal models used by eurozone banks to calculate regulatory capital (Raymond and Steffen 2017). As noted by Maltezos (2017), the TRIM inspection of credit risk models is very thorough and involves not only assessing models but also reviewing a bank’s credit risk management framework, validation processes and risk governance (eg, internal policies for backtesting). TRIM is mainly based on the Capital Requirements Regulation (CRR) and the European Banking Authority’s Final Draft Regulatory Technical Standards (European Banking Authority 2016). Within this regulatory framework, Article 185 of the CRR sets specific requirements for validating internal estimates and the backtesting process.

According to Basel Committee on Banking Supervision (2010) and Lore and Borodovsky (2000), the backtesting process should be fully outlined in policies and procedures as well as embedded in sound market practices. Thus, this paper has been written in conjunction with the TRIM guidelines. It presents general guidelines for validating the backtesting process to banks that are required to conduct ongoing validations of their credit risk models in order to demonstrate to regulators and senior management that the models in use are appropriate. This paper serves to provide a means of identifying issues related to backtesting credit risk models. In doing so, it outlines best practice with regard to validating the backtesting process that is aligned with existing regulatory frameworks: namely, CRR and TRIM. Thus, the guidelines ensure that the assumptions for credit risk models remain robust and known limitations are addressed. In summation, the purpose of this paper is to provide insights into the following:

• outlining the principal terminology used in backtesting;

• determining the relationship between validation and backtesting;

• exploring the data set requirements for backtesting;

• validating the calibration of PD models;

• validating the discriminatory power of PD models; and

• validating the stability of PD models.

### 1.2 Scope of the study

This study focuses on validating the backtesting process, which is defined as the quantitative comparison of the model’s estimates against realized values (Basel Committee on Banking Supervision 2010). It attempts to complement the study of Campbell (2006), which discusses the existing developments in backtesting procedures. As noted by Castermans et al (2010), backtesting is only one element of validation. However, according to Basel Committee on Banking Supervision (2010) and Jones and Mingo (1999), this area requires additional instruction. This is due to the fact that backtesting is continuously evolving, with no definite methodology or new statistical tests being specified.

The scope of this study is defined by Schnitzler et al (2014) as the entire program of conducting a backtest, which includes comparisons of not only the model’s estimates but also the bank’s portfolios, data selection processes and development of appropriate statistical tests. The backtesting process also involves investigating the backtesting results and proposals for remedial actions where needed (Loterman et al 2014).

This paper focuses on the backtesting process for a common type of credit risk model: the probability of default (PD) model. The proposed validation tools are tested empirically on two rival PD models developed and backtested by domestically significant credit institutions in Belgium. Models encompassing corporate obligors have been selected for the empirical tests because medium and large corporations are the most common types of counterparty for credit institutions in Europe.

Both PD models have been applied to the same portfolio of midcorporate obligors and backtested by the credit institutions. The backtesting data set associated with the models is contained in the online appendix. Thus, this paper is able to demonstrate the practical application of the discussed tools in the process of validating the backtesting results of the PD models.

### 1.3 Key concepts

Backtesting is a model validation process that compares an internal model’s estimates with the actual realized observations (Engelmann and Rauhmeier 2006). As indicated by Morone and Cornaglia (2009), the purpose of initiating a backtesting exercise is to evaluate the predictive power and performance of a model and how that model performs over time. Backtesting serves to flag model deterioration, ie, when a backtested model starts to underestimate or overestimate risk.

The model’s estimates are produced at a specific point in time (initialization point): this is the date on which the forecasts are issued (Kao 2000). Each estimate has a time horizon, which is the time span between the initialization point and the realization of the estimate (Allen et al 2004).

The backtesting data set contains both the model’s estimates and the corresponding realizations of these estimates. Chorniy and Lordkipanidze (2015) have shown that the backtesting data set can be constructed in various ways and might consist of the forecasts of exposures or risk factors. The time period over which the data is aggregated is referred to as the observation window (Batislam et al 2007). Hereto, Wu and Olson (2010) argue that using a small observation window negatively affects one’s ability to produce meaningful conclusions from the backtesting exercise.

The backtesting process assesses the following characteristics of the internal models: calibration, discrimination and stability (see Table 2). Hereto, calibration refers to the mapping of a rating to a quantitative risk measure (Gordy 2000). Discrimination assesses how well the model delivers an ordinal ranking of the risk measure considered (Prorokowski 2016). Stability evaluates the extent of the population drift, comparing the population used at model development with the population applied to the model in use (Berkowitz and O’Brien 2002).

The midcorporate obligors are nonfinancial firms that fall within the scope of the delivered PD models due to specific characteristics, eg, having total assets above €2 million and a turnover threshold of between €2 million and €50 million. The portfolio of midcorporate obligors, to which the two rival models described in Table 1 are applied and then backtested, remains concentrated in the following industrial sectors: construction (30%); wholesale and retail trade (27%); real estate activities (15%); manufacturing (12%); and transportation and storage (6%). The geographical breakdown of the midcorporate obligors is driven by entities domiciled in Belgium (85%), the Netherlands (6%), Luxembourg (5%) and France (4%).

## 2 Validation of the backtesting process: guidelines

This section presents the guidelines for validating the backtesting of a PD model in its entirety. Different methodological approaches are discussed here for validating the backtesting of the PD models. It is ultimately the responsibility of the internal validation teams at the affected banks to ensure the backtesting process conforms to the standards and requirements outlined in the TRIM guidelines. This section aims to provide advice on how to validate the backtesting process, from the review of the backtesting data set to the statistical tests for validating the stability, discriminatory power and calibration of the PD models. The paper recommends solutions that are considered especially useful in satisfying a TRIM inspection.

### 2.1 Prerequisite and backtesting data set

Before applying specific statistical tests for calibration, discrimination and stability, the validation team should assess the relevant policies and standards underpinning the backtesting process. This is a standard procedure recommended by Castermans et al (2010) and implemented in many financial institutions (Prorokowski and Prorokowski 2014). The TRIM guidelines (eg, Article 57(a) of TRIM) highlight the need for banks to have up-to-date internal validation policies accompanying their established procedures for validating the backtesting process.

As shown by Van Gestel et al (2006), the internal validation team is responsible for ensuring that the backtesting data set contains both the internal model’s estimates and the realized observations. It is recommended by Carey and Hrycay (2001) that internal data is used wherever feasible for the purpose of backtesting. When using external data, the choice should be justified, and the data should be subject to an acceptable level of quality checking.

As far as Article 57(b) of TRIM is concerned, backtesting should be based on the risk database and not on an intermediate extraction. Therefore, it should be ensured that the validation team has its own access to the relevant databases being used to generate the backtesting data set.

Data unavailability constitutes an important issue in the backtesting process, especially at the stage of creating the backtesting data set (Jackson and Perraudin 2000). Portfolio aggregation serves to address this problem. Wang (1998) argues that there are different ways of aggregating portfolios that should be assessed by the internal validation team. Table 3 presents the suggested focus of the validation in relation to analyzing portfolio aggregation methods.

According to the TRIM guidelines, backtesting, as per Article 185(b) of the CRR, should be performed at least once a year. This requirement is reiterated in European Banking Authority (2016). This paper shares the opinion of Jones and Mingo (1999) that a regular backtesting frequency serves to ensure the continuous good quality of internal credit risk models with timely adjustments. Thus, as a minimum, the backtesting of banks’ internal models should be performed annually.

As shown in Table 3, using a small observation window negatively affects the ability to produce meaningful conclusions from the backtesting exercise. However, using an observation window spanning a number of years over benign market conditions can mask current performance and a model’s responsiveness to stressed macroeconomic conditions (Harmantzis et al 2006). The backtesting framework should be responsive to a model’s recent performance.

### 2.2 Calibration

In backtesting, calibration constitutes the assessment of discrepancies between the estimated values and the realized observations (Morone and Cornaglia 2009). The correct calibration of an internal model means that the estimates are accurate and conform to the observed default rates (Nolde and Ziegel 2017).

As noted by Bikker and Metzemakers (2005), in practice, PD estimates deviate from the observed default rates. Validation of the calibration should focus on answering the question of whether the deviations are random or systematic. According to Basel Committee on Banking Supervision (2005a) and Van Vuuren et al (2017), the systematic underestimation of PDs causes an inadequacy of regulatory capital to cover the risk incurred by banks.

The following statistical tests are those most commonly applied for PD models to validate the calibration:

• the binomial test;

• the Hosmer–Lemeshow test;

• the traffic light approach (Tasche 2003);

• the Spiegelhalter test;

• the normal approximate test;

• the Vasicek test; and

• the Brier score.

#### Binomial test

The binomial test evaluates whether the observed number of defaults is lower than or equal to the expected number of defaults (Bernstein 1962). In practice, the test is run to check if the observed test results differ from what has been expected. The null hypothesis of the binomial test states that the PD of a rating category is correct. The alternative hypothesis is that the PD has been underestimated. The binomial test is used when there are two possible outcomes (eg, default or nondefault) and there is a notion about the PD from its estimate. Lopez and Saidenberg (2000) have stressed that the binomial test relies on an assumption that the default events in the rating category are independent. However, the review of empirical studies conducted by Basel Committee on Banking Supervision (2005b) shows that the defaults are correlated, with the correlation coefficients ranging from 0.5% to 3%. Due to the presence of correlated defaults, the true size of the type I error (an unjustified rejection of the null hypothesis) will be higher than the test indicates.

Given the aforementioned characteristics of the binomial test and the fact that the test does not take into account macroeconomic volatility, the validation team should look at the following specific aspects when validating the calibration using a binomial test:

• Is the null hypothesis correctly specified?

• Are the default events in the rating category independent? Is an assumption of independence justified?

• When using the two-sided test, is the $p$-value doubled?
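The one-sided binomial test described above can be illustrated with a minimal sketch. It assumes independent defaults within the rating grade, which is precisely the assumption the validation team is asked to challenge. The function names and the example figures (500 obligors, 12 defaults, a PD estimate of 1%) are hypothetical and are not taken from the backtesting data set.

```python
from math import exp, lgamma, log


def binomial_tail(n: int, d: int, pd: float) -> float:
    """P(D >= d) for D ~ Binomial(n, pd); assumes independent defaults, 0 < pd < 1."""
    total = 0.0
    for k in range(d, n + 1):
        # Work in logs so that large portfolios do not overflow.
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(pd) + (n - k) * log(1.0 - pd))
        total += exp(log_pmf)
    return min(total, 1.0)


def binomial_test(n: int, d: int, pd: float, alpha: float = 0.05):
    """One-sided test: H0 'the PD is correct' against H1 'the PD is underestimated'."""
    p_value = binomial_tail(n, d, pd)
    return p_value, p_value < alpha


# Hypothetical rating grade: 500 obligors, 12 observed defaults, PD estimate of 1%.
p_value, rejected = binomial_test(n=500, d=12, pd=0.01)
print(f"p-value = {p_value:.4f}, reject H0: {rejected}")
```

Because correlated defaults inflate the true type I error, a rejection produced by this sketch should be read as an indication for further analysis rather than as conclusive evidence of underestimation.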

#### Hosmer–Lemeshow test

The Hosmer–Lemeshow test is a goodness-of-fit test for logistic regression, especially for risk prediction models such as PD models. A goodness-of-fit test tells us how well the data fits the model (Andersen 1973). Specifically, the Hosmer–Lemeshow test calculates if the observed default rates match the expected default rates (Altman and Sabato 2007).

As evidenced by Matias-Gama and Amaral-Geraldes (2012), for the forecasted default probabilities of debtors in rating categories, the test statistic is defined such that, by the central limit theorem, it converges toward a $\chi_{k+1}^{2}$ distribution if all the estimated PDs for debtors with rating “$i$” are true default probabilities. However, it should be noted that for the very small forecasted default probabilities of debtors in rating categories, the rate of convergence toward a $\chi_{k+1}^{2}$ distribution may be low. In practice, the test output returns the $\chi^{2}$ statistic and the $p$-value. According to Hosmer et al (1997), small $p$-values would indicate that the PD model is a poor fit: the closer the $p$-value is to zero, the worse the estimation.

The Hosmer–Lemeshow test does not take overfitting into account and tends to have low power (Steyerberg et al 2010). There must be a sufficient number of groups of ratings for the test to flag poor calibration as well as a large enough number of ratings in each group to assess the differences between observed and estimated default probabilities (Lemeshow and Hosmer 1982). The small values of the ratings mean that the test is at risk of being unable to find misspecifications. There are also some problems, indicated by Archer et al (2007), related to the choice of bins that ultimately affect the power of the Hosmer–Lemeshow test. Therefore, the validation team should look at these specific aspects when validating the calibration using a Hosmer–Lemeshow test:

• Is the data regrouped by ordering the PD estimates?

• Are there enough observations for each group to obtain meaningful results?

• Is the test applied to the logistic regression model?

• At which level of $p$-value (below 0.05) should the model be rejected due to miscalibration?

• Does the number of groups (gradings/ratings) affect the test? What is the right number of groupings?
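A minimal sketch of the Hosmer–Lemeshow statistic for forecast PDs grouped by rating grade is given below. It assumes that the PDs are out-of-sample forecasts, so the degrees of freedom are set equal to the number of grades; for a logistic regression assessed in-sample on $g$ groups, $g-2$ degrees of freedom are customary instead. The helper `chi2_sf` is a hypothetical hand-written approximation of the $\chi^{2}$ survival function based on a series expansion, and the grade figures are invented for illustration.

```python
from math import exp, lgamma, log


def chi2_sf(x: float, df: float) -> float:
    """Upper tail P(X > x) of a chi-square distribution, via the series for
    the regularized lower incomplete gamma function (illustrative accuracy)."""
    if x <= 0.0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= h / (a + n)
        total += term
    lower = total * exp(-h + a * log(h) - lgamma(a))
    return max(0.0, 1.0 - lower)


def hosmer_lemeshow(grades):
    """grades: list of (n_obligors, n_defaults, pd_estimate), one tuple per grade."""
    stat = sum((n * p - d) ** 2 / (n * p * (1.0 - p)) for n, d, p in grades)
    df = len(grades)  # assumption: forecast PDs, one degree of freedom per grade
    return stat, chi2_sf(stat, df)


# Hypothetical grades: (obligors, observed defaults, PD estimate).
example_grades = [(400, 3, 0.005), (300, 5, 0.02), (200, 12, 0.05), (100, 14, 0.10)]
hl_stat, hl_p_value = hosmer_lemeshow(example_grades)
print(f"HL statistic = {hl_stat:.3f}, p-value = {hl_p_value:.3f}")
```

Rerunning the sketch with different groupings of the same obligors is a simple way to check how sensitive the $p$-value is to the choice of bins.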

#### Traffic light approach

In backtesting, the PD forecast for a homogeneous portfolio of loans has to be compared with the realized default rate one year later. At this point, due to the fact that default events are correlated, the binomial test becomes unreliable for backtesting PD calibration. Tasche (2003) proposes the traffic light approach to address the shortcomings of the binomial test.

The PD estimate (forecasted default rate) is known as $p$. Two confidence levels are set:

• $\alpha_{\mathrm{low}}$ (eg, 95%); and

• $\alpha_{\mathrm{high}}$ (eg, 99.9%).

Thus, knowing $p$ and having the confidence levels set, critical values can be found using a stochastic model (the Vasicek one-factor model, which underpins the Basel II risk weight functions):

• $c_{\mathrm{low}}$, with the probability that the realized number of defaults exceeds $c_{\mathrm{low}}$ being equal to $100\%-\alpha_{\mathrm{low}}$; and

• $c_{\mathrm{high}}$, with the probability that the realized number of defaults exceeds $c_{\mathrm{high}}$ being equal to $100\%-\alpha_{\mathrm{high}}$.

Finally, the following traffic lights can be set.

• Green: if the realized number of defaults is less than $c_{\mathrm{low}}$. In this case, there is no obvious contradiction between the estimated and realized default rates.

• Yellow: if the realized number of defaults is equal to or greater than $c_{\mathrm{low}}$ but less than $c_{\mathrm{high}}$. In this case, the realized default rate is not compatible with the PD estimate. However, the difference between the realized rate and the forecast is still in the range of usual statistical fluctuations.

• Red: if the realized number of defaults is equal to or greater than $c_{\mathrm{high}}$. In this case, the difference between the forecast (PD estimate) and the realized default rate is very large and cannot be attributed solely to statistical fluctuations.

Setting $c_{\mathrm{low}}$ and $c_{\mathrm{high}}$ requires the use of the Vasicek one-factor model. Hereto, the selection of the appropriate asset correlation parameter remains problematic (Blochwitz et al 2004). It should be noted that Basel II assumes the highest correlation at a level of 0.24. All in all, for a given PD estimate $p$ and the asset correlation $\rho$, by using a granularity adjustment approach for setting $c_{\mathrm{low}}$ and $c_{\mathrm{high}}$ as well as the formulas for the derivatives developed by Martin and Wilde (2002), the critical values $c_{\mathrm{low}}$ and $c_{\mathrm{high}}$, which respect correlations, can be specified for the traffic light approach. A further approach for determining the critical values involves approximating the distribution of the default rates with a Beta distribution (Martens et al 2010), where the parameters are determined by matching both the expectation and the variance of the default rate.

The validation team should look at these specific aspects when validating the calibration using a traffic light approach:

• What is the assumed asset correlation? Is it higher than 0.24, which is the highest correlation that occurs in the Basel II rules?

• Is the chosen correlation appropriate for the type of assets/obligors?

• Is the distribution of the realized number of defaults in the observed period of time appropriately calculated (eg, using the granularity adjustment approach)?
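The derivation of the critical values can be sketched as follows. This is not the granularity adjustment of Martin and Wilde (2002) nor the Beta approximation: as a simpler stand-in, the sketch integrates the conditional binomial distribution numerically over the systematic factor of the Vasicek one-factor model. Each critical value is taken as the smallest $c$ for which the probability of more than $c$ defaults does not exceed $100\%-\alpha$. The function names, the grid size and the example portfolio (200 obligors, PD of 2%, asset correlation of 0.12) are assumptions made for illustration.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def default_count_distribution(n: int, pd: float, rho: float, grid: int = 400):
    """P(D = k) for k = 0..n under a one-factor Vasicek model (midpoint rule)."""
    threshold = STD_NORMAL.inv_cdf(pd)
    log_binom = [lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1) for k in range(n + 1)]
    step = 16.0 / grid
    probs = [0.0] * (n + 1)
    for j in range(grid):
        x = -8.0 + (j + 0.5) * step                      # systematic factor value
        weight = STD_NORMAL.pdf(x) * step
        p_x = STD_NORMAL.cdf((threshold - sqrt(rho) * x) / sqrt(1.0 - rho))
        p_x = min(max(p_x, 1e-15), 1.0 - 1e-15)          # keep the logs finite
        for k in range(n + 1):
            probs[k] += weight * exp(log_binom[k] + k * log(p_x) + (n - k) * log(1.0 - p_x))
    total = sum(probs)
    return [q / total for q in probs]


def critical_value(probs, alpha: float) -> int:
    """Smallest c such that P(D > c) <= 1 - alpha."""
    cumulative = 0.0
    for c, q in enumerate(probs):
        cumulative += q
        if cumulative >= alpha:
            return c
    return len(probs) - 1


def traffic_light(defaults: int, c_low: int, c_high: int) -> str:
    if defaults < c_low:
        return "green"
    if defaults < c_high:
        return "yellow"
    return "red"


# Hypothetical portfolio: 200 obligors, PD estimate 2%, asset correlation 0.12.
dist = default_count_distribution(n=200, pd=0.02, rho=0.12)
c_low = critical_value(dist, 0.95)
c_high = critical_value(dist, 0.999)
print(c_low, c_high, traffic_light(6, c_low, c_high))
```

Recomputing the critical values for a range of asset correlations shows directly how strongly the assigned colour depends on the correlation assumption.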

#### Spiegelhalter test

The Spiegelhalter test is a calibration test that is conditional on the state of the economy. It has been noted by Tasche (2008) that PD estimates can be conditioned on the current state of the economy by including macroeconomic covariates in the regression (eg, the GDP growth rate). The resulting PD estimates are then point-in-time (PIT). For the PIT PDs, given the actual realization of the covariates, the assumption of independence of default events is appropriate (Upper and Worms 2004). This is due to the fact that the dependencies are captured by the incorporated macroeconomic covariates in the PD estimates (eg, unemployment rates). In contrast, the through-the-cycle (TTC) PD estimates are based on data representing a complete economic cycle and are not based on the current state of the economy (Carlehed and Petrov 2012). With reference to the TTC PD estimates, the assumption of independence of credit events is not valid.

In the context of validation, the mean squared error (MSE) can be computed for the Spiegelhalter test. Under the assumption of independence, given the score values and according to the central limit theorem, the distribution of the standardized MSE is approximately standard normally distributed under the null hypothesis (Breiman 1996). Thus, a joint test of the hypothesis stating that “the calibration of the PDs with respect to the score variable is correct” can be conducted (Spiegelhalter 1986). Similarly to the binomial test, the validation team should look at these specific aspects of default dependencies when validating the calibration using a Spiegelhalter test:

• Is the independence of defaults assumed correctly?

• Are the PDs estimated as PIT?

• Does the Spiegelhalter test remain powerful given the avoidance of the independence assumption?

• Are the null hypothesis and the underlying test specified correctly?

#### Normal approximate test

As for the binomial or Spiegelhalter tests, applying these to unconditional PD estimates may lead to results that are too conservative (Spiegelhalter et al 2002). Hereto, for unconditional PD estimates, if a time series of default rates is available, the assumption of independence over time can be justified. This is due to the fact that unconditional PD estimates (TTC PDs) are usually constant over time (Schuermann and Hanson 2004). Therefore, a simple test (normal approximate test) can be developed that ignores any assumption regarding the cross-sectional independence among obligors within a year.

For the normal approximate test, a fixed rating grade with $n_{t}$ obligors and $d_{t}$ defaulters in year $t$ is considered. In addition, it is assumed that the PD estimates are TTC, constant over time, and that defaults from different years are independent. Thus, the $d_{t}/n_{t}$ rate for a particular year is the realization of random variables. The standard deviation $\sigma$ of the default rates can, in this case, be estimated with the usual unbiased estimator.

Then, under the hypothesis that the true PD is not greater than $q$ (estimated TTC PD), and if the number of years for observations is not too small, the standardized average default rate is approximately standard normally distributed (Shapiro and Francia 1972). As a consequence, the hypothesis should be rejected if the average default rate is greater than $q$ plus a critical value derived by the approximation.

The main advantage of the normal approximate test is that it does not require any assumption regarding the cross-sectional independence of defaults (Longstaff et al 2005). Moreover, the test procedure remains robust against violations of assumptions of intertemporal independence when there is some weak dependence of defaults over time (Engelmann et al 2003). The most important assumption remains that the number of years should not be too small (Gordy 2000). Thus, the observations for time series should cover five to ten years. Given the above, the validation team should look at these specific aspects when validating the calibration using a normal approximate test:

• Is the time series of default rates available?

• Are the PD estimates TTC and constant over time?

• Is there a sufficient number of years for the observations?
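The decision rule of the normal approximate test can be sketched in a few lines. The sketch assumes a time series of annual default rates $d_{t}/n_{t}$ for one rating grade and a TTC PD estimate $q$; the function name, the 99% confidence level and the six-year series are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_approximate_test(default_rates, pd_estimate: float, confidence: float = 0.99):
    """Reject H0 'true PD <= pd_estimate' if the average default rate exceeds
    pd_estimate plus a critical margin from the normal approximation."""
    years = len(default_rates)
    average_rate = mean(default_rates)
    sigma = stdev(default_rates)  # square root of the unbiased variance estimator
    critical = pd_estimate + NormalDist().inv_cdf(confidence) * sigma / sqrt(years)
    return average_rate, critical, average_rate > critical


# Hypothetical annual default rates of one rating grade over six years.
rates = [0.010, 0.012, 0.008, 0.011, 0.009, 0.010]
print(normal_approximate_test(rates, pd_estimate=0.010))  # not rejected
print(normal_approximate_test(rates, pd_estimate=0.005))  # rejected
```

With only a handful of years the estimate of $\sigma$ is itself noisy, which is why the length of the time series is the first item the validation team should check.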

#### Vasicek test

Kealhofer (2003) argues that a very simple way of considering correlation between defaults is using the Vasicek one-factor model to estimate the nonconditional PD (that is, nonconditional to the economic cycle). Given the estimated PD and asset correlation $\rho$, the Vasicek one-factor model yields, asymptotically, the quantile for the observed default rate, which is a function of the cumulative standard normal distribution, and its inverse as well as the asset correlation (Vasicek 1976; Lando 2009).

The $\alpha\%$ confidence interval can then be constructed using the quantile (Lando 2009). When the observed default rate falls outside this confidence interval, a statistically significant difference is concluded according to the Vasicek test (Rösch 2005).

Das et al (2007) suggest that the asset correlation $\rho$ can be derived from the Basel II Accord. Further, since the Basel II correlations are conservative, it is possible to use half-correlations for the Vasicek test in order to obtain more strict results (Miu and Ozdemir 2006).

The major disadvantage of the Vasicek test is that an infinitely granular portfolio is assumed (Yazici and Yolacan 2007). For finite samples, one may use Monte Carlo simulation to calculate a more precise confidence interval (King et al 2000). Given the aforementioned disadvantage, the validation team should look at these specific aspects when validating the calibration using a Vasicek test:

• Is the portfolio of obligors large enough to assume that the granularity of the portfolio is infinite?

• Are the Basel II correlations too conservative for the obligors?
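The asymptotic quantile of the Vasicek one-factor model and the resulting interval check can be sketched as follows. The sketch assumes an infinitely granular portfolio, as the test itself does, and builds a two-sided interval, which is one possible reading of the confidence interval described above. The function names and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def vasicek_quantile(pd: float, rho: float, level: float) -> float:
    """Asymptotic quantile of the default rate at the given probability level."""
    z = (STD_NORMAL.inv_cdf(pd) + sqrt(rho) * STD_NORMAL.inv_cdf(level)) / sqrt(1.0 - rho)
    return STD_NORMAL.cdf(z)


def vasicek_test(observed_rate: float, pd: float, rho: float, confidence: float = 0.95):
    """Return the interval bounds and whether the observed rate falls outside."""
    lower = vasicek_quantile(pd, rho, (1.0 - confidence) / 2.0)
    upper = vasicek_quantile(pd, rho, (1.0 + confidence) / 2.0)
    return lower, upper, not (lower <= observed_rate <= upper)


# Hypothetical grade: PD estimate 2%, asset correlation 0.12, observed rate 3.5%.
print(vasicek_test(observed_rate=0.035, pd=0.02, rho=0.12))
# Halving the correlation narrows the interval and makes the test stricter.
print(vasicek_test(observed_rate=0.035, pd=0.02, rho=0.06))
```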

#### Brier score

The last test of calibration quality is the Brier score, which is also known as the MSE. The Brier score is a common way of verifying the accuracy of the probability forecast that relates to a specific event using the estimated probability, which is already known (Lichtenstein and Fischhoff 1980). The Brier score can only be used for binary outcomes, where there are only two possible realizations: default or nondefault (Rufibach 2010).

For $N$ obligors with individual PD estimates $p_{i}$ and $y_{i}$ being a default indicator that takes the value $1$ for default and $0$ for nondefault, the Brier score takes the following formula:

 $\text{MSE}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-p_{i})^{2}.$ (2.1)

The Brier score is small for high PD estimates assigned to defaults and low PD estimates assigned to nondefaults (Kruppa et al 2013). A low Brier score indicates the good calibration of a rating model. The Brier score can only tell us how accurate a forecast was; it cannot inform us about the accuracy of the forecast compared with anything else (Foster and Stine 2004). The validation team should look at this specific aspect when validating the calibration using a Brier score:

• Is the obtained score low enough to reach a conclusion on the model’s accuracy in predicting defaults?
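Equation (2.1) translates directly into code. The sketch below also computes the score of a naive forecast that assigns the observed default rate to every obligor; this benchmark is not part of the paper and is added here only as one possible reference point, given that the Brier score on its own offers no comparison. All names and figures are hypothetical.

```python
def brier_score(pds, defaults) -> float:
    """Mean squared difference between PD estimates and default indicators (0/1)."""
    return sum((y - p) ** 2 for p, y in zip(pds, defaults)) / len(pds)


def naive_brier_score(defaults) -> float:
    """Brier score of a constant forecast equal to the observed default rate."""
    rate = sum(defaults) / len(defaults)
    return rate * (1.0 - rate)


# Hypothetical obligors: PD estimates and realized default indicators.
pd_estimates = [0.01, 0.02, 0.05, 0.20, 0.40]
default_flags = [0, 0, 0, 0, 1]
print(brier_score(pd_estimates, default_flags), naive_brier_score(default_flags))
```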

#### 2.2.1 Calibration: empirical example

Adopting a more rigorous approach to the study’s matters, a selected test is applied to the backtesting data sets submitted by participating banks (see the online appendix for the submitted backtesting data set validated using the selected statistical tools). In doing so, this paper uses an empirical example to back recommendations for validating the backtesting process and to illustrate the points raised in the previous section.

The validation team is advised to utilize rival statistical tools in order to assess the calibration of a PD model. At this point, the paper recommends stating the null hypothesis that the calibration is equal to the prediction error; the Brier score is applied to test this hypothesis. With the aim of answering the question, often asked during TRIM inspections, of whether the obtained Brier score can be utilized to comment on a model’s calibration, this paper introduces specific traffic lights to map the Brier score (see Table 4).

Specifically, for the TRIM exercise, the validation team is required to have predefined actions in place that correspond to certain values of the Brier score. With this in mind, we suggest a PD model recalibration for orange and red traffic light flags. This paper argues that such traffic lights or, similarly, decision trees should be embedded in the relevant backtesting and validation policies in order to satisfy the TRIM inspection as well as facilitate the process of drawing conclusions with regard to the Brier score.

As an empirical example, the Brier score is applied to the backtesting data set containing sample credit portfolios of midcorporate obligors in order to measure the calibration of the PD models of Bank 1 and Bank 2. For the backtesting results of both PD models, the null hypothesis is tested using Spiegelhalter’s $z$-statistic, which allows for a formal assessment of the calibration of the targeted PD models:

 $Z(p,y)=\frac{\sum_{i=1}^{n}(y_{i}-p_{i})(1-2p_{i})}{\sqrt{\sum_{i=1}^{n}(1-2p_{i})^{2}p_{i}(1-p_{i})}}.$ (2.2)

The null hypothesis of perfect calibration is rejected at significance level $\alpha$ if the following applies, where $q_{1-\alpha/2}$ denotes the $(1-\alpha/2)$ quantile of the standard normal distribution:

 $|Z(p,y)|>q_{1-\alpha/2}.$ (2.3)
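Equations (2.2) and (2.3) can be sketched as follows. The sketch assumes independent defaults given the PD estimates, as the Spiegelhalter test does; the function names and the example portfolio are hypothetical and are not taken from the data sets of Bank 1 and Bank 2.

```python
from math import sqrt
from statistics import NormalDist


def spiegelhalter_z(pds, defaults) -> float:
    """Spiegelhalter's z-statistic as in (2.2); undefined if every PD equals 0.5."""
    numerator = sum((y - p) * (1.0 - 2.0 * p) for p, y in zip(pds, defaults))
    variance = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in pds)
    return numerator / sqrt(variance)


def reject_perfect_calibration(z: float, alpha: float = 0.05) -> bool:
    """Two-sided decision rule as in (2.3)."""
    return abs(z) > NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Hypothetical grade: ten obligors with a PD estimate of 20%, two of which default.
pds = [0.2] * 10
flags = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
z = spiegelhalter_z(pds, flags)
print(z, reject_perfect_calibration(z))
```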

Table 5 reports the validation results from the adoption of the rival statistical tool, namely the Brier score. Normally, the values of the Brier score would be sufficient to reach a conclusion on the model’s accuracy in predicting defaults. However, given the stringent requirements of the TRIM framework, the validation team is advised to report Spiegelhalter’s $z$-statistic for the calculated Brier scores. This serves to confirm that the reported Brier scores are statistically significant.

Based on the reported test results for the Brier score and the corresponding Spiegelhalter’s $z$-statistic, this paper confirms that the PD models of both Bank 1 and Bank 2 pass the calibration backtest. Bank 1’s PD model appears to be slightly more accurate, given the lower value of the Brier score. Further, Spiegelhalter’s $z$-statistic indicates better PD model calibration for Bank 1. The empirical results reject the null hypothesis that the calibration is equal to the prediction error, meaning that the PD models of both banks are not perfectly calibrated. However, the low values of the Brier scores, falling in the dark green traffic light category (DG), imply that recalibration is not needed.

### 2.3 Discrimination

In backtesting, discrimination constitutes the assessment of the ability of a model to assign low estimates to bad realized observations, and vice versa (Castermans et al 2010). The correct discrimination (discriminatory power) of an internal model means that the model adheres to standards of objectivity, accuracy, stability and conservatism (Prorokowski 2016; Prudential Regulation Authority 2013). In addition, regulators are increasingly demanding that banks provide evidence that the logic and quality of their rating systems (including the quantification process) adequately discriminate between different levels of model estimates (Basel Committee on Banking Supervision 2001, 2010).

The discriminatory power of a PD model denotes its ability to discriminate ex ante between defaulting and nondefaulting borrowers (Prorokowski 2016). The discriminatory power can be assessed using a number of statistical measures of discrimination:

• the cumulative accuracy profile (CAP);

• the accuracy ratio (AR);

• the receiver operating characteristic (ROC);

• the area under the ROC curve (AUROC);

• the Kappa coefficient; and

• the gamma coefficient.

#### The cumulative accuracy profile and the accuracy ratio

The CAP is commonly known as the Gini curve, the power curve or the Lorenz curve. The CAP is a visual tool for two representative samples of scores for defaulted and nondefaulted borrowers (Cheng and Neamtiu 2009).

The validation of a rating model performance is done by comparing the CAP curve of the rating model with the CAP curve of a perfect model as well as the CAP curve of a random (imperfect) model (Gaillard 2014). The CAP curve of the rating model usually lies between that of the perfect model and the random model, occupying some part of the area formed by the CAP curve of the perfect model and that of the random model (Kalapodas and Thomson 2006). The percentage area occupied by the CAP curve of the rating model is called the AR. If the AR value is around 40–60%, a model is considered to have fair predictive power (Rezac and Rezac 2011).

Practical experience shows that the AR has a tendency to take values in the range of 50–80%. However, such observations should be interpreted with care, as they seem to strongly depend on the composition of the portfolio and the numbers of defaulters in the samples. Further, the shape of the CAP depends on the proportion of solvent and insolvent borrowers in the sample (Van Der Burgt 2008). Hence, a visual comparison of CAPs across different portfolios may be misleading. Given the above, the validation team should look at these specific aspects when validating the discrimination using the CAP curve and the AR:

• Are the obligors and their rating pools sorted from worst to best in terms of grades, with the cumulative number of obligors and the cumulative number of defaults found for each grade?

• Is the AR interpreted appropriately?

• Is the CAP compared with other CAPs appropriately?

• Is the perfect model appropriately specified? (A perfect rating model would be one in which all of the defaults emerge from the most inferior grade.)

• Are the defaults in the random model distributed appropriately (eg, equally)?

#### The receiver operating characteristic curve and the area under the curve

The construction of the ROC curve is based on possible distributions of rating scores for defaulting and nondefaulting debtors. In a perfect rating model, the defaulting and nondefaulting distributions should be separated (Bauer and Agarwal 2014). In the banking industry, however, a perfect rating system is not possible. Thus, banks are advised by Prorokowski (2016) to use a cutoff value in order to ensure that debtors graded below the cutoff point are flagged as potential defaulters while debtors with rating scores above this value are flagged as nondefaulters.

The ROC curve is a plot of the hit rate versus the false alarm rate. Fawcett (2004) explains that a rating model’s performance is better when the ROC curve’s position is closer to the $y$-axis. Similarly, the larger the AUROC, the better the model. A random model with no discriminative power would generate a ROC curve for which the AUROC would be 50% (Pepe 2000). By implication, in a perfect model, this area would be 100%. The area for a credit risk model is expected to range between 50% and 100%.

Marzban (2004) points out that both the AR and the AUROC are random variables that depend on the portfolio structure, the number of defaulters, the number of rating grades and other factors. Each metric is influenced by what it is measuring and should not be interpreted without knowledge of the underlying portfolio (Fan et al 2006). Therefore, the AR and the AUROC are not comparable across models developed from different samples. Given the above, the validation team should look at these specific aspects when validating the discrimination using the ROC curve, AUROC and the AR:

• Is the ROC curve interpreted correctly given the fact that banks can only report the confidence interval of the AR to test if a rating system has any discriminative power?

• Does the sample contain enough defaulters?
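
The checks above can be supported by recomputing the statistics directly. The following minimal sketch estimates the AUROC as the share of defaulter/nondefaulter pairs in which the defaulter received the higher PD estimate (ties counted as one half) and derives the AR from the standard relationship $\text{AR}=2\,\text{AUROC}-1$. The function names and the sample data are hypothetical.

```python
def auroc(pd_estimates, defaults):
    """Share of (defaulter, nondefaulter) pairs ranked correctly; ties count one half."""
    defaulters = [p for p, y in zip(pd_estimates, defaults) if y == 1]
    survivors = [p for p, y in zip(pd_estimates, defaults) if y == 0]
    if not defaulters or not survivors:
        raise ValueError("the sample needs at least one defaulter and one nondefaulter")
    correct = 0.0
    for pd_default in defaulters:
        for pd_survivor in survivors:
            if pd_default > pd_survivor:
                correct += 1.0
            elif pd_default == pd_survivor:
                correct += 0.5
    return correct / (len(defaulters) * len(survivors))

def accuracy_ratio(pd_estimates, defaults):
    """AR derived from the AUROC: 0 for a random model, 1 for a perfect model."""
    return 2.0 * auroc(pd_estimates, defaults) - 1.0

# Hypothetical sample: higher PD estimate means a riskier obligor.
pd_estimates = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30]
defaults = [0, 0, 1, 0, 1, 1]

# 8 of the 9 defaulter/nondefaulter pairs are ranked correctly.
print(auroc(pd_estimates, defaults), accuracy_ratio(pd_estimates, defaults))
```

The explicit error for a sample without defaulters mirrors the second check in the list: with very few defaults, both statistics rest on a handful of pairs and should be read together with a confidence interval.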

#### The Kappa coefficient

The Kappa statistic is used to measure inter-rater reliability or to determine the agreement between two rating models (Carletta 1996). This measure usually aims to compare the agreement between internal and external ratings or between rival models. However, as noted by Wu and Olson (2015), discrimination backtesting serves to ensure that risk is consistently rank-ordered by the model; hence, a perfect agreement between risk assessments is not required. The Kappa coefficient is defined by the difference between the observed agreement between the two series of ratings and the “chance of agreement” probability, scaled by one minus the chance of agreement probability. Kappa coefficients are interpreted using the guidelines outlined by Landis and Koch (1977). The validation team should look at these specific aspects when validating the discriminatory power using a Kappa coefficient:

• Is the strength of the Kappa coefficient interpreted correctly? (The value of 0.01–0.20 should be interpreted as “slight agreement”.)

• Is the probability of a random agreement deducted appropriately (ie, from the margin of the confusion matrix)?
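
To make both checks concrete, the following minimal sketch computes Cohen's Kappa for two series of ratings, deriving the chance agreement from the margins of the confusion matrix, and maps the result to the Landis and Koch (1977) bands. The function names and the four sample ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    margin_a, margin_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from the row and column margins of the confusion matrix.
    chance = sum(margin_a[grade] * margin_b[grade] for grade in margin_a) / n ** 2
    # chance equals 1 only if both series assign one identical grade to everyone
    return (observed - chance) / (1 - chance)

def landis_koch_label(kappa):
    """Strength-of-agreement bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical internal and external ratings for four obligors.
internal = ["A", "A", "B", "B"]
external = ["A", "A", "B", "A"]

kappa = cohen_kappa(internal, external)
print(kappa, landis_koch_label(kappa))
```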

#### The gamma coefficient

The gamma coefficient is a measure of rank correlation, calculating the difference between the number of concordant and discordant pairs of ratings issued by two different rating processes (eg, two different rating models). The gamma coefficient tells us the proportionate excess of concordant over discordant pairs among all pairs that are fully discriminated or ranked (Rousson 2007).

In practice, a pair of ratings is defined as two ratings $(R_{x1},R_{x2})$ assessed by the same process ($x$) on two different obligors. The pairs of ratings $(R_{x1},R_{x2})$ and $(R_{y1},R_{y2})$ assessed by two different processes ($x$ and $y$) are then compared. The pairs are concordant if $(R_{x1}<R_{x2})$ and $(R_{y1}<R_{y2})$ or $(R_{x1}>R_{x2})$ and $(R_{y1}>R_{y2})$. The pairs are discordant if $(R_{x1}<R_{x2})$ and $(R_{y1}>R_{y2})$ or $(R_{x1}>R_{x2})$ and $(R_{y1}<R_{y2})$.
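
The pair comparison described above can be sketched as follows. The function name and the sample ratings are hypothetical, ratings are assumed to be coded as numeric grades, and pairs tied under either process are left out of both counts.

```python
from itertools import combinations

def count_pairs(ratings_x, ratings_y):
    """Count concordant and discordant obligor pairs for two rating processes."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(ratings_x, ratings_y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both processes order the two obligors the same way
        elif direction < 0:
            discordant += 1   # the processes order the two obligors differently
        # direction == 0: tied under at least one process, not counted
    return concordant, discordant

# Hypothetical numeric grades (1 = best) from processes x and y for four obligors.
ratings_x = [1, 2, 3, 4]
ratings_y = [1, 3, 2, 4]

print(count_pairs(ratings_x, ratings_y))
```

For large portfolios this pairwise loop grows quadratically with the number of obligors, so the counts are usually taken from the cross-tabulation of the two ratings instead.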

As noted by Altman and Sabato (2007), the gamma test for significance works like most other hypothesis tests, where the null hypothesis states that there is a difference in the populations. The $z$-value of the gamma coefficient is compared with the critical value ($z$-critical from the standard normal $z$ tables). If the critical value is higher, then one cannot reject the null hypothesis that there is a difference in the populations (Rousson 2007).

The validation team should understand that Goodman and Kruskal’s gamma is generally preferred for cases in which there are many tied ranks. It is also particularly useful when the data has outliers, as they do not affect the results to a great extent. The validation team should look at the following specific aspect when validating the discriminatory power using a gamma coefficient:

• Are there enough tied observations for ranks?

#### 2.3.1 Discrimination: empirical example

In practice, the discriminatory power of the midcorporate PD models is assessed by means of the AUROC, CAP and AR (Prorokowski 2016). As a rival statistical tool to the aforementioned methods, this paper proposes Goodman and Kruskal’s gamma to validate the discriminative power of a PD model. Applying the gamma statistical tool to the backtesting data sets submitted by participating banks (see the online appendix for the submitted backtesting data set that is validated using the selected statistical tools), this paper shows empirically the process of validating the discrimination backtest.

The validation team is advised to employ Goodman and Kruskal’s gamma in order to assess the consistency of allocating good ratings to less risky obligors and bad ratings to very risky obligors. At this point, the paper recommends stating the null hypothesis that the validated PD model discriminates between very risky and less risky obligors, with the gamma coefficient being applied to test this hypothesis. The $z$-value of the gamma coefficient is compared with the critical value ($z$-critical in the standard normal $z$ tables). If the critical value is higher, one cannot reject the null hypothesis. The gamma coefficient is specified by the following formula:

 $\gamma=\frac{N_{\mathrm{c}}-N_{\mathrm{d}}}{N_{\mathrm{c}}+N_{\mathrm{d}}}.$ (2.4)

Under the TRIM framework, the results of the gamma coefficient should be interpreted according to the predefined levels included in relevant internal policies (eg, backtesting policy). With this in mind, the paper suggests describing the gamma coefficient using a traffic light approach, as shown in Table 6.

Especially for the TRIM exercise, the validation team is required to demonstrate an understanding of the utilized statistical tools by implementing a framework containing specific actions related to various levels of the reported discrimination results. Against this backdrop, the paper suggests a PD model redevelopment for the orange and red traffic light flags.

As an empirical example, Goodman and Kruskal’s gamma is applied to the backtesting data set containing the sample credit portfolios of the midcorporate obligors in order to measure the discriminatory power of the PD models of Bank 1 and Bank 2. For the backtesting results of both PD models, the null hypothesis is tested by comparing the $z$-value computed in (2.5) with the $z$-critical values from the standard normal $z$ tables; if the $z$-value is lower, one cannot reject the null hypothesis that there is a difference in the populations (the rating discriminates between very risky and less risky obligors):

 $z=\gamma\sqrt{\frac{N_{\mathrm{c}}+N_{\mathrm{d}}}{N(1-\gamma^{2})}}.$ (2.5)

In order to illustrate the use of the gamma coefficient for validating the discrimination backtest, the backtesting data set (see the online appendix) containing the PD estimates and associated ratings, assigned by the two models of Bank 1 and Bank 2, is divided into good and bad obligors. At this point, the distinction between good and bad obligors is set at the cutoff point of a noninvestment grade bond with a rating below BBB$-$. In doing so, the obligors with adequate capacity to meet their financial commitments are separated from the obligors that are vulnerable in the near term and face major ongoing uncertainties, leaving them unable to meet their financial commitments. Table 7 presents the confusion matrix, with the cutoff point being a noninvestment grade rating of below a BBB$-$ grade.

An analysis of Table 7 reveals the following facts, placing the gamma coefficient in the green category of the predefined traffic lights.

Bank 1 PD model.
• The number of concordant pairs, $N_{\mathrm{c}}$, is 152 100.

• The number of discordant pairs, $N_{\mathrm{d}}$, is 28 215.

• The gamma coefficient, $\gamma$, is 0.687047.

Bank 2 PD model.
• The number of concordant pairs, $N_{\mathrm{c}}$, is 136 694.

• The number of discordant pairs, $N_{\mathrm{d}}$, is 26 520.

• The gamma coefficient, $\gamma$, is 0.675027.

Normally, interpreting the above values of the gamma coefficient would be sufficient to conclude on the model’s discriminatory power. Under the TRIM framework, however, the validation team is required to report $z$-values for the calculated gamma coefficients in order to prove that the reported results are statistically significant. In this example, the $z$-values (6.756 for the Bank 1 PD model and 6.219 for the Bank 2 PD model) are significantly larger than the $z$-critical values from the standard normal $z$ tables. Thus, there are no grounds to assume the model discriminates between very risky and less risky obligors, and the null hypothesis can be rejected.
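
The reported gamma coefficients can be recomputed from the pair counts listed above using (2.4). The sketch below also implements (2.5); because the value of $N$ is not stated in this section, the helper takes it as an argument, and it is assumed here to be the number of observations in the sample. The function names are hypothetical.

```python
import math

def gamma_from_counts(n_c, n_d):
    """Equation (2.4): proportionate excess of concordant over discordant pairs."""
    return (n_c - n_d) / (n_c + n_d)

def gamma_z_value(gamma, n_c, n_d, n_obs):
    """Equation (2.5); n_obs plays the role of N (assumed: number of observations)."""
    return gamma * math.sqrt((n_c + n_d) / (n_obs * (1 - gamma ** 2)))

# Pair counts reported for the two PD models.
reported_counts = {"Bank 1": (152_100, 28_215), "Bank 2": (136_694, 26_520)}

gammas = {bank: gamma_from_counts(n_c, n_d)
          for bank, (n_c, n_d) in reported_counts.items()}
for bank, value in gammas.items():
    print(bank, round(value, 5))
```

Recomputing the coefficient in this way is a quick consistency check between the confusion matrix, the pair counts and the reported statistic.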

Given the above, this paper recommends incorporating Yule’s $Q$ as the measure of discriminatory power and association between bad/good differentiations and defaults. Yule’s $Q$ is always a number between $-1$ and $1$ and constitutes the statistical measure of discrimination for a $2\times 2$ confusion matrix. Unlike the gamma coefficient, Yule’s $Q$ does not require tests of significance, and hence the TRIM inspection does not insist on providing a proof of robustness for the Yule’s $Q$ findings. However, Yule’s $Q$ should be referenced according to predefined standards (eg, traffic lights). Table 8 provides suggestions for referencing Yule’s $Q$.

For the TRIM exercise, the validation team is required to demonstrate an understanding of the utilized Yule’s $Q$ tests by implementing a framework containing specific actions related to various levels of the reported discrimination results. Against this backdrop, this paper suggests redeveloping the PD model for orange and red traffic light flags as well as monitoring the discriminatory power of yellow flags with an assessment of the risk of deterioration to orange/red flags.

As an empirical example, Yule’s $Q$ is applied to a backtesting data set (see the online appendix) containing PD estimates and associated ratings assigned by the two models of Bank 1 and Bank 2. The cutoff point for noninvestment grade bonds, with a rating below BBB$-$, has been chosen to distinguish between good and bad obligors (Table 7). Because Yule’s $Q$ coincides with the gamma coefficient for a $2\times 2$ confusion matrix, the reported results are the same.

Bank 1 PD model.
• Yule’s $Q$ is 0.687047, indicating a very strong association between good/bad ratings and defaults (green flag of the predefined traffic lights).

Bank 2 PD model.
• Yule’s $Q$ is 0.675027, indicating a very strong association between good/bad ratings and defaults (green flag of the predefined traffic lights).
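
For a $2\times 2$ confusion matrix, Yule’s $Q$ can be computed directly from the four cell counts. In the minimal sketch below, the cell labels and the counts are hypothetical and do not reproduce Table 7; the product of the diagonal cells gives the number of concordant pairs and the product of the off-diagonal cells gives the number of discordant pairs, which is why $Q$ coincides with the gamma coefficient in this setting.

```python
def yules_q(bad_default, bad_no_default, good_default, good_no_default):
    """Yule's Q = (ad - bc) / (ad + bc) for a 2x2 confusion matrix."""
    concordant = bad_default * good_no_default    # a * d
    discordant = bad_no_default * good_default    # b * c
    if concordant + discordant == 0:
        raise ValueError("Q is undefined when both cross products are zero")
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical counts: rows are bad/good ratings, columns are default/no default.
q = yules_q(bad_default=30, bad_no_default=70, good_default=10, good_no_default=390)
print(round(q, 4))
```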

### 2.4 Stability

Stability backtesting measures the changes in a population over time that affect the appropriateness of an internal model (Lima et al 2011). Academics and practitioners recognize two types of stability: stability over time, discussed by Nickell et al (2000); and stability over groups, researched by Dambolena and Khoury (1980). As noted by De Haas and Van Lelyveld (2006), models are often intended to be used for predictions, but predictions are only valid if parameters are stable over time.

Stability backtesting of the PD model involves many aspects, for example, analyzing the stability of the portfolio on which the model is applied, the distribution of the variables of the underlying model, the model parameter stability and the stability of the model outputs (Mayne et al 2000). The stability can be assessed using a number of statistical measures:

• the population stability index (PSI);

• the Kruskal–Wallis test; and

• the rating migration matrix.

#### The population stability index

The change in the structure of the population is measured by the PSI. According to Wu and Olson (2010), the PSI aims to compare two series of observations classified into a number of risk classes. In doing so, the PSI quantifies shifts in population dynamics over time. As models are based on historical data sets, Sousa et al (2015) argue that it is necessary to ensure present-day population features are sufficiently similar to the historical population from which the model is developed. As explained by Karakoulas (2004), a higher PSI corresponds to greater shifts in population. Generally, PSI values of greater than 0.25 indicate a significant shift in the population, while values of less than 0.10 indicate a minimal shift in the population. Values of between 0.10 and 0.25 indicate a minor shift.

Comparing two populations visually is a good place to start with the PSI. However, the validation team should also look at these specific aspects when validating the stability using a PSI:

• Is the generally accepted formula for the PSI used? For example,

 $\text{PSI}=\sum\left((\text{realized PD\%}-\text{expected PD\%})\ln\left(\frac{\text{realized PD\%}}{\text{expected PD\%}}\right)\right).$
• Is the PSI value interpreted correctly with respect to minimal, minor and significant shifts in the population?

#### The Kruskal–Wallis test

The academic literature recognizes that additional stability tests can be applied to test the stability of the variable over time, especially the Wilcoxon signed-rank test for comparing two different periods or the Friedman test for comparing several periods (MacFarland and Yates 2016). Notably, the Kruskal–Wallis test can be performed for independent information (Vargha and Delaney 1998). The Kruskal–Wallis test allows us to perform a distribution comparison between more than two samples. Most importantly, as noted by Feir-Walsh and Toothaker (1974), the variables do not need to be paired.

The Kruskal–Wallis test is a nonparametric (distribution free) test and is used when the assumptions of one-way analysis of variance (ANOVA) are not met (Hodges and Lehmann 1956). Both the Kruskal–Wallis test and one-way ANOVA assess significant differences between a continuous dependent variable and a categorical independent variable (with two or more groups; Chan and Walmsley 1997). When using the Kruskal–Wallis test, there is no need to assume that the dependent variable is normally distributed or that the variance in the scores is approximately equal across groups (Lix et al 1996). Therefore, the Kruskal–Wallis test can be used for both continuous and ordinal-level dependent variables.

As highlighted by Hecke (2012), if the Kruskal–Wallis test’s value is close to 0, the average rank of a variable is close to $(n+1)/2$. Against this backdrop, each average rank of a given variable is close to the global average rank, so the variables would all be linked. As suggested by Hecke (2012), the higher the metric gets, the more significant the difference between at least two of the variables.

The validation team should understand that using the Kruskal–Wallis test is optional: as a nonparametric test, it is generally less powerful than a parametric alternative whose assumptions are met. The validation team should also look at these specific aspects when validating the stability using a Kruskal–Wallis test:

• Is the null hypothesis stated correctly (ie, do the variables share the same distribution)?

• Is the null hypothesis rejected for a $p$-value below 5% with the correct conclusion that there is a difference between at least two variables?
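
As an illustration of the test, the following minimal sketch computes the Kruskal–Wallis statistic with the usual correction for ties. The function names and the yearly samples are hypothetical. For three groups, the statistic is compared with a chi-square distribution with two degrees of freedom, whose upper tail probability is $\exp(-H/2)$; for other numbers of groups, a chi-square table with $k-1$ degrees of freedom would be needed.

```python
import math

def average_ranks(values):
    """Return 1-based ranks (ties share the average rank) and the tie term."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_term = 0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        size = j - i + 1
        tie_term += size ** 3 - size
        i = j + 1
    return ranks, tie_term

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of independent samples."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks, tie_term = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # The correction is undefined if every observation has the same value.
    return h / (1 - tie_term / (n ** 3 - n))

def p_value_three_groups(h):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-h / 2)

# Hypothetical values of one model variable observed in three different years.
samples = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(samples)
print(h, p_value_three_groups(h) < 0.05)
```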

#### The rating migration matrix

The reporting of migration matrixes for PD ratings offers information on the stability of the rating. A stable migration matrix and good discrimination indicate that the model is able to predict defaults over a long time horizon (Jafry and Schuermann 2004). Unstable migration matrixes may indicate rather volatile ratings (Fei et al 2012). The rating migration matrix can be utilized without reporting any defaults within the time frame of the validation analysis (Gavalas and Syriopoulos 2014). The matrix serves to highlight any deterioration of PD estimates in the underlying portfolio.

All in all, Bangia et al (2002) argue that the rating migration matrix serves to capture changes to default probabilities in more detail. The validation team should understand that analysis using the migration matrix is only conducted on entities that have a PD estimate calculated on an ongoing basis for each consecutive time period in the backtesting sample. Thus, the newly added obligors that have not been rated previously should be excluded from the analysis, as there is no point of reference to past estimates calculated for these obligors.

Further, the validation team should seek to justify any significant increase or decrease in the ratings of the obligors in the migration matrix. As noted by Prorokowski (2016), an attempt should be made to link the deterioration to economic factors. Therefore, the validation team should look at these specific aspects when validating the stability using migration matrixes for PD ratings:

• Is there a macroeconomic factor causing the rating migration?

• Are the new obligors removed from the sample?

• What is the total percentage of ratings that have been upgraded/downgraded during the analyzed time frame?
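
The last two checks lend themselves to a simple recomputation. The following minimal sketch builds a migration matrix of counts from the ratings in two consecutive periods, drops obligors without a rating in the earlier period and reports the shares of upgrades and downgrades. The rating scale, the function name and the sample obligors are hypothetical.

```python
GRADES = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC"]  # hypothetical scale, best to worst

def migration_summary(previous, current, grades=GRADES):
    """Return the migration matrix of counts and the upgrade/downgrade shares."""
    position = {grade: i for i, grade in enumerate(grades)}
    # New obligors have no reference rating and are excluded from the analysis.
    tracked = [obligor for obligor in current if obligor in previous]
    if not tracked:
        raise ValueError("no obligor is rated in both periods")
    matrix = [[0] * len(grades) for _ in grades]
    upgrades = downgrades = 0
    for obligor in tracked:
        row, col = position[previous[obligor]], position[current[obligor]]
        matrix[row][col] += 1      # rows: previous rating, columns: current rating
        if col < row:
            upgrades += 1
        elif col > row:
            downgrades += 1
    return matrix, upgrades / len(tracked), downgrades / len(tracked)

# Hypothetical ratings; obligor "o5" is new in the current period.
previous = {"o1": "A", "o2": "BBB", "o3": "BB", "o4": "A"}
current = {"o1": "A", "o2": "BB", "o3": "BBB", "o4": "BBB", "o5": "AA"}

matrix, upgraded, downgraded = migration_summary(previous, current)
print(upgraded, downgraded)
```

A large share of downgrades in a single period would then be examined against macroeconomic indicators for that period, as suggested above.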

#### 2.4.1 Stability: empirical example

In practice, the stability of the midcorporate PD models is assessed by means of the PD rating migration matrix as well as analysis of the changing macroeconomic conditions and the changing business landscape. The paper suggests the PSI coupled with detailed rating migration matrixes as a rival statistical tool to these common methods.

The validation team is advised to apply PSI measures in order to investigate how much the underlying portfolio of obligors has changed over time, rendering a credit risk model obsolete and inadequate. The following formula for the PSI is proposed:

 $\text{PSI}=\sum_{i}\left((\text{PD}_{i,\mathrm{actual}}\%-\text{PD}_{i,\mathrm{expected}}\%)\ln\left(\frac{\text{PD}_{i,\mathrm{actual}}\%}{\text{PD}_{i,\mathrm{expected}}\%}\right)\right),$ (2.6)

where $i$ denotes the rating class (eg, AAA), $\text{PD}_{i,\mathrm{actual}}$ is the actual distribution of obligors per rating $i$ and $\text{PD}_{i,\mathrm{expected}}$ is the expected distribution of obligors per rating $i$ taken from the reference year (that is, the previous year).
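
A minimal sketch of (2.6) is given below. It takes the number of obligors per rating class in the reference year and in the current year, converts the counts to shares and sums the contributions per class. The labels follow the commonly used PSI cutoffs of 0.10 and 0.25 rather than the traffic light colors of Table 9; the function names and the sample counts are hypothetical.

```python
import math

def population_stability_index(actual_counts, expected_counts):
    """PSI following (2.6), computed from obligor counts per rating class."""
    total_actual = sum(actual_counts.values())
    total_expected = sum(expected_counts.values())
    psi = 0.0
    for grade in set(actual_counts) | set(expected_counts):
        actual = actual_counts.get(grade, 0) / total_actual
        expected = expected_counts.get(grade, 0) / total_expected
        if actual == 0 or expected == 0:
            # The logarithm is undefined for an empty class; merging sparse
            # classes beforehand is one possible (assumed) treatment.
            raise ValueError(f"rating class {grade} is empty in one of the periods")
        psi += (actual - expected) * math.log(actual / expected)
    return psi

def shift_label(psi):
    """Commonly used PSI bands for the size of the population shift."""
    if psi < 0.10:
        return "minimal shift"
    if psi <= 0.25:
        return "minor shift"
    return "significant shift"

# Hypothetical obligor counts per rating class.
expected = {"A": 50, "B": 30, "C": 20}   # reference (previous) year
actual = {"A": 40, "B": 30, "C": 30}     # current year

psi = population_stability_index(actual, expected)
print(round(psi, 4), shift_label(psi))
```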

Under the TRIM framework, the results of the PSI should be explained with reference to predefined levels, which allow for actionable decisions to be taken. Against this backdrop, the paper suggests referencing the traffic light thresholds when describing the PSI, as shown in Table 9.

Normally, population stability is not considered a prerequisite for models to exhibit good risk measurement and predictive power abilities. Some credit risk models are developed on different portfolios due to data unavailability and other constraints faced at the model-building phase. Further, the portfolio of obligors can change due to the new investment strategies of a bank or regulatory and business requirements. Thus, the primary aim of the PSI is to monitor, track and highlight changes in the underlying portfolio. However, the TRIM exercise requires that credit risk models reflect any changes to the underlying portfolios of obligors. Models assigned to the orange and red traffic light categories for the PSI would be rejected or challenged by the TRIM inspectors.

As an empirical example, the PSI is calculated per rating class for the portfolio of midcorporate obligors contained in the online appendix. Table 10 shows the PSI calculated for the Bank 1 PD model and Table 11 shows the PSI calculated for the Bank 2 PD model. The PSI results are presented with reference to the traffic light flags (see the note to Table 10).

The population stability analysis reveals that 2013 saw a minor shift in the population, with elevated PSI levels reported for both models in this year. This similarity in PSI levels is expected, as the PD models are applied to the same portfolio of obligors. Especially for the TRIM requirements, the validation team should provide comments on the instances where a yellow traffic flag (Y) is applied to the reported PSI values. Further, a graphical illustration of the rating distributions per year should accompany a population stability analysis in order to satisfy the TRIM inspection.

## 3 Practical recommendations

Summarizing the previous section, several recommendations for validation teams are put forward in this paper to address the TRIM requirements. As a prerequisite, the validation team should look into the quality of the data as well as the directory and databases being used to create the backtesting data set. The validation team should perform the following steps:

• confirm whether the backtesting data set is subject to a data policy with defined standards for ensuring data quality;

• assess whether these data quality standards are adequate for the backtesting data set and suggest policy/standard improvements if needed;

• assess the alignment of the creation process for the backtesting data set with data quality standards;

• confirm whether a data directory exists that provides all definitions of the data items in the backtesting data set;

• check the data directory for completeness and accuracy regarding all definitions of data items;

• provide a general description of the databases used to obtain the backtesting data set (including but not limited to information about the relational database model, data owners, data sources, data processes and data controls); and

• provide a graphical illustration of the relationships between data sources, databases and the backtesting data set.

In addition, for a detailed validation of the backtesting data set, we advise that the validation team analyze the type of data used. This is due to the fact that the TRIM guidelines require external data to be subject to similar quality controls to those applied to internal data. Hereto, the validation team should perform the following steps:

• confirm if the data used for the creation of the backtesting data set stems from internal or external sources; and

• assess if the choice of using external data is justified (eg, the external data provides more objectivity or is more granular).

For the assessment of backtesting frequency, the validation team should understand that the TRIM guidelines require backtesting to be conducted at least annually. With this in mind, the validation team should perform the following steps:

• confirm the appropriateness of the backtesting frequency (at least once annually);

• confirm the appropriateness of the size of the observation window in order to ensure the poor performance of the model is not masked; and

• assess if the number of grades/ratings allows for a meaningful differentiation of risk.

As far as the validation of the calibration backtest is concerned, for the analysis of the discrepancies between the internal PD model’s estimates and the realized observations of defaults, the validation team is advised to perform the following steps:

• confirm if the grades/ratings are composed of homogeneous groups of obligors (PD);

• assess if the inclusion of the most recent data affects the model output to a significant degree;

• assess the difference between the estimates and realized observations;

• confirm if there is an undue concentration in a grading/rating that might affect the calibration backtest;

• confirm the appropriateness of the hypotheses stated for the applied statistical tests of calibration;

• assess the power of the calibration test given the assumptions regarding the independence of defaults;

• arrange the data in accordance with the requirements of a particular calibration test (eg, the data should be rank-ordered for the Hosmer–Lemeshow test);

• confirm that there are enough observations to obtain meaningful results; and

• confirm if the chosen test should be complemented by other tests.

For the analysis of the discriminatory power of the PD models, the TRIM guidelines state that the validation should encompass the PD model as well as individual risk factors and other subsets related to the analyzed model. Against this backdrop, the validation team is advised to perform the following steps while validating the discriminatory power:

• interpret the curves with the associated statistics (eg, AR);

• confirm if the data set contains a sufficient number of defaults in order to make meaningful conclusions;

• differentiate between model discrimination and model accuracy (eg, differentiate between the correlation statistics indicating discriminatory power and the model’s accuracy);

• confirm if the probability of the random agreement is derived appropriately; and

• confirm the appropriateness of the discrimination test given the number of tied ranks.

For validating the stability backtest, the TRIM guidelines require two types of analysis to be conducted. First, the validation should assess the stability of the ratings and risk factors over time and provide detailed justifications for any deviations. Second, under the TRIM rules, the stability of the PD model should be checked, with attention being paid to reviewing the weights assigned to input variables. With this in mind, the validation team should perform the following steps:

• assess the power of the statistical test based on its assumptions and determine if additional analysis is required to reach a conclusion on the stability;

• explain, using macroeconomic indicators, any changes in the PD ratings; and

• assess both the stability of model inputs (factors) and model outputs over time.

Finally, the TRIM inspection requires the validation team to use predefined thresholds and to calculate an expected range of estimation error for every parameter estimate. This paper provides reference mapping to the used statistical metrics and advocates the use of the traffic light approach in this space. However, it is at the credit institution’s discretion to develop its own predefined thresholds for the reported statistical measures. At this point, for the purpose of the TRIM exercise, a margin of conservatism should be maintained when referencing any predefined levels to validation outcomes.

## 4 Conclusions

This paper has provided an explicit view of the statistical tests and additional analyses the validation team should utilize when validating the backtesting process under the TRIM guidelines.

It is assumed in this paper that the backtesting process consists of three major parts.

• Calibration: assessment of the deviation of an internal model’s estimates from realized observations.

• Discrimination: assessment of the extent to which bad realized observations are assigned low internal model estimates and good realized observations are assigned high ones.

• Stability: assessment of the changes in the population over time that affect the appropriateness of the internal model.

The suggestions put forward in this paper are not exhaustive, and the validation team should not be limited to utilizing only the tests mentioned in this study. Further, one should expect that the recommendations contained herein are subject to future review, based on how quickly the regulatory landscape is changing.

These recommendations for the validation of the backtesting process do not include clearly formulated decision criteria for the treatment of statistical tests’ results, which would lead to model recalibration or redevelopment. With this in mind, a future study could address this gap, with practical implications for the banking industry.

The scope of this paper is limited to PD models. However, as noted by the Basel Committee on Banking Supervision (2005b), LGD models should also be backtested on a regular basis. Overall, the validation of LGD models includes the examination of both the model design and the LGD estimates produced for the key risk parameters of the validated model. Current backtesting practices in the empirical LGD literature are usually limited to comparing internal LGD predictions against realized LGD observations with error-based metrics, correlation-based metrics or even classification-based metrics. Therefore, a future study providing practical recommendations for validating LGD backtests is recommended.

## Declaration of interest

The author reports no conflicts of interest. The author alone is responsible for the content and writing of the paper.

## References

• Allen, L., DeLong, G., and Saunders, A. (2004). Issues in the credit risk modeling of retail markets. Journal of Banking & Finance 28(4), 727–752 (https://doi.org/10.1016/S0378-4266(03)00197-3).
• Altman, E. I., and Sabato, G. (2007). Modelling credit risk for SMEs: evidence from the US market. Abacus 43(3), 332–357 (https://doi.org/10.1111/j.1467-6281.2007.00234.x).
• Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika 38(1), 123–140 (https://doi.org/10.1007/BF02291180).
• Archer, K. J., Lemeshow, S., and Hosmer, D. W. (2007). Goodness-of-fit tests for logistic regression models when data are collected using a complex sampling design. Computational Statistics & Data Analysis 51(9), 4450–4464 (https://doi.org/10.1016/j.csda.2006.07.006).
• Bangia, A., Diebold, F. X., Kronimus, A., Schagen, C., and Schuermann, T. (2002). Ratings migration and the business cycle, with application to credit portfolio stress testing. Journal of Banking & Finance 26(2–3), 445–474 (https://doi.org/10.1016/S0378-4266(01)00229-1).
• Batislam, E. P., Denizel, M., and Filiztekin, A. (2007). Empirical validation and comparison of models for customer base analysis. International Journal of Research in Marketing 24(3), 201–209 (https://doi.org/10.1016/j.ijresmar.2006.12.005).
• Bauer, J., and Agarwal, V. (2014). Are hazard models superior to traditional bankruptcy prediction approaches? A comprehensive test. Journal of Banking & Finance 40(1), 432–442 (https://doi.org/10.1016/j.jbankfin.2013.12.013).
• Basel Committee on Banking Supervision (2001). The new Basel Capital Accord. Consultative Document, May 31, Bank for International Settlements.
• Basel Committee on Banking Supervision (2005a). An explanatory note on the Basel II IRB risk weight functions. Note, July, Bank for International Settlements.
• Basel Committee on Banking Supervision (2005b). Studies on the validation of internal rating systems. Working Paper 14, May, Bank for International Settlements.
• Basel Committee on Banking Supervision (2010). Sound practices for backtesting counterparty credit risk models. Note, December, Bank for International Settlements.
• Berkowitz, J., and O’Brien, J. (2002). How accurate are value-at-risk models at commercial banks? Journal of Finance 57(3), 1093–1111 (https://doi.org/10.1111/1540-6261.00455).
• Bernstein, M. E. (1962). The binomial test. American Journal of Human Genetics 14(4), 435–442.
• Bikker, J. A., and Metzemakers, P. A. (2005). Bank provisioning behaviour and procyclicality. Journal of International Financial Markets, Institutions and Money 15(2), 141–157 (https://doi.org/10.1016/j.intfin.2004.03.004).
• Blochwitz, S., Hohl, S., Tasche, D., and Wehn, C. S. (2004). Validating default probabilities on short time series. Capital & Market Risk Insights, Federal Reserve Bank of Chicago.
• Breiman, L. (1996). Bagging predictors. Machine Learning 24(2), 123–140 (https://doi.org/10.1007/BF00058655).
• Campbell, S. D. (2006). A review of backtesting and backtesting procedures. The Journal of Risk 9(2), 1–17.
• Carey, M., and Hrycay, M. (2001). Parameterizing credit risk models with rating data. Journal of Banking & Finance 25(1), 197–270 (https://doi.org/10.1016/S0378-4266(00)00124-2).
• Carlehed, M., and Petrov, A. (2012). A methodology for point-in-time–through-the-cycle probability of default decomposition in risk classification systems. The Journal of Risk Model Validation 6(3), 3–25 (https://doi.org/10.21314/JRMV.2012.091).
• Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2), 249–254.
• Castermans, G., Martens, D., Van Gestel, T., Hamers, B., and Baesens, B. (2010). An overview and framework for PD backtesting and benchmarking. Journal of the Operational Research Society 61(3), 359–373 (https://doi.org/10.1057/jors.2009.69).
• Chan, Y., and Walmsley, R. P. (1997). Learning and understanding the Kruskal–Wallis one-way analysis-of-variance-by-ranks test for differences among three or more independent groups. Physical Therapy 77(12), 1755–1761 (https://doi.org/10.1093/ptj/77.12.1755).
• Cheng, M., and Neamtiu, M. (2009). An empirical analysis of changes in credit rating properties: timeliness, accuracy and volatility. Journal of Accounting and Economics 47(1–2), 108–130 (https://doi.org/10.1016/j.jacceco.2008.11.001).
• Chorniy, V., and Lordkipanidze, B. (2015). Backtesting: are we missing the big picture? Conference Paper presented at the WBS 11th Fixed Income Conference, October 8, Paris.
• Dambolena, I. G., and Khoury, S. J. (1980). Ratio stability and corporate failure. Journal of Finance 35(4), 1017–1026 (https://doi.org/10.1111/j.1540-6261.1980.tb03517.x).
• Das, S. R., Duffie, D., Kapadia, N., and Saita, L. (2007). Common failings: how corporate defaults are correlated. Journal of Finance 62(1), 93–117 (https://doi.org/10.1111/j.1540-6261.2007.01202.x).
• De Haas, R., and Van Lelyveld, I. (2006). Foreign banks and credit stability in Central and Eastern Europe. A panel data analysis. Journal of Banking & Finance 30(7), 1927–1952 (https://doi.org/10.1016/j.jbankfin.2005.07.007).
• Engelmann, B., and Rauhmeier, R. (eds) (2006). The Basel II Risk Parameters: Estimation, Validation, and Stress Testing. Springer (https://doi.org/10.1007/3-540-33087-9).
• Engelmann, B., Hayden, E., and Tasche, D. (2003). Testing rating accuracy. Risk 16(1), 82–86.
• European Banking Authority (2016). Final draft regulatory technical standards on the specification of the assessment methodology for competent authorities regarding compliance of an institution with the requirements to use the IRB approach in accordance with Articles 144(2), 173(3) and 180(3)(b) of Regulation (EU) No 575/2013. Report, July 21, European Banking Authority.
• European Central Bank (2017). What is the targeted review of internal models? In ECB Guide to Banking Supervision. European Central Bank.
• Fan, J., Upadhye, S., and Worster, A. (2006). Understanding receiver operating characteristic (ROC) curves. Canadian Journal of Emergency Medicine 8(1), 19–20 (https://doi.org/10.1017/S1481803500013336).
• Fawcett, T. (2004). ROC graphs: notes and practical considerations for researchers. Machine Learning 31(1), 1–38.
• Fei, F., Fuertes, A. M., and Kalotychou, E. (2012). Credit rating migration risk and business cycles. Journal of Business Finance & Accounting 39(1–2), 229–263 (https://doi.org/10.1111/j.1468-5957.2011.02272.x).
• Feir-Walsh, B. J., and Toothaker, L. E. (1974). An empirical comparison of the ANOVA F-test, normal scores test and Kruskal–Wallis test under violation of assumptions. Educational and Psychological Measurement 34(4), 789–799 (https://doi.org/10.1177/001316447403400406).
• Foster, D. P., and Stine, R. A. (2004). Variable selection in data mining: building a predictive model for bankruptcy. Journal of the American Statistical Association 99(466), 303–313 (https://doi.org/10.1198/016214504000000287).
• Gaillard, N. (2014). What is the value of sovereign ratings? German Economic Review 15(1), 208–224 (https://doi.org/10.1111/geer.12018).
• Gavalas, D., and Syriopoulos, T. (2014). Bank credit risk management and rating migration analysis on the business cycle. International Journal of Financial Studies 2(1), 122–143 (https://doi.org/10.3390/ijfs2010122).
• Gordy, M. B. (2000). A comparative anatomy of credit risk models. Journal of Banking & Finance 24(1–2), 119–149 (https://doi.org/10.1016/S0378-4266(99)00054-0).
• Harmantzis, F. C., Miao, L., and Chien, Y. (2006). Empirical study of value-at-risk and expected shortfall models with heavy tails. Journal of Risk Finance 7(2), 117–135 (https://doi.org/10.1108/15265940610648571).
• Hecke, T. V. (2012). Power study of ANOVA versus Kruskal–Wallis test. Journal of Statistics and Management Systems 15(2–3), 241–247 (https://doi.org/10.1080/09720510.2012.10701623).
• Hodges, J. L., and Lehmann, E. L. (1956). The efficiency of some nonparametric competitors of the $t$-test. Annals of Mathematical Statistics 27(2), 324–335 (https://doi.org/10.1214/aoms/1177728261).
• Hosmer, D. W., Hosmer, T., Le Cessie, S., and Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine 16(9), 965–980 (https://doi.org/10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O).
• Jackson, P., and Perraudin, W. (2000). Regulatory implications of credit risk modelling. Journal of Banking & Finance 24(1–2), 1–14 (https://doi.org/10.1016/S0378-4266(99)00050-3).
• Jafry, Y., and Schuermann, T. (2004). Measurement, estimation and comparison of credit migration matrices. Journal of Banking & Finance 28(11), 2603–2639 (https://doi.org/10.1016/j.jbankfin.2004.06.004).
• Jones, D., and Mingo, J. (1999). Credit risk modeling and internal capital allocation processes: implications for a models-based regulatory bank capital standard. Journal of Economics and Business 51(2), 79–108 (https://doi.org/10.1016/S0148-6195(98)00030-7).
• Kalapodas, E., and Thomson, M. E. (2006). Credit risk assessment: a challenge for financial institutions. IMA Journal of Management Mathematics 17(1), 25–46 (https://doi.org/10.1093/imaman/dpi026).
• Kao, D. L. (2000). Estimating and pricing credit risk: an overview. Financial Analysts Journal 56(4), 50–66 (https://doi.org/10.2469/faj.v56.n4.2373).
• Karakoulas, G. (2004). Empirical validation of retail credit-scoring models. Retail Risk Management Journal 87(1), 56–60.
• Kealhofer, S. (2003). Quantifying credit risk I: default prediction. Financial Analysts Journal 59(1), 30–44 (https://doi.org/10.2469/faj.v59.n1.2501).
• King, G., Tomz, M., and Wittenberg, J. (2000). Making the most of statistical analyses: improving interpretation and presentation. American Journal of Political Science 44(1), 347–361 (https://doi.org/10.2307/2669316).
• Kruppa, J., Schwarz, A., Arminger, G., and Ziegler, A. (2013). Consumer credit risk: individual probability estimates using machine learning. Expert Systems with Applications 40(13), 5125–5131 (https://doi.org/10.1016/j.eswa.2013.03.019).
• Landis, J. R., and Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33(2), 363–374 (https://doi.org/10.2307/2529786).
• Lando, D. (2009). Credit Risk Modeling: Theory and Applications. Princeton University Press, Princeton, NJ (https://doi.org/10.1007/978-3-540-71297-8_35).
• Lemeshow, S., and Hosmer, D. W., Jr. (1982). A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology 115(1), 92–106 (https://doi.org/10.1093/oxfordjournals.aje.a113284).
• Lichtenstein, S., and Fischhoff, B. (1980). Training for calibration. Organizational Behavior and Human Performance 26(2), 149–171 (https://doi.org/10.1016/0030-5073(80)90052-5).
• Lima, E., Mues, C., and Baesens, B. (2011). Monitoring and backtesting churn models. Expert Systems with Applications 38(1), 975–982 (https://doi.org/10.1016/j.eswa.2010.07.091).
• Lix, L. M., Keselman, J. C., and Keselman, H. J. (1996). Consequences of assumption violations revisited: a quantitative review of alternatives to the one-way analysis of variance $F$ test. Review of Educational Research 66(4), 579–619.
• Longstaff, F. A., Mithal, S., and Neis, E. (2005). Corporate yield spreads: default risk or liquidity? New evidence from the credit default swap market. Journal of Finance 60(5), 2213–2253 (https://doi.org/10.1111/j.1540-6261.2005.00797.x).
• Lopez, J. A., and Saidenberg, M. R. (2000). Evaluating credit risk models. Journal of Banking & Finance 24(1–2), 151–165 (https://doi.org/10.1016/S0378-4266(99)00055-2).
• Lore, M., and Borodovsky, L. (2000). Professional’s Handbook of Financial Risk Management. Reed Educational and Professional, Oxford.
• Loterman, G., Debruyne, M., Branden, K. V., Van Gestel, T., and Mues, C. (2014). A proposed framework for backtesting loss given default models. The Journal of Risk Model Validation 8(1), 69–90 (https://doi.org/10.21314/JRMV.2014.117).
• MacFarland, T. W., and Yates, J. M. (2016). Wilcoxon matched-pairs signed-ranks test. In Introduction to Nonparametric Statistics for the Biological Sciences Using R. Springer (https://doi.org/10.1007/978-3-319-30634-6_5).
• Maltezos, S. (2017). Targeted review of internal models – TRIM – the foundation of model risk management regulation. SAS Blog, October 24.
• Martens, D., Van Gestel, T., De Backer, M., Haesen, R., Vanthienen, J., and Baesens, B. (2010). Credit rating prediction using ant colony optimization. Journal of the Operational Research Society 61(4), 561–573 (https://doi.org/10.1057/jors.2008.164).
• Martin, R., and Wilde, T. (2002). Credit portfolio measurement of unsystematic credit risk. Risk 15(11), 123–128.
• Marzban, C. (2004). The ROC curve and the area under it as performance measures. Weather and Forecasting 19(6), 1106–1114 (https://doi.org/10.1175/825.1).
• Matias-Gama, A. P., and Amaral-Geraldes, S. H. (2012). Credit risk assessment and the impact of the new Basel Capital Accord on small and medium-sized enterprises: an empirical analysis. Management Research Review 35(8), 727–749 (https://doi.org/10.1108/01409171211247712).
• Mayne, D. Q., Rawlings, J. B., Rao, C. V., and Scokaert, P. O. (2000). Constrained model predictive control: stability and optimality. Automatica 36(6), 789–814 (https://doi.org/10.1016/S0005-1098(99)00214-9).
• Miu, P., and Ozdemir, B. (2006). Basel requirement of downturn LGD: modeling and estimating PD and LGD correlations. The Journal of Credit Risk 2(2), 43–68 (https://doi.org/10.21314/JCR.2006.037).
• Morone, M., and Cornaglia, A. (2009). Rating philosophy and dynamic properties of internal rating systems: a general framework and an application to backtesting. The Journal of Risk Model Validation 3(4), 61–88 (https://doi.org/10.21314/JRMV.2009.047).
• Nickell, P., Perraudin, W., and Varotto, S. (2000). Stability of rating transitions. Journal of Banking & Finance 24(1–2), 203–227 (https://doi.org/10.1016/S0378-4266(99)00057-6).
• Nolde, N., and Ziegel, J. F. (2017). Rejoinder: elicitability and backtesting: perspectives for banking regulation. Annals of Applied Statistics 11(4), 1901–1911 (https://doi.org/10.1214/17-AOAS1041F).
• Pepe, M. S. (2000). Receiver operating characteristic methodology. Journal of the American Statistical Association 95(449), 308–311 (https://doi.org/10.1080/01621459.2000.10473930).
• Prorokowski, L. (2016). Rank-order statistics for validating discriminative power of credit risk models. Bank i Kredyt 47(3), 227–250.
• Prorokowski, L., and Prorokowski, H. (2014). Organisation of compliance across financial institutions. Journal of Investment Compliance 15(1), 65–76 (https://doi.org/10.1108/JOIC-12-2013-0041).
• Prudential Regulation Authority (2013). Internal ratings based approaches. Supervisory Statement SS11/13, PRA, Bank of England.
• Raymond, R., and Steffen, P. (2017). Targeted review of internal models. Zanders, November 1.
• Rezac, M., and Rezac, F. (2011). How to measure the quality of credit scoring models. Czech Journal of Economics and Finance 61(5), 486–507.
• Rösch, D. (2005). An empirical comparison of default risk forecasts from alternative credit rating philosophies. International Journal of Forecasting 21(1), 37–51 (https://doi.org/10.1016/j.ijforecast.2004.04.001).
• Rousson, V. (2007). The gamma coefficient revisited. Statistics & Probability Letters 77(17), 1696–1704 (https://doi.org/10.1016/j.spl.2007.04.009).
• Rufibach, K. (2010). Use of Brier score to assess binary predictions. Journal of Clinical Epidemiology 63(8), 938–939 (https://doi.org/10.1016/j.jclinepi.2009.11.009).
• Schnitzler, S., Rother, N., Plank, H., and Glössner, P. (2014). Backtesting for counterparty credit risk. The Journal of Risk Model Validation 8(4), 3–17 (https://doi.org/10.21314/JRMV.2014.129).
• Schuermann, T., and Hanson, S. G. (2004). Estimating probabilities of default. Staff Report 190, Federal Reserve Bank of New York.
• Shapiro, S. S., and Francia, R. S. (1972). An approximate analysis of variance test for normality. Journal of the American Statistical Association 67(337), 215–216 (https://doi.org/10.1080/01621459.1972.10481232).
• Sousa, M. R., Gama, J., and Brandão, E. (2015). Links between scores, real default and pricing: evidence from the Freddie Mac’s loan-level data set. Journal of Economics, Business and Management 3(12), 1106–1114 (https://doi.org/10.7763/JOEBM.2015.V3.343).
• Spiegelhalter, D. J. (1986). Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 5(5), 421–433 (https://doi.org/10.1002/sim.4780050506).
• Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B 64(4), 583–639 (https://doi.org/10.1111/1467-9868.00353).
• Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., and Kattan, M. W. (2010). Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology 21(1), 128–138 (https://doi.org/10.1097/EDE.0b013e3181c30fb2).
• Tasche, D. (2003). A traffic lights approach to PD validation. Library Paper, Cornell University (https://arxiv.org/abs/cond-mat/0305038).
• Tasche, D. (2008). Validation of internal rating systems and PD estimates. Quantitative Finance (Special Issue), 169–196.
• Upper, C., and Worms, A. (2004). Estimating bilateral exposures in the German interbank market: is there a danger of contagion? European Economic Review 48(4), 827–849 (https://doi.org/10.1016/j.euroecorev.2003.12.009).
• Van Der Burgt, M. (2008). Calibrating low-default portfolios, using the cumulative accuracy profile. The Journal of Risk Model Validation 1(4), 17–33 (https://doi.org/10.21314/JRMV.2008.016).
• Van Gestel, T., Baesens, B., Van Dijcke, P., Garcia, J., Suykens, J. A., and Vanthienen, J. (2006). A process model to develop an internal rating system: sovereign credit ratings. Decision Support Systems 42(2), 1131–1151 (https://doi.org/10.1016/j.dss.2005.10.001).
• Van Vuuren, G., de Jongh, R., and Verster, T. (2017). The impact of PD–LGD correlation on expected loss and economic capital. International Business & Economics Research Journal 16(3), 157–170 (https://doi.org/10.19030/iber.v16i3.9975).
• Vargha, A., and Delaney, H. D. (1998). The Kruskal–Wallis test and stochastic homogeneity. Journal of Educational and Behavioral Statistics 23(2), 170–192 (https://doi.org/10.3102/10769986023002170).
• Vasicek, O. (1976). A test for normality based on sample entropy. Journal of the Royal Statistical Society B 38(1), 54–59 (https://doi.org/10.1111/j.2517-6161.1976.tb01566.x).
• Wang, S. (1998). Aggregation of correlated risk portfolios: models and algorithms. Proceedings of the Casualty Actuarial Society 85(163), 848–939.
• Wu, D., and Olson, D. L. (2010). Enterprise risk management: coping with model risk in a large bank. Journal of the Operational Research Society 61(2), 179–190 (https://doi.org/10.1057/jors.2008.144).
• Wu, D., and Olson, D. L. (2015). Bank credit scoring. In Enterprise Risk Management in Finance. Palgrave Macmillan, London.
• Yazici, B., and Yolacan, S. (2007). A comparison of various tests of normality. Journal of Statistical Computation and Simulation 77(2), 175–183 (https://doi.org/10.1080/10629360600678310).
