# Journal of Investment Strategies

**ISSN:**

2047-1238 (print)

2047-1246 (online)

**Editor-in-chief:** Ali Hirsa

# Systematic testing of systematic trading strategies

Kovlin Perumal and Emlyn Flint

#### Need to know

- Statistical methodologies to eradicate data-mining bias (DMB) were critically evaluated under controlled conditions.
- The relationships between the DMB present and the variables governing market conditions were identified.
- Under our simulation process, the methods of White’s Reality Check and Monte-Carlo Permutation were shown to be the most effective in eradicating data-mining bias. The Step-M procedure was also shown to have merit.
- Bound manipulation methods were shown to be inferior to those that directly produce *p*-values.

#### Abstract

Systematic trading is a method that is currently extremely popular in the investment world. The testing of systematic trading rules is usually done through backtesting and is at high risk of spurious accuracy as a result of the data-mining bias (DMB) present when testing multiple rules concurrently over the same historical period. The eradication of this DMB through the use of statistical methodologies is currently a relevant topic in investment research, as is illustrated by papers written by Chordia, Goyal and Saretto in 2017, by Harvey and Liu in 2014, by Novy-Marx in 2016 and by Peterson in 2015. This study reviews the various statistical methodologies that are in place to test multiple systematic trading strategies and implements these methodologies under simulation with known artificial trading rules in order to critically compare and evaluate them.


## 1 Introduction

Systematic trading is currently an extremely popular method in the investment world. It is a process whereby a buy or sell signal is generated from a rules-based quantitative process with the aim of meeting an investor’s objectives given a defined constraint set. These systematic trading rules are evaluated through the method of backtesting, which involves the historic simulation of an algorithmic investment strategy (Bailey et al 2014).

However, evaluation done in this way is at high risk of spurious accuracy as a result of the data-mining bias (DMB) present when testing multiple rules concurrently over the same historical period. This makes it difficult to distinguish between genuinely superior rules and false discoveries that happen to get lucky over the particular backtest. This problem results in resources being put into trading rules that perform well in-sample (in the backtest) but are unprofitable out-of-sample (in the real world).

DMB is eradicated through the use of various statistical methodologies. Solving the problem of DMB through the implementation of these statistical methodologies is currently highly relevant. This is illustrated by the large number of studies into the subject in recent years. Examples of such research include papers by Chordia et al (2017), Harvey and Liu (2014), Novy-Marx (2016) and Peterson (2015). The papers listed advocate the use of these statistical methods to eradicate DMB and will serve as the inspiration for our research.

Chordia et al (2017) highlights the problems associated with data mining (or “p-hacking”) by practically evaluating 2.1 million trading strategies based on empirical fundamental stock data using a variety of multiple hypothesis tests. The tests include the Bonferroni correction, White’s reality check, controlling the false discovery rate and various iterations of the step-M method. The authors found that, although these strategies performed well according to conventional tests, very few strategies survived the multiple hypothesis evaluation. The surviving strategies were also found to have no apparent theoretical underpinnings according to the various additional economic hurdles employed.

Novy-Marx (2016) concludes that systematic multisignal trading strategies cannot be evaluated using conventional tests. Through the conventional evaluation of multisignal strategies, selected using purely random signals, they show that DMB makes it very easy to produce strategies with excellent backtested performance that, by construction, have no genuine predictive power. Novy-Marx derives different test statistics and suggests they be used in order to remove the effects of DMB.

Harvey and Liu (2014) practically demonstrate the ease with which seemingly profitable strategies can be produced by exploiting the effects of DMB, using multiple illustrative examples. The authors conclude that researchers must be aware of the problems associated with DMB and should make use of multiple hypothesis testing frameworks. In particular, they suggest using methods that control the family-wise error rate and the false discovery rate.

Our research provides a unique perspective on the matter by moving away from highlighting the problems with DMB and using empirical examples of trading strategies. Instead, our study focuses on critically evaluating the statistical methods used to eradicate DMB, with the aim of discovering which method performs best in the context of systematic trading strategies. To achieve this goal, rather than using empirical example strategies, we make use of artificial trading rules, simulated with controlled variables. This approach makes it possible to analyze the effects of each simulated market and strategy variable on the magnitude of DMB observed as well as test the statistical methods over a wide range of market and strategy scenarios.

A contradictory paper by Harris (2016) states that this type of quantitative evaluation is obsolete. Harris claims that studying drastic changes in market conditions and the times at which they occur is fundamentally more important. Harris rationalizes this by stating that these market changes cause the largest gains, the largest losses and the most invalidations with regard to trading strategies. This may be true to a degree, but statistical methods can still be used to identify profitable trading strategies in current market conditions. These conditions can be assumed to stay constant over short periods as drastic market changes occur infrequently. The underlying market characteristics that allow for strategy profitability can then be identified through analysis and monitored for any change. This shows that quantitative claims made through the use of statistical methodologies are still meaningful.

The remainder of this study is set out as follows. Section 2 provides an overview of the market and strategy simulation framework as well as the statistical methods to be tested. Section 3 briefly describes the simulation process. The results of the study are presented in Section 4 and are split into two parts: the first focuses on the effect of each simulation parameter on the total magnitude of DMB observed; and the second presents the general performance of each statistical method across various market and strategy scenarios. Section 5 concludes the study.

## 2 Methodology and data

The tests performed in this paper involve simulation of both artificial market data and artificial trading rules (ATRs) with known combinations of parameters. The general effects of changing these parameters on the magnitude of DMB are measured, as are the specific effects of changing these parameters on relative statistical method performance. The performance of each statistical method is then evaluated by comparing the number of false discoveries made by each one. The average $p$-values obtained from each method are compared with strategy profitability to give an indication of relative power. This $p$-value corresponds to the test of the null hypothesis that the mean return produced by the simulated ATR, for a simulated market path and a single combination of parameters, is equal to zero. The artificial market, the ATRs and the parameters associated with each will now be outlined.

### 2.1 Market data simulation

In order to realistically simulate market data, the Merton jump diffusion model is selected as the simulation model. This model captures two key features of real-world markets that are deemed important: namely, the variance of market returns and the presence of jumps. Table 1 provides an overview of the selected market parameter ranges.

Parameter | Symbol | Possible values
---|---|---
Drift of diffusion | ${u}_{\mathrm{d}}$ | 0
Diffusion standard deviation | ${\sigma}_{\mathrm{d}}$ | 0.1, 0.3, 0.5
Poisson arrival rate | $\lambda$ | 0, 0.1, 0.2
Drift of jump | ${u}_{\mathrm{j}}$ | 0
Jump standard deviation | ${\sigma}_{\mathrm{j}}$ | 0.01
Time step (months) | $\mathrm{d}t$ | 1
Number of paths simulated | $N$ | 100
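As an illustration of this simulation step, the sketch below generates $N$ paths of $T$ periodic Merton jump diffusion returns. The function name, the random number generator and the default values are our own choices: the paper does not specify an implementation, and the time units of the drift and volatility parameters are assumed to match the time step.

```python
import numpy as np

def simulate_merton_returns(mu_d=0.0, sigma_d=0.3, lam=0.1, mu_j=0.0,
                            sigma_j=0.01, dt=1.0, T=60, N=100, seed=42):
    """Simulate N paths of T periodic returns under the Merton jump
    diffusion model: a Gaussian diffusion term plus a compound Poisson
    jump term with Gaussian jump sizes."""
    rng = np.random.default_rng(seed)
    # Diffusion component of each period's return
    diffusion = mu_d * dt + sigma_d * np.sqrt(dt) * rng.standard_normal((N, T))
    # Number of jumps arriving in each period and their aggregate size
    n_jumps = rng.poisson(lam * dt, size=(N, T))
    jumps = mu_j * n_jumps + sigma_j * np.sqrt(n_jumps) * rng.standard_normal((N, T))
    return diffusion + jumps
```

Setting $\lambda = 0$ recovers a pure diffusion, matching the jump-free scenarios in the parameter grid.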

### 2.2 Artificial trading rule simulation

ATRs are simulated in order to generate mean returns for each simulated market path. These ATRs are defined according to a probability of success $p$, which is simply the probability of the ATR producing a signal that will match the sign (positive or negative) of the market return at the applicable time step. Further, the number of ATRs simulated for testing and the length of the backtest (the strategy length) are also controlled parameters. Table 2 provides an overview of the selected ATR parameter ranges.

Parameter | Symbol | Possible values
---|---|---
Probability of success | $p$ | 0.25, 0.4, 0.5, 0.6, 0.75
Number of rules simulated | $K$ | 2, 10, 25, 50, 100
Strategy length (months) | $T$ | 12, 60, 120, 180
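The ATR mechanism above can be sketched as follows. We assume, as one natural reading of the definition, that a rule matching the sign of the market return at time $t$ earns the absolute market return $|r_t|$ and an incorrect call earns $-|r_t|$; the function name and defaults are illustrative rather than the authors' implementation.

```python
import numpy as np

def simulate_atr_returns(market_returns, p=0.5, K=50, seed=42):
    """Generate the return series of K artificial trading rules on a
    single market path. Each rule independently matches the sign of the
    market return with probability p; a correct call earns |r_t| and an
    incorrect call earns -|r_t|."""
    rng = np.random.default_rng(seed)
    T = len(market_returns)
    correct = rng.random((K, T)) < p              # does the signal match the sign?
    magnitude = np.abs(np.asarray(market_returns, dtype=float))
    return np.where(correct, magnitude, -magnitude)

# Example: the best-of-K mean return an investor would naively report
path = np.random.default_rng(0).standard_normal(60)
best_mean = simulate_atr_returns(path, p=0.5, K=50).mean(axis=1).max()
```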

### 2.3 Statistical methods

After simulating the market data and ATRs, the mean returns for these strategies are calculated. The underlying question is whether the mean returns are significantly greater than zero.

The answer to that question is susceptible to the effects of DMB. Since we know (through controlling the ATR success parameter $p$) whether these strategies will be profitable or not, we can show how DMB presents itself in conventional $t$-tests and the extent to which it is eradicated using a variety of statistical methods. These statistical methods will now be briefly described (see Appendix A1 online for more mathematical detail).

#### 2.3.1 Controlling the family-wise error rate

The key problem in multiple hypothesis testing is that the probability of making a false discovery diverges from the initially defined $\alpha$ according to the number of hypotheses $K$ that are tested (for an illustrative example of this divergence, see Appendix A.1.1 online). This is known as DMB.

The family-wise error rate (FWER) is defined as the probability of making at least one false discovery when testing multiple hypotheses (Harvey and Liu 2013). Controlling the FWER is one of the fundamental tools needed to ensure that the number of falsely identified significant strategies can be bounded. The simplest way of controlling the FWER is known as the Bonferroni correction. This states that in order to preserve $\alpha $ as the actual type 1 error rate (the error rate of making a false discovery) we should set ${\alpha}^{*}=\alpha /K$ and use ${\alpha}^{*}$ as the critical value (the significance level) when performing each individual hypothesis test.
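In code, the Bonferroni correction is a one-line adjustment. The sketch below (the function name is ours) rejects an individual hypothesis only when its $p$-value falls below $\alpha^{*}=\alpha/K$:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis k only if its p-value is at most alpha / K,
    which bounds the family-wise error rate at alpha."""
    K = len(p_values)
    return [pv <= alpha / K for pv in p_values]
```

With $K=5$ and $\alpha=0.05$, for example, each individual test is held to $\alpha^{*}=0.01$.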

#### 2.3.2 Controlling the false discovery rate

The false discovery rate (FDR) is defined as the expected proportion of falsely rejected null hypotheses (Harvey and Liu 2013). The principal way of controlling the FDR is the Benjamini and Hochberg procedure (Benjamini and Hochberg 1995). This procedure determines an upper bound for the number of false discoveries produced. The procedure will now be formally outlined (for the motivation behind the proposal of this test by Benjamini and Hochberg (1995), see Appendix A.1.3 online).

Consider having already conducted the individual hypothesis tests ${\mathrm{H}}_{1},{\mathrm{H}}_{2},\mathrm{\dots},{\mathrm{H}}_{K}$ comparing our $K$ strategies to the zero-mean benchmark and obtaining the $p$-values associated with each: ${p}_{1},{p}_{2},\mathrm{\dots},{p}_{K}$. Let ${p}_{(1)}\le {p}_{(2)}\le \mathrm{\cdots}\le {p}_{(K)}$ be the ordered $p$-values for the corresponding hypothesis tests ${\mathrm{H}}_{(1)},{\mathrm{H}}_{(2)},\mathrm{\dots},{\mathrm{H}}_{(K)}$. The Benjamini and Hochberg procedure states that, in order to control the FDR at significance level $\alpha$, one should reject any ${\mathrm{H}}_{(i)}$ for which ${p}_{(i)}\le (i/K)\alpha$. The largest rejected ${p}_{(i)}$ is then referred to as the Benjamini and Hochberg threshold, and rejecting hypotheses according to this threshold ensures that

$$\mathbb{E}\left[\frac{\text{false discoveries}}{\text{all discoveries}}\right]\le \alpha ,$$

and thus

$$\mathbb{E}[\text{false discoveries}]\le \alpha \,\mathbb{E}[\text{all discoveries}].$$

In theory, this allows one to set the number of false discoveries at a level that leaves sufficient power to find other significant results.
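The procedure translates directly into code. In the sketch below (the function name is ours), note the step-up behaviour: every hypothesis up to the largest ordered index whose $p$-value sits below its threshold is rejected, even if an intermediate ordered $p$-value exceeds its own threshold.

```python
import numpy as np

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the hypotheses whose ordered p-values fall at or below the
    Benjamini-Hochberg threshold (i/K)*alpha, controlling the FDR."""
    p = np.asarray(p_values, dtype=float)
    K = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, K + 1) / K
    below = p[order] <= thresholds
    reject = np.zeros(K, dtype=bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])   # largest ordered index under threshold
        reject[order[: cutoff + 1]] = True    # reject everything up to that index
    return reject
```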

#### 2.3.3 White’s reality check

White’s reality check (WRC) is the first comprehensive testing methodology to be implemented (White 2000). White (2000) extended the test developed by Diebold and Mariano (1995) to allow for comparisons between multiple hypotheses; a description of that test is available in Appendix A.1.4 online. The goal of WRC is to provide a multiple hypothesis testing method that moves away from a reliance on bounds, such as the Bonferroni correction, and that directly delivers appropriate $p$-values. White states that the null hypothesis tested should be

$${\mathrm{H}}_{0}:\underset{k=1,\mathrm{\dots},K}{\mathrm{max}}E(s({u}_{k,t})-s({u}_{0,t}))\le 0,$$

$${\mathrm{H}}_{A}:\underset{k=1,\mathrm{\dots},K}{\mathrm{max}}E(s({u}_{k,t})-s({u}_{0,t}))>0,$$

where $s(\cdot )$ is a defined “satisfaction” function, ${u}_{k,t}$ is the return produced by strategy $k$ at time $t$, and ${u}_{0,t}$ is the benchmark return at time $t$. In the context of our practical implementation, we define $s(\cdot )$ to be simply the mean strategy return, and the benchmark is set to a mean return of zero. In alternative implementations of WRC, $s(\cdot )$ can represent any performance metric (such as a Sharpe ratio, a Calmar ratio, value-at-risk, etc). Intuitively, this null hypothesis states that the best strategy encountered over a particular search has no superiority over the benchmark strategy with a mean return of zero. A further proposition in White (2000) states that, for $t=1,\mathrm{\dots},T$,

$$\underset{k=1,\mathrm{\dots},K}{\mathrm{max}}\frac{1}{\sqrt{T}}\sum _{t=1}^{T}((s({u}_{k,t+1})-s({u}_{0,t+1}))-E(s({u}_{k,t})-s({u}_{0,t})))\to \underset{k=1,\mathrm{\dots},K}{\mathrm{max}}{Z}_{k},$$

where ${Z}_{k}$ is $N(0,\mathrm{Var}(s({u}_{k})-s({u}_{0})))$ distributed. Building from this, the corresponding test statistic is given by

$${\mathrm{TS}}_{k}^{\mathrm{WRC}}=\underset{k=1,\mathrm{\dots},K}{\mathrm{max}}\left(\frac{1}{\sqrt{T}}\sum _{t=1}^{T}(s({u}_{k,t+1})-s({u}_{0,t+1}))\right).$$

The distribution of the maximum of a set of Gaussian variables is not itself Gaussian and is difficult to determine analytically. As a result, the hypothesis for WRC is tested by generating a sampling distribution through bootstrapping, and $p$-values for WRC are then obtained from this sample distribution. The implementation of this method closely follows the algorithm outlined by Aronson (2011).
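A minimal sketch of the WRC bootstrap is given below. It takes the satisfaction function to be the mean benchmark-relative return, as in our implementation, but uses simple iid resampling of time indexes rather than the stationary bootstrap of White (2000); the function name and defaults are illustrative.

```python
import numpy as np

def wrc_pvalue(strategy_returns, B=2000, seed=42):
    """White's reality check sketch: bootstrap the distribution of the
    maximum scaled mean return across K strategies under the null of
    zero mean, then compare the observed maximum against it.
    strategy_returns: (K, T) array of benchmark-relative returns."""
    rng = np.random.default_rng(seed)
    K, T = strategy_returns.shape
    ts = (np.sqrt(T) * strategy_returns.mean(axis=1)).max()    # TS^WRC
    # Centre each strategy's returns so the bootstrap obeys the null
    centred = strategy_returns - strategy_returns.mean(axis=1, keepdims=True)
    boot_max = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, T, size=T)                        # iid resample (simplified)
        boot_max[b] = (np.sqrt(T) * centred[:, idx].mean(axis=1)).max()
    return np.mean(boot_max >= ts)                              # bootstrap p-value
```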

#### 2.3.4 Hansen’s test for superior predictive ability

Hansen (2005) proposes a new test for checking whether a model has superior predictive ability (SPA) over a benchmark model. Hansen’s test is implemented in a similar way to WRC but uses a different test statistic and a sample-dependent distribution. Hansen defines the following studentized test statistic:

$${\mathrm{TS}}_{k}^{\mathrm{SPA}}=\mathrm{max}\left[\underset{k=1,\mathrm{\dots},K}{\mathrm{max}}\frac{(1/\sqrt{T}){\sum}_{t=1}^{T}(s({u}_{k,t+1})-s({u}_{0,t+1}))}{{\widehat{w}}_{k}},0\right],$$

where ${\widehat{w}}_{k}^{2}$ is some consistent estimator of

$${w}_{k}^{2}=\mathrm{Var}\left(\frac{1}{\sqrt{T}}\sum _{t=1}^{T}(s({u}_{k,t+1})-s({u}_{0,t+1}))\right).$$

The estimator used in the practical implementation in this paper is

$${\widehat{w}}_{k}^{2}={B}^{-1}\sum _{b=1}^{B}{\left(\frac{1}{\sqrt{T}}\sum _{t=1}^{T}(s({u}_{k,t+1}^{(b)})-s({u}_{0,t+1}^{(b)}))\right)}^{2},$$

where $B$ is the number of resamples used in the bootstrapping process and the superscript $(b)$ denotes values taken from the $b$th bootstrap resample. $p$-values for each ${\mathrm{TS}}_{k}^{\mathrm{SPA}}$ are then obtained from a sample distribution generated through bootstrapping.
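The studentization step can be sketched as follows, with the satisfaction function again taken as the mean benchmark-relative return. Here ${\widehat{w}}_{k}$ is approximated by the standard deviation of the scaled bootstrap means, a simplification of Hansen's estimator; the function name and defaults are our own.

```python
import numpy as np

def spa_statistic(strategy_returns, B=1000, seed=42):
    """Hansen-style studentized SPA statistic sketch: each strategy's
    scaled mean return is divided by a bootstrap estimate of its
    standard deviation before taking the maximum, floored at zero.
    strategy_returns: (K, T) array of benchmark-relative returns."""
    rng = np.random.default_rng(seed)
    K, T = strategy_returns.shape
    scaled_means = np.sqrt(T) * strategy_returns.mean(axis=1)
    # Bootstrap estimate of w_k: spread of the scaled resampled means
    boot = np.empty((B, K))
    for b in range(B):
        idx = rng.integers(0, T, size=T)
        boot[b] = np.sqrt(T) * strategy_returns[:, idx].mean(axis=1)
    w_hat = boot.std(axis=0, ddof=1)
    return max(np.max(scaled_means / w_hat), 0.0)
```

The floor at zero reflects the one-sided alternative: strategies that underperform the benchmark contribute nothing to the statistic.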

#### 2.3.5 Monte Carlo permutation

The Monte Carlo permutation (MCP) method, developed by Masters (2006), is based on the idea of testing whether an “informed” model is significantly superior to a “no-skill” model that is devoid of any predictive power. This method is implemented in our context according to the algorithm outlined by Aronson (2011), specifically, by using bootstrapping to generate the sample distribution of a “no-skill” rule’s backtested performance. This “no-skill” rule is created by randomly pairing the monthly returns of the simulated market with the ordered time series representing the sequence of ATR output values. This random pairing of rule values with market returns eliminates any predictive power the rule may have had and creates a “no-skill” rule, which Aronson refers to as a “noise rule”. The sample distribution of maximum mean returns produced by this noise rule is generated through bootstrapping, and the mean return produced by each ATR is tested against this to generate $p$-values.

#### 2.3.6 Corradi and Swanson’s extension

Corradi and Swanson (2011), building on the work of White (2000), introduced a novel testing approach (known as Corradi and Swanson’s extension (CS)) in which forecast combinations are evaluated through the examination of the quantiles of an expected loss distribution. In more detail, the models are compared by looking at cumulative distribution functions (CDFs) of prediction errors, for a given loss function, and the model whose CDF is stochastically dominated is chosen to be the best performing (see Appendix A.1.2 online for a full definition of first-order stochastic dominance).

The definitions originally used by Corradi and Swanson are converted for our application of testing trading strategies. A notable difference in our context is that we are looking for the rule that stochastically dominates the other rules in terms of our satisfaction function $s$, whereas Corradi and Swanson were looking for a rule that was stochastically dominated in terms of a loss function $g$. In order to use their methodology without major changes, we define $g=-s$. Minimizing the “loss” function $g$ will thus be the same as maximizing $s$. We note our usual definitions for ${u}_{k,t}$ and $s(u)$, letting ${F}_{g,k}(x)$ be the empirical distribution of $g({u}_{k,t})$ evaluated at $x$ and ${\widehat{F}}_{g,k,T}(x)$ be its sample analog, ie,

$${\widehat{F}}_{g,k,T}(x)=\frac{1}{T}\sum _{t=1}^{T}{\mathbf{1}}_{\{g({u}_{k,t})\le x\}}.$$

The corresponding hypotheses are

$${\mathrm{H}}_{0}:\underset{k>0}{\mathrm{max}}\,\underset{x\in X}{\mathrm{inf}}\,({F}_{g,0}(x)-{F}_{g,k}(x))\ge 0$$

versus

$${\mathrm{H}}_{A}:\underset{k>0}{\mathrm{max}}\,\underset{x\in X}{\mathrm{inf}}\,({F}_{g,0}(x)-{F}_{g,k}(x))<0.$$

Consider the case of a single strategy ($k=1$) versus a benchmark strategy ($k=0$). If $({F}_{g,0}(x)-{F}_{g,1}(x))\ge 0$ for all $x$, then the CDF associated with the benchmark strategy always lies above the CDF of the competing strategy. Then, $g({u}_{0,t})$ is (first-order) stochastically dominated by $g({u}_{1,t})$ and the benchmark is preferred. The dominated strategy is preferred because we are dealing with losses, which we would like to minimize. Alternatively, if we reject the null hypothesis, it implies either that rule 0 stochastically dominates rule 1 or that the CDFs cross, and further analysis is required to select the best-performing rule.

Corradi and Swanson use the following test statistic when testing the null and alternative hypotheses outlined above (the minus sign in front of the statistic ensures that it does not diverge under the null hypothesis):

$${L}_{s,T}=-\underset{k>0}{\mathrm{max}}\,\underset{x\in X}{\mathrm{inf}}\,\sqrt{T}({\widehat{F}}_{g,0,T}(x)-{\widehat{F}}_{g,k,T}(x)).$$

Bootstrapping is then used to construct an empirical distribution and compare the calculated sample test statistic against the percentile of this empirical distribution.
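Under the conversion $g=-s$ described above, the statistic ${L}_{s,T}$ can be computed directly from empirical CDFs. The sketch below evaluates the CDFs on the pooled grid of observed losses, which is sufficient because empirical CDFs only change at observed points; the function name and input format are our own.

```python
import numpy as np

def cs_statistic(benchmark_losses, strategy_losses_list):
    """Corradi-Swanson statistic sketch: for each competing rule, take
    the infimum over x of F0(x) - Fk(x) (empirical loss CDFs), then the
    maximum over rules, scaled by -sqrt(T)."""
    benchmark_losses = np.asarray(benchmark_losses, dtype=float)
    T = len(benchmark_losses)
    # Pooled grid of all observed losses; empirical CDFs only move here
    grid = np.sort(np.concatenate([benchmark_losses]
                                  + [np.asarray(l, dtype=float)
                                     for l in strategy_losses_list]))
    F0 = np.searchsorted(np.sort(benchmark_losses), grid, side="right") / T
    worst = -np.inf
    for losses in strategy_losses_list:
        losses = np.asarray(losses, dtype=float)
        Fk = np.searchsorted(np.sort(losses), grid, side="right") / len(losses)
        worst = max(worst, np.min(F0 - Fk))   # inf over x of F0(x) - Fk(x)
    return -np.sqrt(T) * worst
```

When the benchmark's loss CDF lies everywhere above a rival's (the null), the infimum is nonnegative and the statistic is nonpositive; large positive values point towards rejection.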

#### 2.3.7 Romano and Wolf’s test

Romano and Wolf (2005) propose a stepwise multiple testing procedure (also referred to as the step-M method) that asymptotically controls the FWER, with the use of studentization where applicable. Romano and Wolf’s method is thus a stepwise extension of WRC proposed in order to increase the power of the test, while still controlling the FWER at a given level $\alpha $. The first step of the step-M method is analogous to that of the WRC test. Let ${w}_{T,k}$ be a basic test statistic that will be used to test these hypotheses. First, obtain $K$ such test statistics, one for each strategy being tested against the benchmark. Then, order these statistics and relabel them, running from largest to smallest, with subscripts $(1)$ to $(K)$, such that ${w}_{T,(1)}\ge {w}_{T,(2)}\ge \mathrm{\cdots}\ge {w}_{T,(K)}$ and the corresponding hypotheses are ${\mathrm{H}}_{(1)},\mathrm{\dots},{\mathrm{H}}_{(K)}$. Individual decisions are then made for each ${w}_{T,(k)}$ in a stepwise manner. For the first step, a rectangular joint confidence interval with nominal joint coverage probability $1-\alpha $, for a chosen $\alpha $, is constructed. This confidence region is of the form

$$[{w}_{T,(1)}-{c}_{1},\infty)\times \mathrm{\cdots}\times [{w}_{T,(K)}-{c}_{1},\infty),$$

where ${c}_{1}$ is chosen to ensure the selected joint coverage probability, $1-\alpha$. This ensures that the FWER is maintained at $\alpha$. If a particular individual confidence interval $[{w}_{T,(k)}-{c}_{1},\infty)$ does not contain zero, then the corresponding null hypothesis ${\mathrm{H}}_{(k)}$ is rejected. The stepwise component of this method is now realized. After $m$ hypotheses are rejected in the first step, a second rectangular joint confidence interval is constructed for the remaining hypotheses, again with a nominal joint coverage probability of $1-\alpha$. This confidence interval is of the form

$$[{w}_{T,(m+1)}-{c}_{2},\infty)\times \mathrm{\cdots}\times [{w}_{T,(K)}-{c}_{2},\infty),$$

where ${c}_{2}$ is selected to ensure the joint coverage probability of $1-\alpha $. This process is repeated until no further hypotheses can be rejected and one is left with a pool of significant strategies for which the FWER is controlled at $\alpha $. Romano and Wolf (2005) detail the method of finding the ${c}_{i}$ values exactly; the details are beyond the scope of this study. In our practical implementation, the ${c}_{i}$ values were obtained as the critical values of the sample distribution produced after bootstrapping at each step. This method of implementation is detailed in Chordia et al (2017).
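The stepwise logic can be sketched as follows, assuming the observed statistics and a matrix of centred bootstrap draws have already been computed (the bootstrap construction itself follows the WRC machinery); each ${c}_{i}$ is taken as the $1-\alpha$ quantile of the bootstrapped maximum over the hypotheses still in play. Names are illustrative.

```python
import numpy as np

def step_m_reject(stats, boot_stats, alpha=0.05):
    """Romano-Wolf step-M sketch: repeatedly reject the hypotheses whose
    statistic w exceeds the joint critical value c_i, where c_i is the
    1-alpha quantile of the bootstrap maximum over the hypotheses that
    have not yet been rejected.
    stats: (K,) observed statistics; boot_stats: (B, K) centred draws."""
    stats = np.asarray(stats, dtype=float)
    K = len(stats)
    active = np.ones(K, dtype=bool)       # hypotheses not yet rejected
    reject = np.zeros(K, dtype=bool)
    while active.any():
        c = np.quantile(boot_stats[:, active].max(axis=1), 1 - alpha)
        new = active & (stats - c > 0)    # interval [w - c, inf) excludes zero
        if not new.any():
            break                         # no further rejections possible
        reject |= new
        active &= ~new
    return reject
```

Because the maximum is taken over a shrinking set, the critical value falls at each step, which is exactly where the extra power over a single-step WRC comes from.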

## 3 Testing algorithm

The test carried out in this paper involves the simulation of a carefully controlled market environment with all parameters known at the outset. A sample path with a specific parameter set $\{{u}_{\mathrm{d}},{\sigma}_{\mathrm{d}},\lambda \}$ is simulated, and $K$ ATRs of length $T$ with a specified probability of success $p$ are generated. The “satisfaction” for each strategy is then calculated as an indicator of strategy performance. This is simply the ATR’s mean return over the simulated backtest.

The ATR with the best mean return is selected for testing, and a new market path is simulated. This is repeated $N$ times such that $N$ maximum mean returns and corresponding strategies are obtained for testing. This is done to simulate ($N$ times) the real-world situation of investors choosing the “best-of-$K$” strategy without accounting for DMB. The testing of $N$ maximums, as opposed to $N$ observed means of varying profitability, also serves to stress test the statistical methods themselves in order to observe their performance in extreme situations.

The significance of these $N$ “best-of-$K$” mean returns is then tested with $M$ different statistical methods. The null hypothesis for every statistical method implemented states that the mean return produced by each strategy is equal to zero. The $p$-values obtained through the testing give the probability of observing a mean return at least as large as that produced, under this null hypothesis. A $p$-value greater than 0.05 indicates that a mean return is not significantly different from zero, and a $p$-value less than 0.05 indicates the opposite.
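As a self-contained illustration of why this best-of-$K$ design stresses conventional testing, the sketch below (our own construction, using zero-skill Gaussian return series and a normal critical value as an approximation to the one-sided $t$-test) reproduces the naive procedure and reports its rejection rate:

```python
import numpy as np

def naive_rejection_rate(K=50, T=60, N=200, crit=1.645, seed=42):
    """For each of N batches, simulate K zero-skill rule return series,
    keep the best-of-K mean return (as a naive investor would), and run
    an ordinary one-sided test on that maximum as if it were a single
    pre-specified strategy. Returns the fraction of batches rejected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(N):
        returns = rng.standard_normal((K, T))   # K no-skill rule return series
        best = np.argmax(returns.mean(axis=1))  # investor keeps the best backtest
        r = returns[best]
        t_stat = r.mean() / (r.std(ddof=1) / np.sqrt(T))
        rejections += t_stat > crit
    return rejections / N
```

With $K=1$ the rejection rate sits near the nominal 5%, while for $K=50$ it is far above it: the selection of the maximum, not the individual test, is what manufactures the false discoveries.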

## 4 Results

A total of 720 000 $p$-values were obtained after implementation (100 maximum returns $\times $ 8 statistical methods $\times $ 900 parameter combinations). Selected results were then used in order to study the questions of interest. In particular, we consider how the statistical methods compare in terms of eradicating DMB. Further, we study the effect of each simulation parameter on method performance. First, it is worthwhile to observe how each simulation parameter affects the magnitude of DMB in isolation.

### 4.1 Effect of simulation parameters on DMB

ATR return is determined purely by the ATR success rate. Since this rate was controlled from the outset, the expected return for each strategy is known, and the degree of DMB can thus be quantified. The magnitude of DMB was calculated as the absolute value of ${\overline{u}}_{k}-\mathbb{E}[{\overline{u}}_{k}]$, where ${\overline{u}}_{k}$ is the mean return produced by each ATR and $\mathbb{E}[{\overline{u}}_{k}]$ is given by

$$\frac{1}{T}\left[p\sum _{t=1}^{T}{u}_{k,t}-(1-p)\sum _{t=1}^{T}{u}_{k,t}\right].$$

The figures below plot the relationship between each variable and the magnitude of DMB present holding all other variables constant.

#### 4.1.1 ATR success rate

From Figure 1(a), it can be seen that the average magnitude of DMB is highest for an ATR success rate of 0.5. The DMB decreases on either side of this peak. The reason for this is that ATR success rate has an indirect effect on DMB through its contribution to volatility. The closer ATR success is to 0.5, the higher the volatility of returns produced by the trading strategies. Figure 1(e) illustrates that volatility has a significant effect on the average magnitude of DMB.

#### 4.1.2 Number of strategies tested

Number of strategies tested refers to the value of $K$, from which a best-of-$K$ strategy is selected. From Figure 1(b), it can be seen that, holding all else constant, the larger the number of strategies tested before a maximum is selected, the larger the magnitude of the DMB.

#### 4.1.3 Length of backtest

Varying the backtest length is analogous to varying the number of observations used to calculate the measure of performance used. As expected, and as represented by Figure 1(c), all else constant, the longer the backtest, the smaller the magnitude of DMB. The shape of the curve also suggests that there may be a point at which increasing backtest length may negate the effect of the other parameters and minimize DMB.

#### 4.1.4 Presence of outliers

The effect of an increase in the number of outliers was approximated by introducing jumps to the Merton jump model used to simulate market data. The frequency of these jumps was then increased (by manipulation of the parameter $\lambda $). All else constant, the more frequent the jumps, the larger the magnitude of DMB, as shown in Figure 1(d). However, this is a very small effect when compared with the other market parameters.

#### 4.1.5 Volatility

The market volatility was changed by manipulating the standard deviation of diffusion (${\sigma}_{\mathrm{d}}$). From Figure 1(e), it can be seen that an increase in this parameter greatly increased the magnitude of DMB present in the results. Volatility is seen to have the largest effect on the magnitude of DMB, and it will be worthwhile to see how the statistical methods perform in markets with high volatility.

### 4.2 General method performance

Table 3 shows method performance according to the proportion of rules identified as significant for each ATR success rate $p$ (the $p=0.45$ and $p=0.55$ cases were tested and produced results very similar to those for $p=0.4$ and $p=0.6$, respectively, and have therefore been omitted from the tables). Table 3(a) represents a base case where parameters were set as $\{k=50,T=60,\lambda =0.1,{\sigma}_{\mathrm{d}}=0.3\}$. Parts (b)–(i) of Table 3 show a change in each individual parameter, holding all others constant, in order to highlight the change in method performance.

In evaluating the methods, graphs representing the evolution of average $p$-value against a change in each of the parameters (all else held constant at base values $\{p=0.5,k=50,T=60,\lambda =0.1,{\sigma}_{\mathrm{d}}=0.3\}$) were also plotted. Parts (a)–(e) of Figure 2 represent these relationships. FWER and FDR are methods that rely on altering bounds and do not generate $p$-values, so they cannot be included in these graphical representations.

For $p\le 0.5$, we expect no strategies to be significant, and we can thus quantify the number of false discoveries made by each method. Table 4 records the number of false discoveries made by each method for each $p\le 0.5$ as well as the total number (out of 54 000 strategies) and proportion of false discoveries made. Table 4 includes the expected false discoveries due to the type 1 error rate of $\alpha =0.05$ (the probability of making a false discovery as allowed for by the statistical methodologies) and the unexpected false discoveries made after accounting for $\alpha$.

(a) Base case: $\{k=50,T=60,\lambda =0.1,{\sigma}_{\mathrm{d}}=0.3\}$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.50 | 75 | 0 | 0 | 0 | 0 | 1 | 0 | 4
0.60 | 100 | 12 | 100 | 30 | 100 | 47 | 2 | 10
0.75 | 100 | 100 | 100 | 98 | 100 | 98 | 64 | 50

(b) $k=25$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.50 | 51 | 0 | 0 | 1 | 0 | 1 | 0 | 0
0.60 | 100 | 38 | 100 | 25 | 100 | 16 | 2 | 8
0.75 | 100 | 100 | 100 | 92 | 100 | 93 | 46 | 27

(c) $k=100$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.50 | 93 | 0 | 0 | 3 | 7 | 3 | 1 | 6
0.60 | 100 | 32 | 100 | 48 | 100 | 34 | 5 | 11
0.75 | 100 | 100 | 100 | 98 | 100 | 97 | 83 | 21

(d) $T=12$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1
0.40 | 28 | 0 | 0 | 0 | 0 | 0 | 1 | 0
0.50 | 55 | 0 | 0 | 0 | 0 | 3 | 3 | 4
0.60 | 100 | 38 | 97 | 2 | 2 | 4 | 3 | 2
0.75 | 92 | 9 | 66 | 2 | 100 | 6 | 1 | 2

(e) $T=180$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.50 | 87 | 1 | 1 | 1 | 65 | 1 | 0 | 1
0.60 | 100 | 92 | 100 | 100 | 100 | 95 | 78 | 33
0.75 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 86

(f) $\lambda =0$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.50 | 59 | 0 | 0 | 0 | 0 | 2 | 0 | 1
0.60 | 100 | 58 | 100 | 30 | 100 | 38 | 11 | 13
0.75 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 52

(g) $\lambda =0.2$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.50 | 93 | 2 | 70 | 1 | 0 | 2 | 1 | 2
0.60 | 100 | 72 | 100 | 29 | 100 | 41 | 3 | 13
0.75 | 100 | 100 | 100 | 97 | 100 | 99 | 55 | 39

(h) ${\sigma}_{\mathrm{d}}=0.1$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1
0.50 | 76 | 0 | 0 | 0 | 0 | 0 | 1 | 1
0.60 | 100 | 41 | 100 | 6 | 100 | 4 | 4 | 7
0.75 | 100 | 100 | 100 | 29 | 100 | 31 | 7 | 15

(i) ${\sigma}_{\mathrm{d}}=0.5$

$p$ | $t$-test | FWER | FDR | WRC | Step-M | MCP | SPA | CS
---|---|---|---|---|---|---|---|---
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.40 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1
0.50 | 93 | 0 | 22 | 0 | 1 | 3 | 0 | 2
0.60 | 100 | 35 | 100 | 37 | 100 | 34 | 6 | 16
0.75 | 100 | 97 | 100 | 99 | 100 | 99 | 89 | 43

(a) Including expected | ||||||||
---|---|---|---|---|---|---|---|---|

$?$ | $?$ -test | FWER | FDR | WRC | Step-M | MCP | SPA | CS |

0.25 | 234 | 3 | 6 | 0 | 0 | 0 | 0 | 23 |

0.4 | 1 979 | 101 | 398 | 1 | 0 | 6 | 13 | 87 |

0.5 | 10 601 | 662 | 3993 | 156 | 3178 | 220 | 78 | 371 |

Total | 12 814 | 766 | 4397 | 157 | 3178 | 226 | 91 | 481 |

Proportion | 0.42 | 0.014 | 0.08 | 0.003 | 0.06 | 0.004 | 0.002 | 0.009 |

(b) Unexpected | ||||||||

$?$ | $?$ -test | FWER | FDR | WRC | Step-M | MCP | SPA | CS |

0.25 | 123 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

0.4 | 1 587 | 38 | 318 | 0 | 0 | 0 | 0 | 0 |

0.5 | 9 733 | 348 | 3602 | 9 | 2889 | 27 | 0 | 16 |

Total | 11 443 | 386 | 3921 | 9 | 2889 | 27 | 0 | 16 |

Proportion | 0.21 | 0.007 | 0.07 | 0.0001 | 0.05 | 0.0005 | 0 | 0.0003 |

#### 4.2.1 $t$-test performance

This method was implemented in order to show the effect of DMB on ordinary hypothesis tests. As can be seen in Table 4, for $p\le 0.5$ (under which all strategies are unprofitable), the $t$-test is highly affected by DMB: 42% of the strategies tested are falsely identified as significant. With our significance level set at 0.05, we expect a maximum of 2700 false positives, yet the $t$-test produces 12 814. This much larger number of false positives indicates how susceptible such simple methods are to the effect of DMB. Further, parts (b)–(e) of Figure 2 show that the method does not improve under changes to any simulation parameter value, as all average $p$-value curves move to 0 even though $p=0.5$. Table 3 shows that the $t$-test has sufficient power for $p>0.5$ (close to 100 strategies are identified as significant in all sample cases); however, this power comes with a massive penalty to accuracy when $p\le 0.5$. Table 3 also shows that the test is highly affected by changes in parameters that increase DMB. All multiple-hypothesis methods outperform the $t$-test in terms of eradicating DMB, and the results confirm that single-hypothesis methods such as the $t$-test are not viable when data mining.
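The selection effect behind this result can be reproduced in a few lines. The following is our own illustrative sketch, not the paper's simulation: $k$ strategies with zero true edge are generated, the best is selected, and a naive one-sample $t$-test is applied to it. The parameter values (`k=50`, `T=60`, 500 trials) are arbitrary choices for illustration.

```python
import numpy as np

# Toy demonstration of data-mining bias: every strategy has zero true
# mean, yet a t-test on the data-mined "winner" rejects far more often
# than the nominal 5% level, because the selection step is ignored.
rng = np.random.default_rng(0)

def best_of_k_t_stat(k=50, T=60):
    returns = rng.normal(0.0, 1.0, size=(k, T))      # every true mean is zero
    best = returns[np.argmax(returns.mean(axis=1))]  # pick the best performer
    return best.mean() / (best.std(ddof=1) / np.sqrt(T))

t_crit = 1.671  # approximate one-sided 5% critical value, 59 dof
trials = 500
false_positive_rate = np.mean([best_of_k_t_stat() > t_crit
                               for _ in range(trials)])
print(false_positive_rate)  # far above the nominal 0.05
```

Because the maximum of 50 independent $t$-statistics is compared against a single-test critical value, the rejection rate is close to 1 rather than 0.05.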

#### 4.2.2 Controlling the family-wise error rate

In testing 100 best-of-$K$ strategies while controlling the FWER using Bonferroni's correction, DMB is controlled in almost all cases. As shown in Table 3, in no instance in the specimen sample of parameter combinations does the test identify a greater-than-expected number of false discoveries for $p\le 0.5$. However, FWER control is shown to lack power, especially when $p=0.6$, where all strategies should be identified as significant. Table 4 shows that there are cases in which controlling the FWER falls short, as 766 false discoveries are made. Although this number corresponds to 1.4% of all strategies tested and is below the expected false discovery proportion of 5%, it is notable that 386 (0.7%) unexpected false discoveries are made. This indicates that some nonprofitable strategies produced returns far enough in the tail to cross the FWER-adjusted critical value. In these extreme cases, bound-modifying methods such as Bonferroni's correction fall short; methods that directly deliver appropriate $p$-values, such as WRC and MCP, perform better.
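For reference, Bonferroni's correction is mechanically simple; the sketch below (our own illustration, with arbitrary example $p$-values) shows why it controls the FWER but loses power: with $K$ strategies tested at overall level $\alpha$, each individual $p$-value must clear $\alpha/K$.

```python
# Bonferroni correction: reject only those hypotheses whose p-values
# clear the adjusted threshold alpha / K. This caps the family-wise
# error rate at alpha, at the cost of a much stricter per-test bar.
def bonferroni_reject(p_values, alpha=0.05):
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# With K = 4, the adjusted threshold is 0.05 / 4 = 0.0125.
print(bonferroni_reject([0.0004, 0.03, 0.2, 0.6]))  # [True, False, False, False]
```

Note that $p=0.03$ would be rejected in a single test at the 5% level but survives here, which is exactly the power loss observed for $p=0.6$ in Table 3.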

#### 4.2.3 Controlling the false discovery rate

Controlling the FDR was proposed as a stepwise modification of FWER control in order to improve the power of the test. Table 3 shows that FDR performs well for $p\le 0.4$, and for $p\ge 0.6$ near-perfect outputs are generally produced. However, in the case of $p=0.5$, FDR is shown to be susceptible to changes in simulation parameters that increase DMB (Table 3, parts (g) and (i)). From Table 4, it can be seen that the increase in power resulted in 3921 unexpected false discoveries. As this corresponds to just 7% of all strategies tested, it may be seen as an acceptable sacrifice in accuracy in order to obtain the increased power. Under certain extreme stress tests FDR allows too many additional false discoveries; in most situations, however, the test performs as expected, and the increased number of false discoveries is offset by the power gained.
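The stepwise mechanics can be sketched with the standard Benjamini–Hochberg step-up procedure (this is the textbook algorithm, not the authors' code, and the example $p$-values are hypothetical): sort the $K$ $p$-values, find the largest rank $i$ with $p_{(i)} \le (i/K)\alpha$, and reject every hypothesis up to that rank.

```python
import numpy as np

# Benjamini-Hochberg step-up procedure for controlling the FDR.
def bh_reject(p_values, alpha=0.05):
    p = np.asarray(p_values, dtype=float)
    K = p.size
    order = np.argsort(p)                               # ranks, smallest first
    passed = p[order] <= alpha * np.arange(1, K + 1) / K
    reject = np.zeros(K, dtype=bool)
    if passed.any():
        cutoff = int(np.max(np.nonzero(passed)[0]))     # largest qualifying rank
        reject[order[:cutoff + 1]] = True               # reject all up to it
    return reject.tolist()

# Bonferroni at 0.05 / 5 = 0.01 would reject only the first p-value;
# the step-up thresholds 0.01, 0.02, ..., 0.05 reject the first four.
print(bh_reject([0.001, 0.012, 0.03, 0.04, 0.8]))  # [True, True, True, True, False]
```

The rising per-rank thresholds are the source of both the extra power and the extra false discoveries noted above.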

#### 4.2.4 White’s reality check

WRC is the first method employed for which $p$-values for each strategy are obtained directly from a sampled distribution rather than from modified bounds. From Table 3, it can be seen that WRC performs well for $p\le 0.5$, producing a minimal number of false discoveries. WRC occasionally identifies only a low number of significant strategies when $p>0.5$, suggesting that in certain conditions it has limited power to identify potentially profitable strategies. Table 4 illustrates how well WRC eradicates DMB, as it produces an unexpected false discovery proportion of just 0.01%. Parts (b)–(e) of Figure 2 also show that WRC copes well with changes in the simulation parameters, with no drastic falls in the average $p$-value produced for any parameter change. Of all the methods tested, WRC is shown to be the method least affected by DMB for $p=0.5$, as the average $p$-value curves it produces are generally the closest to 1 over all parameter changes. WRC can therefore be seen as a conservative method that can be used in any market condition in situations where accuracy is preferred over power.
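The core reality-check idea can be illustrated as follows. This is our own toy sketch: for brevity it uses a simple i.i.d. bootstrap in place of White's stationary bootstrap, and all sizes and seeds are arbitrary. The test statistic is the best mean return over all $K$ rules, and its null distribution is built by bootstrapping the recentred return matrix, so the best-of-$K$ selection step is reflected in the $p$-value.

```python
import numpy as np

# Simplified reality-check p-value: bootstrap the null distribution of
# the *maximum* mean return across K rules, so that selecting the best
# rule no longer inflates significance.
def wrc_p_value(returns, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    K, T = returns.shape
    stat = returns.mean(axis=1).max()                     # best observed mean
    centred = returns - returns.mean(axis=1, keepdims=True)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, T, size=T)                  # resample time indexes
        boot[b] = centred[:, idx].mean(axis=1).max()      # max over rules
    return float(np.mean(boot >= stat))

rng = np.random.default_rng(1)
no_edge = rng.normal(0.0, 1.0, size=(20, 120))            # no rule has an edge
with_edge = no_edge.copy()
with_edge[0] += 0.8                                       # one genuinely good rule
p_no_edge = wrc_p_value(no_edge)
p_with_edge = wrc_p_value(with_edge)
print(p_no_edge, p_with_edge)
```

With no true edge the $p$-value is typically large, while the rule with a genuine edge clears the bootstrap distribution of the maximum and is flagged as significant.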

#### 4.2.5 Hansen’s test for superior predictive ability

The evaluation process shows that Hansen's test for SPA performs very similarly to WRC. From Table 4, it can be seen that SPA perfectly eradicates DMB, as it makes 0 unexpected false discoveries. Hansen (2005) proposes that SPA improves on the power of WRC. However, Table 3 shows that for $p\ge 0.6$, SPA identifies significantly fewer profitable strategies than WRC on almost all occasions, contrary to Hansen's claim. The reason for this is not entirely clear, but it may be because SPA seeks to improve on the power of WRC by disregarding the nonprofitable strategies in the construction of the test statistic; in our best-of-$K$ implementation there are no unprofitable strategies to be disregarded, and this may negatively affect the power of the test. Parts (b)–(e) of Figure 2 show that for $p=0.5$, SPA does indeed have greater power than WRC, but in these cases higher $p$-values are preferred, and the increased power does not carry over to $p\ge 0.6$. This suggests that WRC and MCP are the better-performing methods under our implementation.

#### 4.2.6 Monte Carlo permutation

MCP produces very similar results to WRC over the evaluation process. The method also almost perfectly eradicates DMB: only 27 unexpected false discoveries are made (0.05% of all strategies tested). MCP's power is also very close to that of WRC, as shown in Table 3 by the very similar numbers of significant strategies identified when $p\ge 0.6$, and in parts (a)–(e) of Figure 2, where the average $p$-value curves produced by MCP generally lie very close to those of WRC. MCP can thus be viewed as a conservative method that can be used in any market condition; since the two methods correspond almost perfectly, MCP may even be used to cross-check results obtained with WRC.
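The permutation alternative to bootstrapping can be sketched as below. This is our own simplified toy version: under the null that a rule's long/short signals carry no information, shuffling each rule's signal sequence against the fixed market returns generates the null distribution of the best rule's mean return. The signal construction and sizes are hypothetical.

```python
import numpy as np

# Monte Carlo permutation p-value for the best of K signal-based rules:
# shuffling signals in time destroys any signal/return link while
# preserving the marginal distributions of both series.
def mcp_p_value(signals, market, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    stat = (signals * market).mean(axis=1).max()      # best observed rule
    perm = np.empty(n_perm)
    for b in range(n_perm):
        shuffled = rng.permuted(signals, axis=1)      # break the link in time
        perm[b] = (shuffled * market).mean(axis=1).max()
    return float(np.mean(perm >= stat))

rng = np.random.default_rng(2)
market = rng.normal(0.0, 1.0, size=240)
signals = rng.choice([-1, 1], size=(20, 240))         # 20 uninformed rules
informed = signals.copy()
informed[0] = np.sign(market)                         # one rule that "knows" the sign
p_uninformed = mcp_p_value(signals, market)
p_informed = mcp_p_value(informed, market)
print(p_uninformed, p_informed)
```

As with the bootstrap sketch, the maximum is taken inside every permutation, which is what neutralizes the best-of-$K$ selection bias.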

#### 4.2.7 Romano and Wolf’s test

The step-M method is a stepwise implementation of WRC designed to improve its power, and Figure 2(a) shows that the power is indeed improved. Step-M produces perfect outputs in many of the specimen parameter combinations in Table 3, indicating that in some cases the method strikes the right balance between power and accuracy. Owing to the similar structure of the tests, step-M produces results very close to those of FDR, and Table 4 shows that it suffers from the same pitfall: in extreme stress-testing situations, the added power from the stepwise repetition of WRC results in 2889 unexpected false discoveries (5% of all strategies tested). Figure 2(c) shows such a case, as the method's accuracy is seriously impaired when $k=100$ (the $p$-value curve goes to 0). In all other cases, however, accuracy is maintained over changes in the simulation parameters (the $p$-value curve stays close to 1). This suggests that step-M could be used in situations where WRC and MCP are considered too conservative.
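The stepwise repetition can be sketched as follows. This is a deliberately simplified take on Romano and Wolf's procedure (again with an i.i.d. bootstrap and arbitrary example data, not the paper's implementation): surviving rules' means are compared against the bootstrap $(1-\alpha)$ quantile of the maximum recentred mean over the not-yet-rejected set; winners are rejected and removed, and the step repeats. Each removal lowers the maximum, loosening the critical value, which is where the extra power over a single reality check comes from.

```python
import numpy as np

# Simplified stepwise (StepM-style) rejection: repeat the reality-check
# comparison on the shrinking set of surviving rules until no rule's
# mean exceeds the bootstrap critical value.
def stepm_reject(returns, alpha=0.05, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    K, T = returns.shape
    means = returns.mean(axis=1)
    centred = returns - means[:, None]
    active, rejected = list(range(K)), []
    while active:
        sub = centred[active]
        boot = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, T, size=T)
            boot[b] = sub[:, idx].mean(axis=1).max()   # max over survivors
        crit = np.quantile(boot, 1 - alpha)
        newly = [i for i in active if means[i] > crit]
        if not newly:                                  # no further rejections
            break
        rejected += newly
        active = [i for i in active if i not in newly]
    return sorted(rejected)

rng = np.random.default_rng(3)
returns = rng.normal(0.0, 1.0, size=(10, 120))
returns[0] += 1.0                                      # one clearly profitable rule
identified = stepm_reject(returns)
print(identified)  # rule 0 should be among the rejections
```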

#### 4.2.8 Corradi and Swanson’s extension

CS performs similarly to WRC. For $p\le 0.5$, CS almost perfectly eradicates DMB in both Table 3 and Table 4, producing only 16 unexpected false discoveries out of 54 000. Corradi and Swanson (2011) propose that CS improves on the power of WRC. However, in Table 3, for $p\ge 0.6$ the number of significant strategies identified by CS is less than or equal to the number identified by WRC, suggesting that the method has less power to identify significant strategies. Parts (b)–(e) of Figure 2 show that the average $p$-value curve for CS is indeed below that of WRC when $p=0.5$, as Corradi and Swanson (2011) suggest, but this increased power does not seem to carry over as $p$ increases, as can be seen in Figure 2(a) (the $p$-value curve crosses above the curves for WRC and MCP when $p=0.75$).

## 5 Conclusion

This study reviewed the various statistical methodologies available for testing multiple systematic trading strategies, implementing them under simulation with known ATRs in order to critically compare and evaluate them. The evaluation used a best-of-$K$ simulation method to replicate the real-world situation of investors selecting their best strategy $N$ times. The results showed that WRC and MCP were the best-performing methods in terms of eradicating DMB, while step-M was the most viable alternative when WRC and MCP are too conservative. The claims made by Hansen (2005) and Corradi and Swanson (2011) that SPA and CS are more powerful than WRC were shown not to hold under our implementation. Further, methods that rely on the manipulation of bounds, such as FWER and FDR control, were shown to be inferior to counterparts that directly produce $p$-values, such as WRC and step-M, in terms of eradicating DMB.

## Declaration of interest

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper. The views expressed are our own and any errors made are our responsibility.

## References

- Aronson, D. (2011). Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals. Wiley.
- Bailey, D. H., Borwein, J. M., de Prado, M. L., and Zhu, Q. J. (2014). Pseudomathematics and financial charlatanism: the effects of backtest overfitting on out-of-sample performance. Notices of the American Mathematical Society 61(5), 458–471.
- Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57(1), 289–300.
- Chordia, T., Goyal, A., and Saretto, A. (2017). $p$-hacking: evidence from two million trading strategies. Research Paper No. 17-37, August 12, Swiss Finance Institute (https://doi.org/10.2139/ssrn.3017677).
- Corradi, V., and Swanson, N. R. (2011). The White reality check and some of its recent extensions. Working Paper. URL: https://bit.ly/2xe8IBc.
- Diebold, F. X., and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics 13(3), 253–263.
- Hansen, P. R. (2005). A test for superior predictive ability. Journal of Business & Economic Statistics 23(4), 365–380.
- Harris, M. (2016). Limitations of quantitative claims about trading strategy evaluation. Working Paper, July 15, Social Science Research Network (https://doi.org/10.2139/ssrn.2810170).
- Harvey, C. R., and Liu, Y. (2013). Multiple testing in economics. Working Paper, November 21, Social Science Research Network (https://doi.org/10.2139/ssrn.2358214).
- Harvey, C. R., and Liu, Y. (2014). Evaluating trading strategies. Journal of Portfolio Management 40(5), 108–118.
- Masters, T. (2006). Monte Carlo evaluation of trading systems. Working Paper. URL: http://www.evidencebasedta.com/montedoc12.15.06.pdf.
- Novy-Marx, R. (2016). Testing strategies based on multiple signals. Working Paper. URL: https://bit.ly/2NFjWck.
- Peterson, B. (2015). Developing & backtesting systematic trading strategies. Research Paper. URL: https://bit.ly/2r9zgBF.
- Romano, J. P., and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica 73(4), 1237–1282.
- White, H. (2000). A reality check for data snooping. Econometrica 68(5), 1097–1126.
