Journal of Risk Model Validation
ISSN 1753-9579 (print); 1753-9587 (online)
Editor-in-chief: Steve Satchell
Need to know
- Using the ordinary hold-out (‘train/test split’) procedure on a specific set of assets is inappropriate to control for overfitting in NN-based portfolio optimization.
- We confirm the overall assessment of the out-of-sample performance of estimation-based strategies in terms of Sharpe ratio, Sortino ratio and CEQ given in the literature.
- We show that NN-based strategies, if set up correctly, systematically outperform the 1/N strategy.
Abstract
In this paper we measure the out-of-sample performance of sample-based rolling-window neural network (NN) portfolio optimization strategies. We show that if NN strategies are evaluated using the holdout (train–test split) technique, then high out-of-sample performance scores can commonly be achieved. Although this phenomenon is often employed to validate NN portfolio models, we demonstrate that it constitutes a “fata morgana” that arises due to a particular vulnerability of portfolio optimization to overfitting. To assess whether overfitting is present, we set up a dedicated methodology based on combinatorially symmetric cross-validation that involves performance measurement across different holdout periods and varying portfolio compositions (the random-asset-stabilized combinatorially symmetric cross-validation methodology). We compare a variety of NN strategies with classical extensions of the mean–variance model and the 1 / N strategy. We find that it is by no means trivial to outperform the classical models. While certain NN strategies outperform the 1 / N benchmark, of the almost 30 models that we evaluate explicitly, none is consistently better than the short-sale constrained minimum-variance rule in terms of the Sharpe ratio or the certainty equivalent of returns.
1 Introduction
In 1952, Harry Markowitz derived the optimal strategy for the allocation of wealth across risky assets for an investor who cares about only the mean and the variance of portfolio returns (Markowitz 1952). The implementation of this strategy involves the estimation of moments by their historic sample analogues, which entails estimation errors. The impact of these errors turns out to be substantial, leading to extreme portfolio weights that fluctuate over time and an overall poor performance (Michaud 1989; Litterman 2003). A vast body of literature has since evolved on methods to improve the Markowitz model, including classical strategies that focus on an improved management of estimation error and, more recently, machine-learning-based techniques. This paper provides a detailed comparison of the out-of-sample performance of sample-based rolling-window neural network (NN) portfolios with the performance of strategies that select the portfolio composition based on estimated quantities without using NNs.
Classical estimation-based methods focus on improved moment estimation and/or an improved handling of the impact of estimation errors. The literature ranges from Bayesian models, such as Bayes models with diffuse priors (Barry 1974; Bawa et al 1979), Bayes shrinkage estimators (Korkie et al 1979; Jobson and Korkie 1980; Jorion 1985) and Bayes models with priors from asset-pricing models (Pástor 2000; Pástor and Stambaugh 2000), to portfolios that diversify across market and estimation risk (Kan and Zhou 2007) and portfolios that focus on stabilizing the estimation of the covariance matrix (Best and Grauer 1992; Chan et al 1999; Ledoit and Wolf 2004a, b), to portfolio rules that impose short-selling constraints (Frost and Savarino 1988; Chopra 1993; Jagannathan and Ma 2003). The NN-based portfolio strategies considered here circumvent the need to explicitly estimate expected returns and covariances by extracting abstract and potentially more robust and informative features. NN-based portfolio optimization methods range from simple feedforward NNs and long short-term memory (LSTM) methods (Zhang et al 2020; Ma et al 2020) to reinforcement learning systems such as neural bandit models (Riquelme et al 2018), orthogonal bandits (Shen et al 2015) and risk-adjusted bandits (Xiaoguang and Feng 2017). There are also “semi-NN strategies”, which use an NN to estimate moments but a classical portfolio strategy to compute weights (Freitas et al 2009; Du 2022). Finally, there is the 1/N strategy (Epstein and Simon, 1960, Baba Mezi’a 42a), which ignores the data entirely and allocates an equal portion of wealth to each asset.
It is a somewhat intimidating insight that, despite 70 years of theoretical development accompanied by portfolio models of increasing sophistication, it is still difficult to outperform the 1/N rule. Experimental studies demonstrate that of the mentioned estimation-based portfolio models (Markowitz 1952; Hodges and Brealey 1972; Michaud 1989; Litterman 2003; Best and Grauer 2015; Barry 1974; Bawa et al 1979; Korkie et al 1979; Jobson and Korkie 1980; Jorion 1985; Pástor 2000; Pástor and Stambaugh 2000; Goldfarb and Iyengar 2003; Garlappi et al 2007; Kan and Zhou 2007; MacKinlay and Pástor 2000; Best and Grauer 1992; Chan et al 1999; Ledoit and Wolf 2004a, b; Frost and Savarino 1988; Chopra 1993; Jagannathan and Ma 2003), across various empirical data sets of monthly returns “none is consistently better than the naive benchmark in terms of Sharpe ratio, certainty-equivalent return or turnover” (DeMiguel et al 2009a; see also Bloomfield et al 1977; Jorion 1991; Kritzman et al 2010). It is well known that a long time series of asset prices is needed to estimate expected returns (Merton 1980). In a similar vein, the estimation of covariances behaves poorly when the price history is limited, and the inversion of the covariance matrix (which is often needed to compute portfolio weights) is sensitive to estimation errors. In other words, the problem is largely attributable to the data, namely, a low signal-to-noise ratio and an insufficient quantity for reliable signal extraction. (It speaks volumes about the difficulty of modeling with financial data that a similar “catastrophe” has overtaken the Nobel Prize-winning capital asset pricing model, where Fama and French (2004) report that “the empirical record of the model is poor – poor enough to invalidate the way it is used in applications”.) Given this issue, a potential preconception might be that machine learning strategies suffer from data constraints in a similar way to the estimation-based strategies.
The aim of this paper is to compare the out-of-sample performance of NN portfolio strategies (specifically, the varying architectures of the running-window feedforward NN, the running-window LSTM and the respective models with short-sale constraint) with that of the classical estimation-based strategies across a variety of portfolio compositions and performance metrics. We measure the out-of-sample expected returns, their standard deviation, the Sharpe ratio, the Sortino ratio and the certainty equivalent (CEQ) of the expected utility of a mean–variance investor on a data set of daily returns from almost 2000 US assets over a 20-year period between 2000 and 2021. At the core of this analysis stands a dedicated cross-validation methodology, the random-asset-stabilized combinatorially symmetric cross-validation (RAS-CSCV) methodology, which allows for a rigorous comparison of the performance metrics of machine-learning-based portfolio strategies in view of overfitting. Our main empirical findings can be summarized as follows.
- (1)
We demonstrate that common methods to control for overfitting are inappropriate for the portfolio optimization task.
- (2)
We confirm the overall assessment of the out-of-sample performance of estimation-based strategies given in the literature.
- (3)
We show that NN-based strategies, if calibrated correctly, indeed perform better than the 1/N benchmark within the scope of our data set.
But before going into detail, we begin with an explanation as to why a dedicated methodology is needed for a seemingly simple comparison.
Care should be taken when evaluating the performance of NN portfolios. First, on the technological side, the simple fact that NNs require training distinguishes them from the estimation-based strategies. In the case of the estimation-based strategies the researchers’ knowledge influences the design of a specific model, which is then completed by the estimation of model parameters (such as the mean or the variance). An untrained NN comes with no “knowledge”; its architecture and parameters have to be fitted to the target. Maintaining the balance between underfitting and overfitting is a perennial concern for machine learning (Bishop 2007), but it turns out to be particularly difficult when dealing with portfolio strategies. To avoid overfitting it is common to train NNs via a holdout procedure. Only a part of the data, the “training set”, is used for training, while the rest is kept aside as a “test set” in order to measure the trained models’ performance on unseen data of a similar type. In practice, model selection with NNs involves iterative training and testing over a number of possible configurations (such as NN architectures, sets of hyperparameters and numbers of training periods) to tune the system for better performance. While this is appropriate for many problems of practical interest, the underlying assumption that overfitting has been avoided as long as the performance is satisfactory for the unseen data can be deceptive. In fact, the application of NNs for portfolio optimization bears considerable risk of overfitting to both training and test sets. (We leave the semantic question of whether these phenomena should be termed “overfitting” or “underfitting”, or a combination of both, to the preference of the reader.)
- (1)
The holdout/train–test split procedure is inadequate if the sample size is small. The training set will be too short for the NN to learn and the test set will be too short to conclude anything with confidence. For image classification the general rule of thumb for deep NN training is at least 1000 observations per class (Cireşan et al 2012), corresponding to roughly 80 years of monthly data per asset. In a similar spirit, Weiss and Kulikowski (1991) argue that the holdout method should not be applied with fewer than 1000 observations. However, the signal-to-noise ratio is lower in financial data, which entails an “inflation” of the size requirement and/or the usage of smaller NNs. Another point of reference is given by DeMiguel et al (2009a), who argue that for variations of the sample-based mean–variance strategy with 25 assets, around 3000 observations of monthly returns are needed to outperform the 1/25 portfolio.
- (2)
If an NN model is tuned over a large number of configurations and architectures it becomes likely that from time to time a good performance is observed on the unseen data – even in cases where the model has learned almost nothing. This is a particular risk for portfolio optimization since it is possible to select assets that perform well over the test set by chance. As a benchmark we measure the test performance of randomly composed portfolios in our experiments.
- (3)
Even with enough data available, the location and size of the test set remain potential sources of overfitting. For instance, if the test set happens to coincide with a period of uniform market well-being, then the high-performance metrics might persuade us to accept an invalid strategy that would have performed poorly in a different market. If the test set is taken from the end of a time series, then the most recent and most informative observations are lost for training. If the test set is taken from the beginning of the time series, then the model is tested on the least representative portion of the data. Different holdouts are thus, as described earlier, likely to lead to different conclusions, with a considerable risk of overfitting (Hawkins 2004).
- (4)
The choice of assets bears the risk of overfitting too. We might be tempted to base experiments on recognizable names (such as the stocks of market leaders) or involuntarily discard unsuccessful names, creating methodological bias toward successful assets. Whatever the model, if it is trained and tested on a given list of assets, then it is expected to perform well under market situations that are similar to those encountered during training and testing. If the chosen assets are informative only of a specific market situation (eg, the dynamics of a specific industry sector), there is little reason to assume that the model would perform well in a radically different environment. Any manual choice of assets eventually leads to model architectures that are biased toward the statistics of the environment represented by this choice.
Overall, the common practice of measuring the performance of an NN portfolio for a specific choice of assets on a specific temporal split of their time series data almost immediately results in overfitting. We demonstrate this with some simple experiments. This phenomenon can be summarized as a form of “backtest overfitting”, where excessive model calibration, poor data and an overly easy test target lead to spuriously high out-of-sample performance metrics. Backtest overfitting has, of course, been studied in great detail in the literature (see, for example, Bailey et al 2017, 2014), but the particular training procedure for NNs together with the aforementioned data limitations imply that backtest overfitting is a systemic problem present in NN portfolio strategies. Further, with respect to the scientific literature, a tendency appears to have evolved whereby the greater the financial interests and the hotter the topic, the less likely the results are to be accurate (Ioannidis 2005; Economist 2013). Such concerns are particularly serious and also well documented in the context of investment decisions (Bailey et al 2017). In summary, NN-based portfolio optimization is located in a “hot-spot region”, with social, technological and data-related challenges. This requires a certain level of scepticism with respect to positive reports and commercial promises and it motivates a benchmark comparison of NN portfolios with other strategies. Our contribution is to assess common NN models and training procedures in view of the above points. Our main experimental findings can be summarized as follows.
- (1)
Use of the ordinary holdout (“train–test split”) procedure on a specific set of assets is inappropriate to control for overfitting in NN-based portfolio optimization. Following this procedure, we almost immediately obtain spuriously high out-of-sample performance metrics for varying NN architectures.
- (2)
We confirm the overall assessment of the out-of-sample performance of estimation-based strategies in terms of the Sharpe ratio, the Sortino ratio and the CEQ given in the literature (see, for example, DeMiguel et al 2009a; Jagannathan and Ma 2003). Consistent with previous reports (Jagannathan and Ma 2003; Kritzman et al 2010), we observe the strongest performance of all estimation-based strategies for the short-sale constrained minimum-variance strategy.
- (3)
We show, using a dedicated methodology for performance measurement, that NN-based strategies, if set up correctly, systematically outperform the 1/N strategy. In our setup they outperform the short-sale constrained minimum-variance strategy in terms of expected returns and the Sortino ratio, but they underperform it in terms of the standard deviation of returns, the Sharpe ratio and the CEQ. Setting up the NN strategies requires advanced techniques such as weight shrinkage, weight regularization and the subtle tuning of regularization and model architectures.
The remainder of this paper is structured as follows. Section 2 describes our methodology: Section 2.1 contains the methods of performance measurement, and Section 2.2 describes the asset allocation models as well as the NN training procedures. Section 3 contains our experimental work: Section 3.1 describes our experiments, Section 3.1.1 describes our data and Sections 3.1.2 and 3.1.3 describe the NN models under investigation; Section 3.2 presents our experimental findings (Section 3.2.1 displays simple overfitting cases, while Section 3.2.2 focuses on the selection of NN strategies and Section 3.2.3 compares the chosen NNs with estimation-based strategies). Section 4 states our conclusions.
2 Methodology
2.1 Methodology for performance evaluation
We investigate portfolio optimization with $N$ assets, which are described by a time series of a total of $T$ daily prices $p_{i,t}$, $i = 1, \dots, N$, $t = 1, \dots, T$. Our portfolio strategies select the relative weights $w_i$, $i = 1, \dots, N$, with $\sum_{i=1}^{N} w_i = 1$, for the allocation of investment in a forthcoming period of time. The investment decision (ie, the computation of weights) is based on the asset prices of a past period of possibly different length.
2.1.1 Rolling-window performance measurement
We define the measures of performance employed for our comparison. With the weights being updated monthly, the return for day $t$ in month $m$ of a portfolio with weights $w_m = (w_{1,m}, \dots, w_{N,m})$ is given by

$$r_{t,m} = \sum_{i=1}^{N} w_{i,m}\, r_{i,t,m}, \tag{2.1}$$

where $r_{i,t,m}$ denotes the return of asset $i$ on day $t$ of month $m$. Given the time series of daily out-of-sample returns, we compute the following quantities for each strategy. The empirical mean and standard deviation of continuously compounded daily out-of-sample returns in month $m$ are

$$\hat{\mu}_m = \frac{1}{d_m} \sum_{t=1}^{d_m} \log(1 + r_{t,m}), \tag{2.2}$$

$$\hat{\sigma}_m = \sqrt{\frac{1}{d_m - 1} \sum_{t=1}^{d_m} \bigl(\log(1 + r_{t,m}) - \hat{\mu}_m\bigr)^{2}}, \tag{2.3}$$

where $d_m$ denotes the number of business days in month $m$. The out-of-sample mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$ over a period of $M$ months are then given by the analogous quantities computed from all $D = \sum_{m=1}^{M} d_m$ daily returns of that period. The out-of-sample Sharpe ratio and CEQ are computed as

$$\widehat{\mathrm{SR}} = \frac{\hat{\mu}}{\hat{\sigma}}, \qquad \widehat{\mathrm{CEQ}} = \hat{\mu} - \frac{\gamma}{2}\, \hat{\sigma}^{2},$$

where $\gamma$ denotes the risk aversion of the mean–variance investor. The downside deviation (DD) and Sortino ratio are computed by taking account of only the negative returns in the variance. Let $\mathcal{D}^{-}$ be the subset of the days of months $1, \dots, M$ with negative returns (ie, $(t, m) \in \mathcal{D}^{-}$ if and only if $r_{t,m} < 0$), and let $D^{-}$ be the number of days in $\mathcal{D}^{-}$. We compute as before

$$\widehat{\mathrm{DD}} = \sqrt{\frac{1}{D^{-}} \sum_{(t,m) \in \mathcal{D}^{-}} \log(1 + r_{t,m})^{2}}, \qquad \widehat{\mathrm{Sortino}} = \frac{\hat{\mu}}{\widehat{\mathrm{DD}}}.$$
Note that the way these quantities are defined varies in the literature. Our formulas are identical to those of Makridakis et al (2022), but they are different from those of Zhang et al (2020) by a logarithm in the compounding of returns.
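For concreteness, the following sketch shows one way to compute these quantities from a series of daily portfolio returns (Python with NumPy; the function name, the risk-aversion default $\gamma = 1$ and the downside-deviation convention are our own illustrative choices, not specifications taken from this paper).

```python
import numpy as np

def performance_metrics(daily_returns, month_ids, gamma=1.0):
    """Out-of-sample metrics from daily simple portfolio returns r_{t,m}.

    daily_returns: 1D array of daily returns
    month_ids:     1D array of equal length mapping each day to its month
    gamma:         risk aversion used in the CEQ (illustrative default)
    """
    log_r = np.log1p(np.asarray(daily_returns))   # continuously compounded returns
    month_ids = np.asarray(month_ids)

    # Monthly means and standard deviations, as in (2.2)-(2.3)
    months = np.unique(month_ids)
    mu_m = np.array([log_r[month_ids == m].mean() for m in months])
    sd_m = np.array([log_r[month_ids == m].std(ddof=1) for m in months])

    # Aggregate statistics over the whole period (all daily returns pooled)
    mu, sd = log_r.mean(), log_r.std(ddof=1)
    sharpe = mu / sd
    ceq = mu - 0.5 * gamma * sd ** 2

    # Downside deviation and Sortino ratio: only days with negative returns
    neg = log_r[log_r < 0.0]
    dd = np.sqrt(np.mean(neg ** 2)) if neg.size else np.nan
    sortino = mu / dd
    return {"mu": mu, "sd": sd, "Sharpe": sharpe, "CEQ": ceq,
            "DD": dd, "Sortino": sortino, "mu_m": mu_m, "sd_m": sd_m}
```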
2.1.2 Train–test splitting and CSCV
CSCV was proposed by Bailey et al (2017) as a general method of controlling overfitting in backtesting procedures. We adapt this method to the context of NN portfolio optimization. The basic idea is to measure the performance of an NN portfolio through multiple training and testing cycles on different subsets of the data. Specifically, CSCV subdivides the data set into an even number, $S$, of equally sized and disjoint subsets. Then, equally sized training and test sets are formed by the respective aggregation of $S/2$ of the subsets. Since there is a total of $\binom{S}{S/2}$ ways to choose $S/2$ subsets of the joint $S$ sets, this procedure generates a total of $\binom{S}{S/2}$ training and test sets. The validation procedure is symmetric because each combination of subsets is interpreted once as the training set and once as the test set. For example, if $S = 2$, then the data set is split into two subsets $A$ and $B$. First, we train the NN on $A$ and test it on $B$, then we train exactly the same configuration on $B$ and test it on $A$. If $S = 4$, then the data is split into $A$, $B$, $C$ and $D$. First, we train the NN on $A \cup B$ and test it on $C \cup D$, then we train it on $A \cup C$ and test it on $B \cup D$, then we train it on $A \cup D$ and test it on $B \cup C$, and then the whole procedure is repeated with training and test sets reversed.
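The combinatorial construction of the splits can be sketched as follows (Python; `cscv_splits` is our own illustrative helper, not code from the paper). The data is cut into $S$ contiguous blocks, and every choice of $S/2$ blocks serves once as the training set, with the complementary blocks forming the test set.

```python
from itertools import combinations
import numpy as np

def cscv_splits(n_obs, S):
    """Yield (train_idx, test_idx) index pairs for CSCV with S blocks."""
    assert S % 2 == 0, "S must be even"
    blocks = np.array_split(np.arange(n_obs), S)   # contiguous, (almost) equally sized
    for chosen in combinations(range(S), S // 2):
        rest = [b for b in range(S) if b not in chosen]
        train_idx = np.concatenate([blocks[b] for b in chosen])
        test_idx = np.concatenate([blocks[b] for b in rest])
        yield train_idx, test_idx

# For S = 4 this yields the 6 = C(4, 2) train-test combinations described above.
```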
We execute CSCV to control for overfitting along the time dimension (corresponding to the time index $t$ of the price series). Specifically, in view of problems (1)–(3) in Section 1, we fix a list of assets, a list of competing strategies, the NN architecture and all hyperparameters (ie, we abstain from hyperparameter tuning), and we train and test the NN on CSCV subsets at disjoint intervals of time. For each CSCV combination we record the performance statistics of all competing strategies, ranking the performance metrics relative to competitor strategies and the 1/N portfolio. We record the means and medians of out-of-sample Sharpe ratios and we count how often the strategies perform better than the 1/N portfolio. We choose the 1/N portfolio as our benchmark because it is uninformed. It is natural to require any trained strategy to have a higher test performance than 1/N irrespective of the specific split. The performance of the 1/N strategy can also be viewed as a quantification of the overall market performance of the selected assets. A high (low) 1/N Sharpe ratio reflects overall favorable (unfavorable) market conditions.
Figure 1 shows scatter plots of (in-sample/out-of-sample) Sharpe ratio pairs recorded from the CSCV of two distinct NN models executed on the same 20 randomly chosen assets (see Section 3 for details regarding the experimental setup). (The assets are “ALV”, “CEV”, “IX”, “MCO”, “SNA”, “SNDA”, “HMN”, “TWN”, “NVR”, “SNV”, “SHYF”, “UHAL”, “LXU”, “ESP”, “EHC”, “TELL”, “SLG”, “AIRT”, “PSFW”, “KEX”.) Part (a) of the figure shows a well-calibrated model (LSTM-NN-SC IV; see Section 3) with a CSCV average out-of-sample Sharpe ratio of 0.758. The configuration in part (b) (FF-NN-SC I; see Section 3) has a higher average in-sample Sharpe ratio and a lower average out-of-sample Sharpe ratio, indicating overfitting (to the training set). For comparison, scatters of (in-sample/out-of-sample) tuples from the 1/N portfolio are added. As no calibration is involved, the performance of the 1/N portfolio depends only on the market in the respective periods of time. Because CSCV treats training and test sets symmetrically, scatters of the 1/N strategy are always distributed symmetrically with respect to the diagonal on which in-sample and out-of-sample performance coincide. If performance were measured by averaging the performances on individual CSCV subsets (examples of such measures of performance are the mean or the average monthly Sharpe ratio), then all 1/N scatters would fall on one line of fixed slope. Note that we compute the Sharpe ratios from the means and variances of all returns over the train–test sets. In this case, CSCV scatters are still symmetric with respect to the diagonal, but they are no longer located exactly on one line (see Figure 1(a)). To show the degree of performance degradation under varying market conditions, a linear regression is conducted on the Sharpe pairs, whose slope coefficient is indicated in the figure notes.
Finally, it should be mentioned that CSCV destroys time series correlations along its splits. If CSCV were applied with a number of splits that is large compared with the length of the time series, this might result in an overly pessimistic assessment of the number of NN strategies that accurately learn these correlations. In our experiments the number of splits will be small compared with the length of the time series.
2.1.3 Random asset selection
While we employ CSCV to detect overfitting along the time dimension, the choice of assets for the investment portfolio (corresponding to the asset index $i$) remains an important vulnerability. Since these assets are fixed in advance, even a good CSCV performance leaves the NN trained and tested on a small subset of market data. Such an NN can be expected to perform well only in specific market situations marked out by the chosen assets (see problem (4) in Section 1). No matter which assets are chosen, they will reflect only a fraction of the plausible market dynamics. CSCV might suggest that a model is valid for one choice of assets, but the same model might be rejected for another choice. To address this point we use random asset selection alongside CSCV and we record joint performance scores both over the train–test splits of CSCV and over the chosen portfolios. Our procedure, RAS-CSCV, follows CSCV in that we fix a number of strategies and we compare performance metrics across a collection of training and test sets. RAS-CSCV goes beyond CSCV in that the training and test sets and the portfolio composition are chosen (uniformly at random) to check for overfitting (eg, computing the probability of backtest overfitting (Bailey et al 2017)). The implicit reasoning is that the setup of the NN strategy (ie, the architecture and specific calibration, but not the NN parameters) should reflect the features of a typical market, which are universal beyond the specific environment marked out by a choice of assets. Specifically, for a fixed strategy, first a list of assets is chosen uniformly at random, then the NN training and testing are organized in the CSCV scheme. We record the performance metrics of competing strategies over both randomly chosen portfolios and CSCV subsets, computing the average and median out-of-sample Sharpe ratios (taking account of both CSCV subsets and portfolios), and we count how frequently the model outperforms the 1/N benchmark. To assess the impact of portfolio size on performance, we further enhance the RAS-CSCV procedure by choosing portfolios of varying sizes.
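The outer loop of the procedure can be summarized in the following sketch (Python; the function, the `run_cscv` callback and its return format are our own illustrative abstractions of the steps described above).

```python
import numpy as np

def ras_cscv(all_assets, strategies, n_portfolios, sizes, S, run_cscv, seed=0):
    """Illustrative outer loop of RAS-CSCV.

    all_assets:   list of available tickers
    strategies:   dict mapping a name to a strategy (including the 1/N benchmark)
    sizes:        portfolio sizes to sample from
    run_cscv:     callback running CSCV for (assets, strategy, S) and returning
                  a list of (in_sample, out_of_sample) performance pairs
    """
    rng = np.random.default_rng(seed)
    records = []
    for _ in range(n_portfolios):
        n = rng.choice(sizes)                                   # random portfolio size
        assets = list(rng.choice(all_assets, size=n, replace=False))
        for name, strategy in strategies.items():
            for in_perf, out_perf in run_cscv(assets, strategy, S):
                records.append({"assets": tuple(assets), "strategy": name,
                                "in_sample": in_perf, "out_of_sample": out_perf})
    # Means, medians and the frequency of outperforming 1/N are computed downstream.
    return records
```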
The RAS-CSCV procedure is illustrated in Figure 2. Part (a) shows a scatter plot of (in-sample/out-of-sample) Sharpe ratio pairs obtained simultaneously through the choice of 300 random portfolios of varying sizes and CSCV from an NN-based strategy (LSTM-NN-SC IV; see below) and the 1/N strategy. For easier comparability we also added a linear regression over all scatters. The investigation of bulk scatter shifts allows us to draw a more reliable conclusion about the performance of the compared models. Since NNs are trained, their scatters will usually be shifted toward larger in-sample performance metrics than those of the 1/N strategy. From a well-calibrated NN model we also demand a higher average out-of-sample performance. For example, the model in part (a) has slightly higher average in-sample and out-of-sample Sharpe ratios than the 1/N strategy. The investigation of regression slope statistics can provide valuable additional information.
Part (b) shows two histograms of realized regression slopes $\beta$ of CSCVs computed individually for each random portfolio (as in Figure 1) for both the 1/N strategy and the NN model. In the case of the 1/N strategy, the slope takes exactly the same value for any portfolio if a performance measure is chosen that computes the train–test performance based on averaging the performances of individual CSCV subsets. This figure demonstrates that regression slopes are also typically close to this value for the Sharpe ratio as computed in Section 2.1. A smaller $\beta$ indicates that a strategy has a tendency to perform better out-of-sample under good market conditions but worse under bad market conditions. On the other hand, a larger $\beta$ indicates that a strategy has a tendency to perform better out-of-sample under bad market conditions but worse under good market conditions. From a practical perspective neither situation is desirable because it indicates a bias toward certain market conditions. For a well-calibrated model the slopes’ statistics should concentrate around the slope of the 1/N strategy (see Figure 2(b); cf. Figure 5(b) for an overfitting model). We say that the CSCV performance of strategy I is better than that of strategy II on a list of assets with respect to a certain performance metric if the CSCV regression line of strategy I is shifted upward in parallel with respect to that of strategy II. We say that strategy I has a better RAS-CSCV performance than strategy II with respect to a certain performance metric if the RAS-CSCV regression line of strategy I is shifted upward in parallel with respect to that of strategy II. Figure 1(a) illustrates the CSCV results and Figure 2(a) illustrates the RAS-CSCV results with a better Sharpe ratio performance of an NN model than the 1/N strategy.
2.2 Asset-allocation models under consideration
2.2.1 1/N portfolio allocation rule (1/N strategy)
The 1/N portfolio strategy implements a “naive” diversification rule in that it invests equal weights $w_i = 1/N$ in each of the $N$ assets for every investment period. This strategy ignores all data, with the disadvantage that any potential evidence is ignored too. It has the advantage that it does not involve estimation or optimization or any costly computation that might constitute a burden for larger portfolios. By assigning equal weights to all assets it diminishes the potential for diversification. The larger $N$ is, the harder it becomes for estimation-based models to outperform the 1/N rule. Increasing $N$ implies a larger potential for diversification and simultaneously increases the number of parameters in estimation-based models. This leads to higher data consumption and less robust estimates. The same reasoning applies to the parameters in NN-based strategies. This is supported by our experimental findings (see Section 3). Another important advantage of the 1/N rule is that it has been well studied in experiments, is commonly employed as a benchmark (DeMiguel et al 2009a) and is known to perform well in practice (Benartzi and Thaler 2001; Huberman and Jiang 2006).
2.2.2 Sample-based mean–variance portfolio
The mean–variance portfolio strategy (Markowitz 1952) addresses the situation in which investors’ preferences are described in terms of the mean and the variance of a portfolio. For each investment period the investor aims to select portfolio weights $w = (w_1, \dots, w_N)$ so as to maximize the expected utility:

$$\max_{w}\; w^{\top} \mu - \frac{\gamma}{2}\, w^{\top} \Sigma\, w \quad \text{subject to} \quad \sum_{i=1}^{N} w_i = 1.$$

In this formula, $\mu_i$ denotes the expected return of asset $i$ over the period, $\Sigma_{ij}$ are the covariances between assets $i$ and $j$ for this period and $\gamma$ measures the investor’s risk aversion. There is no short-selling constraint, which means that weights may be negative. The sample-based mean–variance portfolio model (MeanVar) replaces the expected means and covariances by the in-sample analogues $\hat{\mu}$, $\hat{\Sigma}$, which are typically estimated based on a preceding period of price observations. The presence of estimation error in $\hat{\mu}$, $\hat{\Sigma}$ is ignored by this model, which leads to an overall poor performance (Hodges and Brealey 1972; Michaud 1989; Litterman 2003; Best and Grauer 2015). Note that the 1/N model can be viewed as a special case when the expected returns and variances are identical across all assets and the cross-covariances between assets are zero.
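Under the budget constraint alone, the sample-based weights have a closed form. The following sketch (Python with NumPy; the helper name is ours, and the in-sample moments are plain sample estimates) illustrates the MeanVar computation.

```python
import numpy as np

def mean_variance_weights(in_sample_returns, gamma=1.0):
    """Sample-based mean-variance weights without a short-sale constraint.

    in_sample_returns: T x N array of asset returns from the preceding period
    gamma:             risk-aversion parameter
    Solves max_w w'mu - (gamma/2) w'Sigma w  subject to  sum(w) = 1.
    """
    mu = in_sample_returns.mean(axis=0)
    sigma = np.cov(in_sample_returns, rowvar=False)
    ones = np.ones(len(mu))
    sigma_inv = np.linalg.inv(sigma)            # sensitive to estimation error
    # Lagrange multiplier enforcing the budget constraint 1'w = 1
    lam = (gamma - ones @ sigma_inv @ mu) / (ones @ sigma_inv @ ones)
    return sigma_inv @ (mu + lam * ones) / gamma
```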
2.2.3 Sample-based minimum-variance portfolio
The minimum-variance portfolio strategy can be viewed as an edge case of the MeanVar strategy, where the investor aims to reduce variance but ignores the expected returns. For each investment period the investor would like to select the portfolio weights in such a way as to minimize the variance:

$$\min_{w}\; w^{\top} \Sigma\, w \quad \text{subject to} \quad \sum_{i=1}^{N} w_i = 1.$$

The sample-based minimum-variance portfolio model (MinVar) replaces the expected covariances by the in-sample analogues $\hat{\Sigma}$, which are typically estimated based on a preceding period of price observations. Note that this model “lies between” the MeanVar and the 1/N models, in that it arises from the MeanVar model when the investor restricts expected returns to be identical across all assets but still estimates the covariances.
2.2.4 Short-sale-constrained sample-based mean–variance and minimum-variance portfolios
The minimum-variance portfolio strategy with short-selling constraint (MinVar-SC) and the mean–variance portfolio strategy with short-selling constraint (MeanVar-SC) arise from the MinVar and MeanVar strategies, respectively, by imposing the condition that weights must not be negative (that is, $w_i \geq 0$ for all $i$). Imposing a short-sale constraint on MeanVar can be interpreted as a particular form of shrinkage (in the sense of the James–Stein estimator (James and Stein 1992; Efron and Morris 1973)), where the expected return is shifted toward the average. In a similar manner, Jagannathan and Ma (2003) show that a short-sale constraint on MinVar is equivalent to shrinking the elements of the variance–covariance matrix toward the identity, similar to the Ledoit–Wolf type estimators (Ledoit and Wolf 2004a, b). In summary, the effect of a short-selling constraint is “to bring the portfolio closer to 1/N” by shrinking the expected returns or covariances. MinVar-SC performed best among the portfolios considered by DeMiguel et al (2009a) and Jagannathan and Ma (2003).
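With the additional non-negativity condition there is no closed form, and the weights can be obtained numerically. A minimal sketch for MinVar-SC (Python with SciPy; the solver choice and starting point are our own illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def min_variance_sc_weights(in_sample_returns):
    """Short-sale-constrained minimum-variance weights (MinVar-SC).

    in_sample_returns: T x N array of asset returns from the preceding period
    Solves min_w w'Sigma w  subject to  sum(w) = 1 and w >= 0.
    """
    sigma = np.cov(in_sample_returns, rowvar=False)
    n = sigma.shape[0]
    w0 = np.full(n, 1.0 / n)                     # start from the 1/N portfolio
    result = minimize(lambda w: w @ sigma @ w, w0,
                      method="SLSQP",
                      bounds=[(0.0, 1.0)] * n,
                      constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return result.x
```

MeanVar-SC can be treated analogously by adding the (negative) expected-return term to the objective.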
2.2.5 Rolling-window feedforward-NN portfolio
The rolling-window NN portfolio strategy is a parametric function that computes the portfolio weights $w_i$, $i = 1, \dots, N$, for a given investment period from the prices observed on a preceding data window of fixed length. The data window follows the investment period through the available data, where the NN always takes the window as its argument and returns the $N$ portfolio weights. The price data is transformed prior to being passed to the NN. The NN is trained by presenting it with a list of input samples (the training set) and optimizing its tunable parameters to maximize the chosen investment target (“to minimize empirical risk”). Specifically, in period $m$ we use the portfolio weights to compute the daily out-of-sample returns by (2.1) and their empirical mean and standard deviation by (2.2) and (2.3), respectively. The out-of-sample CEQ for this time period is $\hat{\mu}_m - \tfrac{\gamma}{2}\hat{\sigma}_m^{2}$ and the Sharpe ratio is $\hat{\mu}_m / \hat{\sigma}_m$.
The rolling-window feedforward-NN (FF-NN) portfolio model employs a fully connected feedforward NN for weight computation. This type of NN is composed of a concatenation of a number of layers. Each layer consists of an affine transformation, whose parameters can be tuned, followed by a fixed nonlinear entry-wise transformation (Schmidhuber 2015). The architectural details are discussed in Section 3.1. Regularization is a standard technique to combat overfitting. The improvement of performance by the transition from MeanVar to MinVar to MinVar-SC demonstrates the positive impact of shrinkage. This suggests adding a regularizer of the form

$$\lambda \sum_{i=1}^{N} \Bigl( w_i - \frac{1}{N} \Bigr)^{2}$$

to the loss, penalizing deviations of the NN weights from the 1/N allocation; here $\lambda \geq 0$ controls the strength of the regularization. It is well known that portfolio optimization performance can be improved by shrinking the portfolio weights (instead of the moments) in estimation-based strategies (DeMiguel et al 2009b). This new form of regularizer translates the idea of weight-shrinkage from estimation-based strategies to the NN domain. Consistent with reports on estimation-based strategies, we also observe a positive impact of such regularizers for NN portfolios.
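As a concrete illustration, a penalty of this kind can be added to any differentiable training loss (a minimal PyTorch sketch, assuming a quadratic penalty toward the 1/N allocation; the helper name and the loss composition are illustrative).

```python
import torch

def shrinkage_penalty(weights, lam):
    """Quadratic penalty pulling portfolio weights toward the 1/N allocation.

    weights: tensor of shape (batch, N) produced by the network
    lam:     regularization strength (the lambda above)
    """
    n = weights.shape[-1]
    target = torch.full_like(weights, 1.0 / n)
    return lam * ((weights - target) ** 2).sum(dim=-1).mean()

# Illustrative use: total loss = -(investment target) + penalty, eg
#   loss = -ceq(weights, window_returns) + shrinkage_penalty(weights, lam)
```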
2.2.6 Rolling-window LSTM-NN portfolio
The rolling-window LSTM-NN portfolio strategy (Zhang et al 2020) operates in the same way as the FF-NN strategy, except that an LSTM cell (Hochreiter and Schmidhuber 1997; Gers et al 2000) is employed for training. Unlike the FF-NN, which operates as a function on the rolling data window, LSTM cells possess an internal state memory that allows for coordination between samples. This is crucial for tasks with temporal dependencies between inputs (Sutskever et al 2014) (such as language translation, image captioning and text summarization), where many state-of-the-art systems are built on the LSTM architecture. For portfolio optimization the presence of an internal state allows for the identification of model properties that depend on the whole data set and that carry over through individual rolling-window samples. We find that this entails a lower susceptibility to overfitting for LSTM-NN systems than for FF-NN systems.
2.2.7 Short-sale-constrained rolling-window feedforward-NN and LSTM-NN portfolios
The rolling-window feedforward-NN portfolio strategy with short-selling constraint (FF-NN-SC) and the rolling-window LSTM-NN portfolio strategy with short-selling constraint (LSTM-NN-SC) arise by imposing the condition that weights must not be negative (that is, $w_i \geq 0$ for all $i$). In both cases this is achieved by a softmax activation on the output layer.
3 Experimental setup and findings
3.1 Experimental setup
3.1.1 Data set and portfolio selection
Our analysis relies on a data set that contains the price, volume and dividend information of more than 20 000 stocks, exchange-traded funds, exchange-traded notes and other financial products listed on US exchanges between 1962 and 2022 (the “end of day US stock prices” data set published by QuoteMedia, available at https://data.nasdaq.com/databases/EOD/data). For simplicity, we restrict ourselves to assets that have a continuous price history dating back to at least January 2000. We exclude assets that are subject to market events such as delistings, stock splits and mergers and acquisitions. After filtering, 1793 assets are left in the data set. It should be mentioned that this procedure introduces bias (eg, survivorship bias) that affects the overall measured performance scores. As our focus is on the demonstration of mechanisms of overfitting and the comparison of strategies (rather than accurate performance measurement under real-world conditions), we ignore this form of bias in our experiments. For each of our experiments we create three batches of 100 portfolios of uniformly random assets. In all experiments we make use only of the closing price information.
3.1.2 FF-NN model architectures and training
The FF-NN model architecture considered in most of our experiments consists of a total of three layers, each with 64 neurons. The two inner layers are each followed by a nonlinear activation function. The FF-NN without SC has an activation on top of the final layer and its output is additionally divided by the absolute sum of the weights. For the FF-NN-SC model a softmax activation on the final layer is used instead, ensuring that weights are positive and sum to 1. In our experiments, we employ two regularized loss functions with weight regularization:

$$\mathcal{L}_{\mathrm{CEQ}} = -\widehat{\mathrm{CEQ}} + \lambda \sum_{i=1}^{N} \Bigl( w_i - \frac{1}{N} \Bigr)^{2} \qquad \text{and} \qquad \mathcal{L}_{\mathrm{SR}} = -\widehat{\mathrm{SR}} + \lambda \sum_{i=1}^{N} \Bigl( w_i - \frac{1}{N} \Bigr)^{2}$$

(see Section 2.2.5 for details). The loss is computed on a rolling window of fixed length. We generally exclude windows that cover a CSCV split as they would represent unrealistic market behavior along the split. We compare the performance for the levels of regularization $\lambda$ listed in Table 1 (ranging from 0 to 0.5). Adam (Kingma and Ba 2017) is used as the optimization algorithm, with an initial learning rate of 0.001 and batches of size 64. Unless stated otherwise, the FF-NN models are trained for a maximum of 25 epochs, with an early stopping mechanism that halts training as soon as the loss increases on the test set. Restricting the number of training epochs and using early stopping are well-established methods to combat overfitting (Yao et al 2007; Smale and Zhou 2007). Their effect is similar to loss regularization because they prevent the NN from learning the training data too closely. The data consumption of an NN is related to its architecture: fewer layers and neurons need less data for training. Restricting the number of parameters in an NN is another well-established method to prevent overfitting. It has a similar impact to adding a regularizer to the loss function (Girosi et al 1995). Within the boundaries of our statistical assessment the specific FF-NN composition initially has little impact, but as the number of neurons increases, the CSCV shows increasing signs of overfitting. We therefore mostly focus on the regularization parameter $\lambda$ in what follows.
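A minimal PyTorch sketch of the FF-NN-SC configuration and the training loop with early stopping is given below; the inner activation (ReLU), the exact early stopping criterion and the data-loading interface are illustrative assumptions rather than the precise setup used in the experiments.

```python
import torch
from torch import nn

class FFNNSC(nn.Module):
    """Feedforward portfolio network with short-sale constraint (illustrative).

    Three layers of 64 neurons; the softmax output enforces
    w_i >= 0 and sum_i w_i = 1.  The ReLU activation is an assumption.
    """
    def __init__(self, n_inputs, n_assets, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_assets),
        )

    def forward(self, x):                    # x: (batch, n_inputs) flattened window
        return torch.softmax(self.net(x), dim=-1)

def train_with_early_stopping(model, train_batches, test_batches, loss_fn,
                              max_epochs=25, lr=1e-3):
    """Adam training that halts once the loss on the test set increases."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_test_loss = float("inf")
    for _ in range(max_epochs):
        model.train()
        for windows, future_returns in train_batches:
            opt.zero_grad()
            loss = loss_fn(model(windows), future_returns)
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            test_loss = sum(loss_fn(model(w), r).item() for w, r in test_batches)
        if test_loss > best_test_loss:       # early stopping (assumed criterion)
            break
        best_test_loss = test_loss
    return model
```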
3.1.3 LSTM-NN model architectures and training
The LSTM-NN model architectures investigated here build on the design proposed by Zhang et al (2020). The LSTM-NN model consists of an LSTM cell of hidden dimension 64 followed by one linear layer of 64 neurons and an activation function. The LSTM-NN without SC has an activation on top of the linear layer and its output is additionally divided by the absolute sum of the weights. In the LSTM-NN-SC model a softmax activation is instead applied on top of the final layer, ensuring that weights are positive and sum to 1. As for the FF-NN, we restrict the number of neurons and we apply early stopping and weight regularization. As before, the loss is computed on a rolling window of fixed length. We compare the performance of our architecture for the levels of regularization $\lambda$ listed in Table 1. Adam (Kingma and Ba 2017) is used as the optimization algorithm, with an initial learning rate of 0.001 and batches of size 64. Unless stated otherwise, the LSTM-NN models are trained for a maximum of 25 epochs, with an early stopping mechanism that halts training as soon as the loss increases on the test set. We note that Zhang et al (2020) report training for 100 epochs without early stopping. For comparison we have included this setting (LSTM-NN-SC I) in our experiments.
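The corresponding LSTM-based model can be sketched analogously (PyTorch; a simplified illustration, with the mapping from the LSTM state to the weights reduced to a single linear layer).

```python
import torch
from torch import nn

class LSTMNNSC(nn.Module):
    """LSTM portfolio network with short-sale constraint (illustrative sketch).

    An LSTM cell of hidden dimension 64 followed by a linear layer; the softmax
    output enforces w_i >= 0 and sum_i w_i = 1, as in the LSTM-NN-SC models.
    """
    def __init__(self, n_assets, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_assets, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_assets)

    def forward(self, x):                    # x: (batch, window_length, n_assets)
        out, _ = self.lstm(x)                # hidden state for every day of the window
        return torch.softmax(self.head(out[:, -1, :]), dim=-1)   # state of the last day
```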
3.1.4 Summary of NN models
Table 1 summarizes the architectures tested.
Model ID | Max. train epochs | Early stop. | Short-sale const. | Regularizer λ | Loss
---|---|---|---|---|---
FF-NN I | 25 | Yes | No | 0.000 | CEQ |
FF-NN II | 25 | Yes | No | 0.001 | CEQ |
FF-NN III | 25 | Yes | No | 0.010 | CEQ |
FF-NN IV | 25 | Yes | No | 0.100 | CEQ |
FF-NN V | 25 | Yes | No | 0.500 | CEQ |
FF-NN-SC I | 25 | Yes | Yes | 0.000 | CEQ |
FF-NN-SC II | 25 | Yes | Yes | 0.001 | CEQ |
FF-NN-SC III | 25 | Yes | Yes | 0.010 | CEQ |
FF-NN-SC IV | 25 | Yes | Yes | 0.100 | CEQ |
FF-NN-SC V | 25 | Yes | Yes | 0.500 | CEQ |
LSTM-NN I | 25 | Yes | No | 0.0000 | CEQ |
LSTM-NN II | 25 | Yes | No | 0.0010 | CEQ |
LSTM-NN III | 25 | Yes | No | 0.0100 | CEQ |
LSTM-NN IV | 25 | Yes | No | 0.1000 | CEQ |
LSTM-NN V | 25 | Yes | No | 0.5000 | CEQ |
LSTM-NN-SC I | 100 | No | Yes | 0.0000 | CEQ |
LSTM-NN-SC II | 25 | Yes | Yes | 0.0000 | CEQ |
LSTM-NN-SC III | 25 | Yes | Yes | 0.0001 | CEQ |
LSTM-NN-SC IV | 25 | Yes | Yes | 0.0010 | CEQ |
LSTM-NN-SC V | 25 | Yes | Yes | 0.0100 | CEQ |
LSTM-NN-SC VI | 25 | Yes | Yes | 0.1000 | CEQ |
LSTM-NN-SC VII | 25 | Yes | Yes | 0.5000 | CEQ |
LSTM-NN-SC VIII | 25 | Yes | Yes | 0.0000 | SR |
LSTM-NN-SC IX | 25 | Yes | Yes | 0.0001 | SR |
LSTM-NN-SC X | 25 | Yes | Yes | 0.0010 | SR |
LSTM-NN-SC XI | 25 | Yes | Yes | 0.0100 | SR |
LSTM-NN-SC XII | 25 | Yes | Yes | 0.1000 | SR |
3.2 Experimental findings
Three types of experiments are conducted in this section. Section 3.2.1 shows that a careless application of holdout is insufficient to control for overfitting. Section 3.2.2 is concerned with the selection of NN model configurations via the measurement of performance metrics and RAS-CSCV. Section 3.2.3 presents benchmarking experiments comparing the performance of selected NN strategies with that of estimation-based baseline strategies.
3.2.1 The triviality of overfitting a portfolio model
We begin our experiments with a warning. Simple examples show that NN-based portfolio strategies can be tuned to deliver deceptive in-sample and out-of-sample performance metrics when the holdout procedure is employed on a specific set of assets. Although this procedure is commonly used for validation, it allows almost no conclusions to be drawn about the true performance of the respective strategies. We validate this assertion by reporting the respective CSCV and RAS-CSCV performance scores. We split the data set into training and test sets of roughly equal size lasting, respectively, from January 2000 to December 2010 and from January 2011 to December 2021.
- (1)
We consider a large FF-NN-SC model with hidden layers of 256 neurons and no early stopping. We choose a list of 10 assets at random. (The chosen assets are “ADI”, “R”, “INFY”, “AVB”, “AMZN”, “CTIC”, “PRA”, “AMGN”, “TRMK”, “BKN”.) For this setup the 1/N strategy has an in-sample Sharpe ratio of 0.256 and an out-of-sample Sharpe ratio of 0.637. We stop training after 25 epochs. After training, the training set Sharpe ratio is 1.482 and the test set Sharpe ratio is 0.717. If the investment strategy were chosen based on those figures alone, then the NN strategy would be preferable. Conducting CSCV reveals an average in-sample and out-of-sample Sharpe ratio of 0.422 for the 1/N strategy. The NN delivers an average in-sample Sharpe ratio of 1.561 but an average out-of-sample Sharpe ratio of only 0.044. All CSCV pairs of (in-sample/out-of-sample) Sharpe ratios are depicted in part (a) of Figure 3, where an overall right and downward shift can be observed compared with the 1/N strategy. In other words, in comparison the trained NN performs better on training sets and worse on test sets, the good-looking Sharpe pair of (1.482, 0.717) being a result of overfitting. Even more convincing is the RAS-CSCV scatter plot in part (b), which shows that the NN model outperforms the 1/N strategy in only 6.5% of CSCV splits and portfolios and has a mean out-of-sample Sharpe ratio of 0.123 versus 0.584 for the 1/N strategy.
- (2)
We consider the model LSTM-NN-SC I, which is a short-sale-constrained LSTM-NN model trained for 100 epochs without early stopping. We choose a list of 10 assets at random. (The chosen assets are “AEG”, “NVDA”, “SIG”, “QQQ”, “MTG”, “SRI”, “HSIC”, “REG”, “CCI”, “BCO”.) For this setup the 1/N strategy has an in-sample Sharpe ratio of 0.308 and an out-of-sample Sharpe ratio of 0.700. After training, the training set Sharpe ratio is 0.638 while the test set Sharpe ratio is 0.829. If the investment strategy were chosen based on those figures alone, then the NN strategy would be preferable. Conducting CSCV reveals an average in-sample and out-of-sample Sharpe ratio of 0.500 for the 1/N strategy. The NN delivers an average in-sample Sharpe ratio of 0.791 but an average out-of-sample Sharpe ratio of only 0.351. All CSCV pairs of (in-sample/out-of-sample) Sharpe ratios are depicted in part (a) of Figure 4, where an overall right and downward shift can be observed compared with the 1/N strategy. In other words, in comparison the trained NN performs better on training sets and worse on test sets, the good-looking Sharpe pair of (0.638, 0.829) being a result of overfitting. Even more convincing is the RAS-CSCV scatter plot in part (b), which shows that the NN model outperforms the 1/N strategy in only 16.5% of CSCV splits and portfolios, and has a mean out-of-sample Sharpe ratio of 0.332 versus 0.584 for the 1/N strategy.
3.2.2 Model selection via RAS-CSCV
Table 2

Model | Mean β | Median β | % (β) | Mean SR | Median SR | % (SR outp.)
---|---|---|---|---|---|---
FF-NN I | 0.120 | 0.111 | 95.6 | 0.048 | 0.259 | 16.3 |
FF-NN II | 0.029 | 0.003 | 94.3 | 0.146 | 0.311 | 19.0 |
FF-NN III | 0.146 | 0.164 | 90.6 | 0.210 | 0.374 | 19.0 |
FF-NN IV | 0.317 | 0.369 | 85.3 | 0.196 | 0.396 | 19.1 |
FF-NN V | 0.400 | 0.458 | 82.0 | 0.205 | 0.399 | 19.7 |
FF-NN-SC I | 0.427 | 0.368 | 91.3 | 0.587 | 0.581 | 43.3 |
FF-NN-SC II | 0.800 | 0.890 | 74.3 | 0.589 | 0.578 | 50.8 |
FF-NN-SC III | 0.961 | 0.967 | 59.0 | 0.585 | 0.568 | 52.5 |
FF-NN-SC IV | 0.953 | 0.965 | 59.0 | 0.586 | 0.569 | 53.1 |
FF-NN-SC V | 0.959 | 0.975 | 63.0 | 0.584 | 0.565 | 49.1 |
LSTM-NN I | 0.197 | 0.218 | 83.6 | 0.447 | 0.522 | 41.0 |
LSTM-NN II | 0.369 | 0.445 | 83.6 | 0.359 | 0.476 | 31.5 |
LSTM-NN III | 0.444 | 0.488 | 84.0 | 0.448 | 0.490 | 27.2 |
LSTM-NN IV | 0.672 | 0.745 | 71.0 | 0.468 | 0.506 | 31.0 |
LSTM-NN V | 0.716 | 0.773 | 80.0 | 0.479 | 0.517 | 29.2 |
LSTM-NN-SC I | 0.223 | 0.192 | 88.0 | 0.332 | 0.332 | 16.5 |
LSTM-NN-SC II | 0.777 | 0.868 | 75.6 | 0.597 | 0.583 | 54.1 |
LSTM-NN-SC III | 0.826 | 0.904 | 72.3 | 0.599 | 0.585 | 57.4 |
LSTM-NN-SC IV | 0.931 | 0.977 | 60.3 | 0.591 | 0.578 | 58.8 |
LSTM-NN-SC V | 0.985 | 0.992 | 45.3 | 0.588 | 0.567 | 59.8 |
LSTM-NN-SC VI | 0.985 | 0.989 | 46.3 | 0.586 | 0.568 | 54.2 |
LSTM-NN-SC VII | 0.985 | 0.990 | 48.0 | 0.585 | 0.568 | 52.1 |
LSTM-NN-SC VIII | 0.847 | 0.873 | 78.3 | 0.602 | 0.574 | 47.3 |
LSTM-NN-SC IX | 0.861 | 0.891 | 75.6 | 0.601 | 0.575 | 48.5 |
LSTM-NN-SC X | 0.850 | 0.882 | 78.0 | 0.602 | 0.575 | 49.3 |
LSTM-NN-SC XI | 0.884 | 0.918 | 71.3 | 0.604 | 0.576 | 50.4 |
LSTM-NN-SC XII | 0.958 | 0.970 | 59.0 | 0.603 | 0.580 | 60.1 |
For each model configuration we have recorded the statistics of out-of-sample performance metrics measured within the RAS-CSCV procedure. This includes the out-of-sample performance metrics of each CSCV split and of each randomly generated portfolio as well as the statistics of the linear regression slopes. Table 2 summarizes the key test set performance indicators for different NN model choices, training procedures and regularizers. We report the mean and median of out-of-sample Sharpe ratios (the “Mean SR” and “Median SR” columns) and the mean and median of the slope statistics β (the “Mean β” and “Median β” columns), leaving the other performance metrics until the benchmark experiments in Section 3.2.3. As a baseline we count for each model configuration the number of CSCV splits and portfolios where the strategy has a higher out-of-sample Sharpe ratio than the 1/N strategy (the “% (SR outp.)” column). To monitor for shifts in CSCV plots we also count how often each strategy curve has a steeper slope than the 1/N strategy curve (the “% (β)” column).
The measured performance indicators demonstrate the importance of model choice, weight shrinkage and regularization. First, we observe that FF-NN-SC architectures show inferior performance as measured by the out-of-sample Sharpe ratio (the “Mean SR”, “Median SR” and “% (SR outp.)” columns) and a stronger tendency for overfitting compared with LSTM-NN-SC architectures. This is equally true if performance is measured by the CEQ and the Sortino ratio. This observation is consistent with the reports of Zhang et al (2020, 2021).
Second, consistent with reports on estimation-based strategies, the performance of models without short-sale constraints (ie, without shrinkage) is much worse than that of their constrained counterparts. It turned out to be difficult to stabilize the unconstrained NN training process due to a tendency to “reinforce” weights of large magnitude in the course of training; that is, large/small weights tend to become larger/smaller, until in certain cases training breaks down (with “NaN” loss, where NaN stands for “not a number”). (For CSCV we assigned an arbitrary fixed Sharpe ratio whenever a model was unable to finish training.)
Third, we observe that the correct choice of the regularization parameter λ and early stopping mechanisms is crucial for performance. This can be seen from the performance of the LSTM-NN-SC models reported in the “% (SR outp.)” column. The LSTM-NN-SC I model has no early stopping mechanism; the models LSTM-NN-SC II–VII have varying levels of regularization λ. At first the models’ performance increases with λ, but then it drops again at high levels of regularization as the models become comparable with the 1/N strategy.
For illustration, we analyze some of the models in view of overfitting. The FF-NN-SC I row of Table 2 shows that the unregularized FF-NN-SC model has a Sharpe ratio that is better than that of the 1/N strategy in 43.3% of cases, while the slope coefficient β is larger in 91.3% of cases. This indicates a tendency of this model to overfit to the training set. For most combinations of train–test sets and assets the out-of-sample Sharpe ratios fall behind those of 1/N. Due to training, an overall “right-shift” (compared with 1/N) of the model’s (in-sample/out-of-sample) scatters can be observed. At the same time, the larger β values indicate that training has a smaller impact on testing Sharpe ratios than on training Sharpe ratios. The behavior of the LSTM-NN-SC model without regularization and 100 epochs of training is similar. The outcomes of RAS-CSCV for this model are reported in Figure 5. Part (a) compares Sharpe ratio scatters from the NN model and the 1/N model. Overall, the NN model scatters are shifted to the right and down, indicating a higher in-sample and a lower out-of-sample performance (ie, overfitting). A histogram of regression slopes for the individual CSCVs is given in part (b). Regression slopes show an extreme range, which indicates a lack of dependency between model performance and overall market conditions. In summary, the overall RAS-CSCV performance appears to be unsatisfactory. For comparison, Figure 1(a) shows the results of CSCV and Figure 2 shows the results of RAS-CSCV for the well-calibrated LSTM-NN-SC IV model. A linear regression on top of all scatters demonstrates that the NN out-of-sample Sharpe ratios are shifted upward by a small but consistent margin compared with the 1/N strategy. This shows that LSTM-NN-SC IV indeed performs better than 1/N in 58.8% of the cases, with a mean Sharpe ratio increase of 1.2%. The small size of this margin is consistent with previous reports that it is difficult to perform better than 1/N.
Figure 6 illustrates the impact of weight regularization, showing CSCV results for the LSTM-NN-SC model at different levels of λ. Two (random) lists of assets are selected for this comparison, where the first column shows the CSCV for the first list of assets at varying λ (the chosen assets are “LSTA”, “UEIC”, “SKX”, “INVE”, “MKL”) and the second column shows the CSCV for the second list of assets at varying λ (the chosen assets are “TUP”, “LSTA”, “TELL”, “TCI”, “AMG”, “BLFS”, “SPY”, “BRO”, “BKSC”, “MTR”, “JPM”, “ATNI”, “CSWC”, “CBD”, “FUSB”, “CEF”, “FRT”, “BHB”, “MAR”, “CLS”). Early stopping was always activated as an additional regularizer (and to improve performance). Some train–test splits in the CSCV of the unregularized model show very high out-of-sample Sharpe ratios (see part (a)), but this is deceptive. In part (a) the highest out-of-sample Sharpe ratio occurs at the split BC–AD. A certain level of scepticism is in order since for the split AC–BD the NN performs better in-sample than the 1/N strategy, but not out-of-sample. Considering the list of assets in the first column and a split with a high out-of-sample Sharpe ratio, we might be tempted to conclude that the unregularized model performs well. However, part (b) of Figure 6 demonstrates that it performs worse than the 1/N strategy on a different list of assets. Note that in both cases the properly regularized model performs better than the 1/N strategy (see parts (e) and (f)).
3.2.3 Chosen model comparison against benchmarks
Table 3

Model | ER | SD | Sharpe | Sortino | CEQ | DD
---|---|---|---|---|---|---
LSTM-NN-SC IV | 0.950 | 0.261 | 0.591 | 0.818 | 0.129 | 0.189 |
LSTM-NN-SC V | 0.908 | 0.253 | 0.588 | 0.813 | 0.126 | 0.185 |
LSTM-NN-SC XI | 0.853 | 0.234 | 0.604 | 0.801 | 0.116 | 0.179 |
LSTM-NN-SC XII | 0.880 | 0.241 | 0.603 | 0.815 | 0.120 | 0.181 |
1/N | 0.895 | 0.252 | 0.584 | 0.806 | 0.125 | 0.185 |
RND | 0.867 | 0.267 | 0.536 | 0.740 | 0.132 | 0.196 |
MinVar-SC | 0.656 | 0.186 | 0.625 | 0.791 | 0.093 | 0.152 |
MeanVar-SC | 0.417 | 0.351 | 0.223 | 0.287 | 0.175 | 0.291 |
MinVar | 0.605 | 0.194 | 0.547 | 0.707 | 0.097 | 0.156 |
MeanVar | 0.598 | 0.571 | 0.141 | 0.155 | 0.286 | 0.494 |
Complementing the discussion in Section 3.2.2, we provide a detailed comparison of the performance of selected NN models with that of the estimation-based strategies. Table 3 summarizes the performance metrics. For each configuration we report the out-of-sample expected return, the standard deviation, the Sharpe ratio, the Sortino ratio, the CEQ and the DD. The figures signify the respective metrics averaged across all CSCV splits and all 300 portfolio compositions. We observe that among the NN strategies, the LSTM-NN-SC IV and the LSTM-NN-SC V models have the best overall scores. Both models perform better than the 1/N strategy in all performance metrics. They have comparable average Sharpe ratios, but LSTM-NN-SC V has a higher Sharpe ratio than the 1/N strategy slightly more frequently. The highest average expected returns and Sortino ratios are achieved by LSTM-NN-SC IV. We observe that among the estimation-based strategies, the MinVar-SC strategy performs the best, which is consistent with the literature (see, for example, Kritzman et al 2010). MinVar-SC delivers the smallest out-of-sample standard deviation among all strategies, even compared with the unconstrained MinVar model. MinVar-SC also performs best in terms of the Sharpe and CEQ scores, falling behind LSTM-NN-SC IV only in terms of expected returns and the Sortino ratio. It appears that LSTM-NN-SC strategies are better overall at predicting expected returns, and they take positions with higher variance. In summary, none of the tested NN portfolio models is consistently better than the MinVar-SC strategy.
Table 4

Model | ER | SD | Sharpe | Sortino | CEQ | DD
---|---|---|---|---|---|---
LSTM-NN-SC I | 0.606 | 0.348 | 0.330 | 0.443 | 0.173 | 0.264 |
LSTM-NN-SC IV | 0.857 | 0.281 | 0.501 | 0.698 | 0.140 | 0.203 |
LSTM-NN-SC V | 0.838 | 0.281 | 0.490 | 0.683 | 0.140 | 0.203 |
LSTM-NN-SC XI | 0.800 | 0.255 | 0.526 | 0.698 | 0.127 | 0.195 |
LSTM-NN-SC XII | 0.831 | 0.267 | 0.516 | 0.701 | 0.133 | 0.197 |
1/N | 0.834 | 0.282 | 0.487 | 0.678 | 0.141 | 0.204 |
RND | 0.771 | 0.304 | 0.420 | 0.579 | 0.152 | 0.223 |
MinVar-SC | 0.646 | 0.216 | 0.513 | 0.659 | 0.107 | 0.172 |
MeanVar-SC | 0.267 | 0.402 | 0.142 | 0.185 | 0.201 | 0.337 |
MinVar | 0.623 | 0.218 | 0.488 | 0.630 | 0.109 | 0.174 |
MeanVar | 0.337 | 0.553 | 0.084 | 0.083 | 0.277 | 0.484 |
Table 5

Model | ER | SD | Sharpe | Sortino | CEQ | DD
---|---|---|---|---|---|---
LSTM-NN-SC I | 0.607 | 0.359 | 0.325 | 0.443 | 0.179 | 0.269 |
LSTM-NN-SC IV | 0.921 | 0.252 | 0.590 | 0.821 | 0.125 | 0.183 |
LSTM-NN-SC V | 0.898 | 0.248 | 0.585 | 0.816 | 0.123 | 0.181 |
LSTM-NN-SC XI | 0.849 | 0.232 | 0.599 | 0.800 | 0.115 | 0.176 |
LSTM-NN-SC XII | 0.874 | 0.236 | 0.604 | 0.821 | 0.118 | 0.177 |
1/N | 0.892 | 0.248 | 0.583 | 0.813 | 0.123 | 0.181 |
RND | 0.870 | 0.260 | 0.542 | 0.751 | 0.129 | 0.190 |
MinVar-SC | 0.679 | 0.179 | 0.652 | 0.834 | 0.089 | 0.142 |
MeanVar-SC | 0.478 | 0.345 | 0.242 | 0.309 | 0.172 | 0.287 |
MinVar | 0.630 | 0.185 | 0.583 | 0.760 | 0.092 | 0.146 |
MeanVar | 0.441 | 0.564 | 0.119 | 0.124 | 0.282 | 0.493 |
Model | ER | SD | Sharpe | Sortino | CEQ | DD |
---|---|---|---|---|---|---|
LSTM-NN-SC I | 0.725 | 0.386 | 0.340 | 0.473 | 0.193 | 0.284 |
LSTM-NN-SC IV | 1.07 | 0.250 | 0.684 | 0.935 | 0.124 | 0.184 |
LSTM-NN-SC V | 0.989 | 0.230 | 0.689 | 0.940 | 0.115 | 0.173 |
LSTM-NN-SC XI | 0.912 | 0.217 | 0.688 | 0.907 | 0.108 | 0.168 |
LSTM-NN-SC XII | 0.936 | 0.220 | 0.691 | 0.920 | 0.110 | 0.169 |
1/N | 0.961 | 0.227 | 0.685 | 0.927 | 0.113 | 0.172 |
RND | 0.952 | 0.235 | 0.653 | 0.893 | 0.117 | 0.176 |
MinVar-SC | 0.643 | 0.179 | 0.711 | 0.879 | 0.081 | 0.141 |
MeanVar-SC | 0.505 | 0.345 | 0.285 | 0.368 | 0.153 | 0.250 |
MinVar | 0.563 | 0.179 | 0.586 | 0.732 | 0.089 | 0.150 |
MeanVar | 0.823 | 0.597 | 0.222 | 0.257 | 0.299 | 0.505 |
The RND row shows the performance indicators for randomly assigned short-sale constrained weights. These figures indicate roughly “how likely it is to choose a good portfolio by chance”. For NN training it is desirable to have a specific loss signal, but the relatively high performance metrics of RND (and of 1/N) imply that loss signals computed from these metrics can be deceptive: even an invalid model may achieve low losses. As expected, the performance scores of the MeanVar strategy are much lower than those of the other strategies. Tables 4–6 resolve the averaged quantities by portfolio size. It is interesting to observe that the portfolio variances decrease with portfolio size for the estimation-based models and for LSTM-NN-SC IV and V, but increase for the overfitting model LSTM-NN-SC I.
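To illustrate the RND baseline, the snippet below draws one set of random short-sale constrained weights. The flat Dirichlet distribution is an assumption made for this sketch; it guarantees non-negative weights that sum to one, but it is not necessarily the exact sampling scheme used for RND.

```python
import numpy as np

def random_long_only_weights(n_assets, rng=None):
    """One random short-sale constrained allocation in the spirit of the RND baseline.

    A flat Dirichlet draw is assumed here: weights are non-negative and sum to one.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(np.ones(n_assets))

# Example: one random long-only allocation across ten assets
w = random_long_only_weights(10)
```

Evaluating such random allocations across the same splits and portfolio compositions yields the RND rows of the tables above.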
4 Conclusions
We investigated the use of NNs for the optimal allocation of wealth among given assets. We argued that a combination of factors makes NN-based portfolio optimization considerably vulnerable to overfitting. These factors include the typically small quantity of historic market data available, a low signal-to-noise ratio, the specific procedure of NN training and calibration, and a relatively high chance of picking a portfolio with high out-of-sample performance purely by chance. This motivated a comparison of NN portfolios with classical strategies under dedicated overfitting control mechanisms. To control for overfitting we introduced the RAS-CSCV procedure: to eliminate bias from the choice of the holdout period, the training and testing procedures were iterated over different combinations of time periods, and to eliminate bias from the choice of assets, the cross-validation procedure was enhanced by choosing portfolios of different sizes at random. Within this scheme we reported the empirical performance of almost 30 NN models. Our experiments also covered many variations of the reported models (eg, with varying NN structures, initial learning rates, batch sizes and regularizers), which we do not report separately because their results closely resemble those of the presented models. We found that regularization plays a particularly important role in controlling overfitting for portfolio optimization. Three types of regularization are employed in our experiments.
- (1)
We chose comparatively small NNs.
- (2)
We stopped the training procedure early, after a maximum of 25 periods.
- (3)
We suggested a weight-regularizer with a structure inspired by weight-shrinkage estimators for estimation-based portfolios (a minimal sketch of one such penalty follows this list).
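The sketch below shows what such a shrinkage-inspired penalty could look like. The quadratic form, the 1/N shrinkage target and the strength parameter `delta` are illustrative assumptions rather than the exact regularizer used in our experiments.

```python
import numpy as np

def shrinkage_penalty(weights, delta=0.1):
    """Quadratic penalty pulling the NN's portfolio weights toward the 1/N anchor.

    Mirrors the logic of shrinkage estimators for estimation-based portfolios
    (blend a noisy quantity with a stable target); the quadratic form, the 1/N
    target and the strength ``delta`` are assumptions for this sketch.
    """
    w = np.asarray(weights, dtype=float)
    anchor = np.full_like(w, 1.0 / w.size)      # equally weighted target
    return delta * np.sum((w - anchor) ** 2)    # term added to the training loss
```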
For our data set we observe that, across portfolio compositions, chosen holdout periods and performance metrics, well-chosen NN portfolios indeed perform better out-of-sample than the 1/N benchmark. However, none of the tested NN models performs consistently better out-of-sample than the short-sale constrained minimum-variance portfolio.
An important advantage of NN strategies is that they allow for the flexible integration of additional market features beyond the historic asset data. Including informative features is likely to improve their performance, so systems with multiple data sources are more likely to be employed in practice. Yet the aforementioned overfitting mechanisms persist even for complex systems. In fact, overfitting control is even more important for larger systems, because the sheer number of individual components can divert attention away from specific overfitting mechanisms. The drivers of overfitting that we have presented are specific to the portfolio optimization task and represent a risk for machine learning systems beyond the application of NNs. We therefore expect RAS-CSCV to be similarly important for the validation and overfitting control of other machine learning systems (such as kernel regression or random forests). In general, portfolio optimization comes with a significant risk of overfitting, and performance figures validated using only the common holdout procedure should therefore be treated with considerable caution.
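To make the validation scheme concrete, the sketch below generates train/test/asset index triples in the spirit of RAS-CSCV. The number of time blocks, the range of portfolio sizes and the decision to redraw asset subsets for every split are assumptions chosen for illustration; only the figure of 300 portfolio compositions follows the setup reported above.

```python
import numpy as np
from itertools import combinations

def ras_cscv_splits(n_periods, n_assets, n_blocks=8,
                    n_portfolios=300, min_size=5, max_size=20, seed=0):
    """Yield (train, test, assets) index triples in the spirit of RAS-CSCV.

    The time axis is cut into ``n_blocks`` contiguous blocks; every choice of
    half of the blocks serves once as the training set and its complement as
    the test set (combinatorially symmetric cross-validation). For each split,
    random asset subsets of random size stabilize the portfolio choice.
    Block count, subset sizes and redrawing subsets per split are assumptions.
    """
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(n_periods), n_blocks)
    for train_ids in combinations(range(n_blocks), n_blocks // 2):
        test_ids = [b for b in range(n_blocks) if b not in train_ids]
        train = np.concatenate([blocks[b] for b in train_ids])
        test = np.concatenate([blocks[b] for b in test_ids])
        for _ in range(n_portfolios):
            size = rng.integers(min_size, max_size + 1)       # random portfolio size
            assets = rng.choice(n_assets, size=size, replace=False)
            yield train, test, assets
```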
Declaration of interest
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.
References
- Bailey, D. H., Borwein, J., López de Prado, M., and Zhu, Q. J. (2014). Pseudo-mathematics and financial charlatanism: the effects of backtest overfitting on out-of-sample performance. Notices of the American Mathematical Society 61(5), 458–471 (https://doi.org/10.1090/noti1105).
- Bailey, D. H., Borwein, J., López de Prado, M., and Zhu, Q. J. (2017). The probability of backtest overfitting. The Journal of Computational Finance 20(4), 39–69 (https://doi.org/10.21314/JCF.2016.322).
- Barry, C. B. (1974). Portfolio analysis under uncertain means, variances, and covariances. Journal of Finance 29(2), 515–522 (https://doi.org/10.1111/j.1540-6261.1974.tb03064.x).
- Bawa, V. S., Brown, S. J., and Klein, R. (1979). Estimation Risk and Optimal Portfolio Choice. North-Holland.
- Benartzi, S., and Thaler, R. H. (2001). Naive diversification strategies in defined contribution saving plans. American Economic Review 91(1), 79–98 (https://doi.org/10.1257/aer.91.1.79).
- Best, M. J., and Grauer, R. R. (1992). Positively weighted minimum-variance portfolios and the structure of asset expected returns. Journal of Financial and Quantitative Analysis 27(4), 513–537 (https://doi.org/10.2307/2331138).
- Best, M. J., and Grauer, R. R. (1991). On the sensitivity of mean–variance-efficient portfolios to changes in asset means: some analytical and computational results. Review of Financial Studies 4(2), 315–342 (https://doi.org/10.1093/rfs/4.2.315).
- Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer.
- Bloomfield, T., Leftwich, R., and Long, J. B., Jr. (1977). Portfolio strategies and performance. Journal of Financial Economics 5(2), 201–218 (https://doi.org/10.1016/0304-405X(77)90018-6).
- Chan, L. K., Karceski, J., and Lakonishok, J. (1999). On portfolio optimization: forecasting covariances and choosing the risk model. Working Paper 7039, National Bureau of Economic Research (https://doi.org/10.3386/w7039).
- Chopra, V. K. (1993). Improving optimization. Journal of Investing 2(3), 51–59 (https://doi.org/10.3905/joi.2.3.51).
- Cireşan, D. C., Meier, U., and Schmidhuber, J. (2012). Transfer learning for Latin and Chinese characters with deep neural networks. In 2012 International Joint Conference on Neural Networks, pp. 1–6. IEEE Press, Piscataway, NJ (https://doi.org/10.1109/IJCNN.2012.6252544).
- DeMiguel, V., Garlappi, L., and Uppal, R. (2009a). Optimal versus naive diversification: how inefficient is the 1/N portfolio strategy? Review of Financial Studies 22(5), 1915–1953 (https://doi.org/10.1093/rfs/hhm075).
- DeMiguel, V., Garlappi, L., Nogales, F. J., and Uppal, R. (2009b). A generalized approach to portfolio optimization: improving performance by constraining portfolio norms. Management Science 55(5), 798–812 (https://doi.org/10.1287/mnsc.1080.0986).
- Du, J. (2022). Mean–variance portfolio optimization with deep learning based-forecasts for cointegrated stocks. Expert Systems with Applications 201, Paper 117005 (https://doi.org/10.1016/j.eswa.2022.117005).
- Economist (2013). Trouble at the lab. Economist, October 18. URL: http://www.economist.com/briefing/2013/10/18/trouble-at-the-lab.
- Efron, B., and Morris, C. (1973). Stein’s estimation rule and its competitors: an empirical Bayes approach. Journal of the American Statistical Association 68(341), 117–130 (https://doi.org/10.1080/01621459.1973.10481350).
- Epstein, I., and Simon, M. (eds) (1960). Hebrew–English Edition of the Babylonian Talmud. Soncino Press, London.
- Fama, E. F., and French, K. R. (2004). The capital asset pricing model: theory and evidence. Journal of Economic Perspectives 18(3), 25–46 (https://doi.org/10.1257/0895330042162430).
- Freitas, F. D., De Souza, A. F., and de Almeida, A. R. (2009). Prediction-based portfolio optimization model using neural networks. Neurocomputing 72(10–12), 2155–2170 (https://doi.org/10.1016/j.neucom.2008.08.019).
- Frost, P. A., and Savarino, J. E. (1988). For better performance: constrain portfolio weights. Journal of Portfolio Management 15(1), 29–34 (https://doi.org/10.3905/jpm.1988.409181).
- Garlappi, L., Uppal, R., and Wang, T. (2007). Portfolio selection with parameter and model uncertainty: a multi-prior approach. Review of Financial Studies 20(1), 41–81 (https://doi.org/10.1093/rfs/hhl003).
- Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: continual prediction with LSTM. Neural Computation 12(10), 2451–2471 (https://doi.org/10.1162/089976600300015015).
- Girosi, F., Jones, M., and Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation 7(2), 219–269 (https://doi.org/10.1162/neco.1995.7.2.219).
- Goldfarb, D., and Iyengar, G. (2003). Robust portfolio selection problems. Mathematics of Operations Research 28(1), 1–38 (https://doi.org/10.1287/moor.28.1.1.14260).
- Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences 44(1), 1–12 (https://doi.org/10.1021/ci0342472).
- Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8), 1735–1780 (https://doi.org/10.1162/neco.1997.9.8.1735).
- Hodges, S., and Brealey, R. (1972). Portfolio selection in a dynamic and uncertain world. Financial Analysts Journal 28(6), 58–69 (https://doi.org/10.2469/faj.v28.n6.58).
- Huberman, G., and Jiang, W. (2006). Offering versus choice in 401(k) plans: equity exposure and number of funds. Journal of Finance 61(2), 763–801 (https://doi.org/10.1111/j.1540-6261.2006.00854.x).
- Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine 2(8), Paper e124 (https://doi.org/10.1371/journal.pmed.0020124).
- Jagannathan, R., and Ma, T. (2003). Risk reduction in large portfolios: why imposing the wrong constraints helps. Journal of Finance 58(4), 1651–1683 (https://doi.org/10.1111/1540-6261.00580).
- James, W., and Stein, C. (1992). Estimation with quadratic loss. In Breakthroughs in Statistics, Kotz, S., and Johnson, N. L. (eds), pp. 443–460. Springer Series in Statistics. Springer (https://doi.org/10.1007/978-1-4612-0919-5_30).
- Jobson, J. D., and Korkie, B. (1980). Estimation for Markowitz efficient portfolios. Journal of the American Statistical Association 75(371), 544–554 (https://doi.org/10.1080/01621459.1980.10477507).
- Jorion, P. (1985). International portfolio diversification with estimation risk. Journal of Business 58(3), 259–278 (https://doi.org/10.1086/296296).
- Jorion, P. (1991). Bayesian and CAPM estimators of the means: implications for portfolio selection. Journal of Banking and Finance 15(3), 717–727 (https://doi.org/10.1016/0378-4266(91)90094-3).
- Kan, R., and Zhou, G. (2007). Optimal portfolio choice with parameter uncertainty. Journal of Financial and Quantitative Analysis 42(3), 621–656 (https://doi.org/10.1017/S0022109000004129).
- Kingma, D. P., and Ba, J. (2017). Adam: a method for stochastic optimization. Preprint (arXiv:1412.6980) (https://doi.org/10.48550/arXiv.1412.6980).
- Korkie, R., Jobson, D., and Ratti, V. (1979). Improved estimation for Markowitz portfolios using James–Stein estimators. In Proceedings of the American Statistical Association, Business and Economics Statistics Section, Volume 71, pp. 279–284. American Statistical Association, Washington, DC.
- Kritzman, M., Page, S., and Turkington, D. (2010). In defense of optimization: the fallacy of 1/N. Financial Analysts Journal 66(2), 31–39 (https://doi.org/10.2469/faj.v66.n2.6).
- Ledoit, O., and Wolf, M. (2004a). Honey, I shrunk the sample covariance matrix. Journal of Portfolio Management 30(4), 110–119 (https://doi.org/10.3905/jpm.2004.110).
- Ledoit, O., and Wolf, M. (2004b). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2), 365–411 (https://doi.org/10.1016/S0047-259X(03)00096-4).
- Litterman, R. B. (2003). Modern Investment Management: An Equilibrium Approach. Wiley.
- Ma, Y., Han, R., and Wang, W. (2020). Prediction-based portfolio optimization models using deep neural networks. IEEE Access 8, 115393–115405 (https://doi.org/10.1109/ACCESS.2020.3003819).
- MacKinlay, A. C., and Pástor, L. (2000). Asset pricing models: implications for expected returns and portfolio selection. Review of Financial Studies 13(4), 883–916 (https://doi.org/10.1093/rfs/13.4.883).
- Makridakis, S., Gaba, A., Hollyman, R., Petropoulos, F., Spiliotis, E., and Swanson, N. (2022). The M6 financial duathlon competition. Guidelines, M Open Forecasting Center. URL: https://mofc.unic.ac.cy/wp-content/uploads/2022/02/M6-Guidelines-1.pdf.
- Markowitz, H. (1952). Portfolio selection. Journal of Finance 7(1), 77–91 (https://doi.org/10.1111/j.1540-6261.1952.tb01525.x).
- Merton, R. C. (1980). On estimating the expected return on the market: an exploratory investigation. Journal of Financial Economics 8(4), 323–361 (https://doi.org/10.1016/0304-405X(80)90007-0).
- Michaud, R. O. (1989). The Markowitz optimization enigma: is “optimized” optimal? Financial Analysts Journal 45(1), 31–42 (https://doi.org/10.2469/faj.v45.n1.31).
- Pástor, L. (2000). Portfolio selection and asset pricing models. Journal of Finance 55(1), 179–223 (https://doi.org/10.1111/0022-1082.00204).
- Pástor, L., and Stambaugh, R. F. (2000). Comparing asset pricing models: an investment perspective. Journal of Financial Economics 56(3), 335–381 (https://doi.org/10.1016/S0304-405X(00)00044-1).
- Riquelme, C., Tucker, G., and Snoek, J. (2018). Deep Bayesian bandits showdown: an empirical comparison of Bayesian deep networks for Thompson sampling. Preprint (arXiv:1802.09127 [stat.ML]) (https://doi.org/10.48550/arxiv.1802.09127).
- Schmidhuber, J. (2015). Deep learning in neural networks: an overview. Neural Networks 61, 85–117 (https://doi.org/10.1016/j.neunet.2014.09.003).
- Shen, W., Wang, J., Jiang, Y.-G., and Zha, H. (2015). Portfolio choices with orthogonal bandit learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 974–980. AAAI Press, Washington, DC.
- Smale, S., and Zhou, D.-X. (2007). Learning theory estimates via integral operators and their approximations. Constructive Approximation 26(2), 153–172 (https://doi.org/10.1007/s00365-006-0659-y).
- Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. Preprint (arXiv:1409.3215 [cs.CL]) (https://doi.org/10.48550/arxiv.1409.3215).
- Weiss, S. M., and Kulikowski, C. A. (1991). Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, San Francisco, CA.
- Xiaoguang, H., and Feng, F. (2017). Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science 4(11), Paper 171377 (https://doi.org/10.1098/rsos.171377).
- Yao, Y., Rosasco, L., and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation 26(2), 289–315 (https://doi.org/10.1007/s00365-006-0663-2).
- Zhang, C., Zhang, Z., Cucuringu, M., and Zohren, S. (2021). A universal end-to-end approach to portfolio optimization via deep learning. Preprint (arXiv:2111.09170 [q-fin.PM]) (https://doi.org/10.48550/arXiv.2111.09170).
- Zhang, Z., Zohren, S., and Roberts, S. (2020). Deep learning for portfolio optimization. Journal of Financial Data Science 2(4), 8–20 (https://doi.org/10.3905/jfds.2020.1.042).