# Journal of Computational Finance

**ISSN:**

1460-1559 (print)

1755-2850 (online)

**Editor-in-chief:** Christoph Reisinger

# Dilated convolutional neural networks for time series forecasting

Anastasia Borovykh, Sander Bohte and Cornelis W. Oosterlee

#### Abstract

We present a method for conditional time series forecasting based on an adaptation of the recent deep convolutional WaveNet architecture. The proposed network contains stacks of dilated convolutions that allow it to access a broad range of historical data when forecasting. It also uses a rectified linear unit (ReLU) activation function, and conditioning is performed by applying multiple convolutional filters in parallel to separate time series, which allows for the fast processing of data and the exploitation of the correlation structure between the multivariate time series. We test and analyze the performance of the convolutional network both unconditionally and conditionally for financial time series forecasting using the Standard & Poor’s 500 index, the volatility index, the Chicago Board Options Exchange interest rate and several exchange rates, and we extensively compare its performance with those of the well-known autoregressive model and a long short-term memory network. We show that a convolutional network is well suited to regression-type problems and is able to effectively learn dependencies in and between the series without the need for long historical time series, that it is a time-efficient and easy-to-implement alternative to recurrent-type networks, and that it tends to outperform linear and recurrent models.

## 1 Introduction

Forecasting financial time series using past observations has been a topic of significant interest for obvious reasons. It is well known that while temporal relationships exist in the data, they are difficult to analyze and predict accurately due to the nonlinear trends, heavy tails and noise that are present in the series (Cont 2001). In developing models for forecasting financial data, it is desirable that they will both be able to learn nonlinear dependencies in the data and have high noise resistance. Traditional autoregressive models such as vector autoregression (VAR) and autoregressive moving-average (ARMA) (Hamilton 1994) fail to capture nonlinear patterns. Feedforward neural networks have been a popular way of learning dependencies in the data, as they allow us to learn nonlinearities without the need to specify a particular model form in advance (see Zhang et al 1998; Chakraborty et al 1992). Hybrid approaches using neural networks and econometric models have also been proposed (see, for example, Zhang 2003). One downside of classical feedforward neural networks is that a large sample size of data is required to obtain a stable forecasting result.

The main focus of this paper is on multivariate time series forecasting, and specifically financial time series. In particular, we forecast time series conditional on other, related series. Financial time series are known both to have a high noise component and to be of limited duration: even when long histories of stock prices are available, using them can be difficult due to the changing financial environment. At the same time, many different, but strongly correlated, financial time series exist. Here, we aim to exploit multivariate forecasting using the notion of conditioning to reduce the noisiness in short-duration series. Effectively, we use multiple financial time series as inputs into a neural network, thereby conditioning the forecast of a time series on both its own history and on that of multiple other time series. Training a model on multiple stock series allows the network to exploit the correlation structure between these series so that the network can learn the market dynamics in shorter sequences of data. As shown by, for example, Zheng et al (2016) for classification, using multiple conditional time series as inputs can improve both the robustness and the forecast quality of the model by learning about long-term temporal dependencies between series.

A convolutional neural network (CNN) (see LeCun et al 1998) is a biologically inspired type of deep neural network (DNN) that has recently gained popularity due to its success in classification problems (eg, image recognition (Krizhevsky et al 2012) and time series classification (Wang et al 2016)). The CNN consists of a sequence of convolutional layers, the output of which is connected only to local regions in the input. This is achieved by sliding a filter, or weight matrix, over the input and computing the dot product between the two at each point (ie, a convolution between the input and the filter). This structure allows the model to learn filters that are able to recognize specific patterns in the input data. Recent advances in CNNs for time series forecasting can be found in Mittelman (2015), where the authors propose an undecimated convolutional network for time series modeling based on the undecimated wavelet transform, and in Bińkowski et al (2017), in which the authors propose using an autoregressive-type weighting system for forecasting financial time series, where the weights are allowed to be data-dependent by being learned through a CNN. In general, literature on financial time series forecasting with convolutional architectures is still scarce, as these types of networks are much more commonly applied in classification problems. Intuitively, the rationale for applying CNNs to time series forecasting would be to learn filters that represent certain repeating patterns in the series and then to use these to forecast the future values. Due to the layered structure of CNNs, they might work well on noisy series, by discarding the noise in each subsequent layer and extracting only the meaningful patterns, and in this way they would be similar to neural networks that use wavelet-transformed time series (ie, a split in high- and low-frequency components) as input (see, for example, Aussem and Murtagh 1997; Lahmiri 2014).

Currently, recurrent neural networks (RNNs), and in particular the long short-term memory (LSTM) unit (Hochreiter and Schmidhuber 1997; Chung et al 2014), represent the state-of-the-art in time series forecasting (see also Hsu (2017) and, in particular, Fisher and Krauss (2017) for financial forecasting results). The efficiency of these networks can be explained by the recurrent connections that allow the network to access the entire history of previous time series values. Alternatively, one might employ a CNN with multiple layers of dilated convolutions (Yu and Koltun 2015). The dilated convolutions, in which the filter is applied by skipping certain elements in the input, allow for the receptive field of the network to grow exponentially, thereby allowing the network to access a broad range of history (as is the case for the RNN). The advantage of the CNN over the recurrent-type network is that due to the convolutional structure of the network, the number of trainable weights is small, resulting in much more efficient training and prediction.

Motivated by van den Oord et al (2016c) – in which the performance of the PixelCNN is compared with that of the PixelRNN (van den Oord et al 2016b), a network used for image generation – in this paper we aim to investigate how the performance of the CNN compares with that of autoregressive and recurrent models of forecasting noisy financial time series. The CNN we employ is a network inspired by the convolutional WaveNet model from van den Oord et al (2016a) that was first developed for audio forecasting, whose structure we simplify and optimize for multivariate time series forecasting. Our network focuses on learning long-term relationships in and between multivariate, noisy time series. As in van den Oord et al (2016a), we make use of dilated convolutions; however, these convolutions are applied with parameterized skip connections (He et al 2016) from both the input time series and the time series we condition on, and in this way we learn long- and short-term interdependencies in an efficient manner. Further, the gated activation function from the original WaveNet model is replaced with a rectified linear unit (ReLU), simplifying the model and reducing the training time.

This paper makes several main contributions. First, we present a CNN inspired by the WaveNet model but with a structure that is simplified and optimized for time series forecasting, ie, using a ReLU activation and a novel and more effective way of conditioning with parameterized skip connections. Second, knowing the strong performance of CNNs on classification problems, our work is, to the best of our knowledge, the first to show that they can be applied successfully to forecasting financial time series of limited length. By conducting an extensive analysis of the WaveNet model and comparing the performance with that of an LSTM (which represents the current state-of-the-art when it comes to forecasting) and an autoregressive model popular in econometrics, our paper shows that the WaveNet model is a time-efficient and easy-to-implement alternative to recurrent-type networks and that it tends to outperform the linear and recurrent models. Finally, using examples on artificial time series as well as the Standard & Poor’s 500 (S&P 500) index, the volatility index, the Chicago Board Options Exchange (CBOE) interest rate and five exchange rates, we show that the WaveNet model’s efficient conditioning method enables one to extract temporal relationships between time series, thereby improving the forecast while at the same time limiting the requirement for a long historical price series and reducing the noise, since it allows one to exploit the correlations between related time series. As a whole, we show that convolutional networks can be a much simpler and easier-to-train alternative to recurrent networks while achieving accuracy that is at least as good on nonlinear, noisy forecasting tasks.

## 2 The model

In this section, we start with a review of neural networks and CNNs. Then, we introduce the particular convolutional network structure that will be used for time series forecasting.

### 2.1 Background

#### 2.1.1 Feedforward neural networks

A basic feedforward neural network consists of $L$ layers with ${M}_{l}$ hidden nodes in each layer $l=1,\mathrm{\dots},L$. Suppose we are given as input $x(1),\mathrm{\dots},x(t)$ and we want to use the multilayer neural network to output the forecasted value at the next time step $\widehat{x}(t+1)$. In the first layer, we construct ${M}_{1}$ linear combinations of the input variables in the form

${a}^{1}(i)={\displaystyle \sum _{j=1}^{t}}{w}^{1}(i,j)x(j)+{b}^{1}(i)\mathit{\hspace{1em}}\text{for } i=1,\mathrm{\dots},{M}_{1},$ | (2.1) |

where ${w}^{1}\in {\mathbb{R}}^{{M}_{1}\times t}$ are referred to as the weights and ${b}^{1}\in {\mathbb{R}}^{{M}_{1}}$ as the biases. Each of the outputs ${a}^{1}(i)$, $i=1,\mathrm{\dots},{M}_{1}$, are then transformed using a differentiable, nonlinear activation function $h(\cdot )$ to give

${f}^{1}(i)=h({a}^{1}(i))\mathit{\hspace{1em}}\text{for } i=1,\mathrm{\dots},{M}_{1}.$ | (2.2) |

The nonlinear function enables the model to learn nonlinear relations between the data points. In every subsequent layer $l=2,\mathrm{\dots},L-1$, the outputs from the previous layer ${f}^{l-1}$ are again linearly combined and passed through the nonlinearity

${f}^{l}(i)=h\left({\displaystyle \sum _{j=1}^{{M}_{l-1}}}{w}^{l}(i,j){f}^{l-1}(j)+{b}^{l}(i)\right)\mathit{\hspace{1em}}\text{for } i=1,\mathrm{\dots},{M}_{l},$ | (2.3) |

with ${w}^{l}\in {\mathbb{R}}^{{M}_{l}\times {M}_{l-1}}$ and ${b}^{l}\in {\mathbb{R}}^{{M}_{l}}$. In the final layer $l=L$ of the neural network, the forecasted value $\widehat{x}(t+1)$ is computed using

$\widehat{x}(t+1)=h\left({\displaystyle \sum _{j=1}^{{M}_{L-1}}}{w}^{L}(j){f}^{L-1}(j)+{b}^{L}\right),$ | (2.4) |

with ${w}^{L}\in {\mathbb{R}}^{1\times {M}_{L-1}}$ and ${b}^{L}\in \mathbb{R}$. In a neural network, every node is thus connected to every node in adjacent layers (see Figure 1).
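The forward pass in (2.1)–(2.4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the layer sizes, the random weights and the choice of `tanh` for $h(\cdot)$ are assumptions made only for the example.

```python
import numpy as np

def feedforward_forecast(x, weights, biases, h=np.tanh):
    """One-step forecast x_hat(t+1) from inputs x(1..t), following (2.1)-(2.4)."""
    f = np.asarray(x, dtype=float)
    for w, b in zip(weights, biases):
        f = h(w @ f + b)  # linear combination of the previous layer, then nonlinearity
    return float(f[0])    # the final layer has a single node

# Hypothetical sizes: t = 5 inputs, M_1 = 4, M_2 = 3, one output node.
rng = np.random.default_rng(0)
dims = [5, 4, 3, 1]
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
biases = [rng.normal(size=dims[i + 1]) for i in range(3)]
x_hat = feedforward_forecast(rng.normal(size=5), weights, biases)
```

Every weight matrix here is dense, which is the full connectivity between adjacent layers described above.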

#### 2.1.2 Convolutions

A discrete convolution of two one-dimensional signals $f$ and $g$, written as $f*g$, is defined as

$(f*g)(i)={\displaystyle \sum _{j=-\mathrm{\infty}}^{\mathrm{\infty}}}f(j)g(i-j),$ | (2.5) |

where, depending on the definition of the convolution, nonexistent samples in the input may be defined to have values of zero, often referred to as zero padding, or the product is only computed at points where samples exist in both signals. Note that a convolution is commutative, ie, $(f*g)=(g*f)$. If the signals are finite, the infinite convolution may be truncated. In other words, if $f=[f(0),\mathrm{\dots},f(N-1)]$ and $g=[g(0),\mathrm{\dots},g(M-1)]$, then the convolution of the two is given by

$(f*g)(i)={\displaystyle \sum _{j=0}^{M-1}}f(j)g(i-j).$ | (2.6) |

The size of the convolution output depends on the way undefined samples are handled. If a certain proportion of the undefined samples is set to zero, this is referred to as zero padding. If we do not apply zero padding, the output has size $N-M+1$ (so that $i=0,\mathrm{\dots},N-M$), while padding with $p$ zeros on both sides of the input signal $f$ results in an output of size $N-M+2p+1$. The zero padding thus allows one to control the output size of the convolution, adjusting it to be either decreasing, the same or increasing with respect to the input size. A convolution at point $i$ is thus computed by shifting the signal $g$ over the input $f$ along $j$ and computing the weighted sum of the two.
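The output sizes quoted above can be checked numerically. The sketch below uses `numpy.convolve` in `"valid"` mode, which evaluates the product only where both signals have samples; the helper name `conv_output` and the example signals are assumptions for illustration.

```python
import numpy as np

def conv_output(f, g, p=0):
    """Convolve f (length N) with g (length M) after padding f with p zeros per side."""
    f_padded = np.pad(np.asarray(f, dtype=float), p)
    return np.convolve(f_padded, g, mode="valid")

f = np.arange(10.0)              # N = 10
g = np.array([1.0, -1.0])        # M = 2
no_pad = conv_output(f, g)       # size N - M + 1 = 9
padded = conv_output(f, g, p=1)  # size N - M + 2p + 1 = 11
```

With $p=1$ the output is longer than the input, and with $p=0$ it is shorter, matching the two cases in the text.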

#### 2.1.3 Convolutional neural networks

CNNs were developed with the idea of local connectivity. Each node is connected only to a local region in the input (see Figure 1). The spatial extent of this connectivity is referred to as the receptive field of the node. The local connectivity is achieved by replacing the weighted sums from the neural network with convolutions. In each layer of the CNN, the input is convolved with the weight matrix (also called the filter) to create a feature map. In other words, the weight matrix slides over the input and computes the dot product between the input and the weight matrix. Note that, unlike for regular neural networks, all the values in the output feature map share the same weights. This means that all the nodes in the output detect exactly the same pattern. The local connectivity and shared weights aspect of CNNs reduces the total number of learnable parameters, resulting in more efficient training. The intuition behind a CNN is therefore to learn in each layer a weight matrix that will be able to extract the necessary, translation-invariant features from the input.

The input to a convolutional layer is usually taken to be three dimensional: the height, the width and the number of channels. In the first layer, this input is convolved with a set of ${M}_{1}$ three-dimensional filters applied over all the input channels (in other words, the third dimension of the filter map is always equal to the number of channels in the input) to create the feature output map. Now consider a one-dimensional input $x={({x}_{t})}_{t=0}^{N-1}$ of size $N$ with no zero padding. The output feature map from the first layer is then given by convolving each filter ${w}_{h}^{1}$ for $h=1,\mathrm{\dots},{M}_{1}$ with the input:

${a}^{1}(i,h)=({w}_{h}^{1}*x)(i)={\displaystyle \sum _{j=-\mathrm{\infty}}^{\mathrm{\infty}}}{w}_{h}^{1}(j)x(i-j),$ | (2.7) |

where ${w}_{h}^{1}\in {\mathbb{R}}^{1\times k\times 1}$ and ${a}^{1}\in {\mathbb{R}}^{1\times (N-k+1)\times {M}_{1}}$. Note that since the number of input channels in this case is one, the weight matrix also has only one channel. As for the feedforward neural network, this output is then passed through the nonlinearity $h(\cdot )$ to give ${f}^{1}=h({a}^{1})$.

In each subsequent layer $l=2,\mathrm{\dots},L$, the input feature map

$${f}^{l-1}\in {\mathbb{R}}^{1\times {N}_{l-1}\times {M}_{l-1}},$$

where $1\times {N}_{l-1}\times {M}_{l-1}$ is the size of the output filter map from the previous convolution with ${N}_{l-1}={N}_{l-2}-k+1$, is convolved with a set of ${M}_{l}$ filters ${w}_{h}^{l}\in {\mathbb{R}}^{1\times k\times {M}_{l-1}}$, $h=1,\mathrm{\dots},{M}_{l}$, to create a feature map ${a}^{l}\in {\mathbb{R}}^{1\times {N}_{l}\times {M}_{l}}$:

${a}^{l}(i,h)=({w}_{h}^{l}*{f}^{l-1})(i)={\displaystyle \sum _{j=-\mathrm{\infty}}^{\mathrm{\infty}}}{\displaystyle \sum _{m=1}^{{M}_{l-1}}}{w}_{h}^{l}(j,m){f}^{l-1}(i-j,m).$ | (2.8) |

The output of this is then passed through the nonlinearity to give ${f}^{l}=h({a}^{l})$. The filter size parameter $k$ thus controls the receptive field of each output node. Without zero padding, the convolution output in every layer has width ${N}_{l}={N}_{l-1}-k+1$ for $l=1,\mathrm{\dots},L$. Since all the elements in a feature map share the same weights, this allows features to be detected in a time-invariant manner, while at the same time reducing the number of trainable parameters. The output of the network after $L$ convolutional layers will be the matrix ${f}^{L}$, the size of which depends on the filter size and the number of filters used in the final layer. Depending on what we want our model to learn, the weights in the model are trained to minimize the error between the output from the network ${f}^{L}$ and the true output in which we are interested.

### 2.2 Structure

Consider a one-dimensional time series $x={({x}_{t})}_{t=0}^{N-1}$. Given a model with parameter value $\theta $, the task for a predictor is to output the next value $\widehat{x}(t+1)$ conditional on the history of the series, $x(0),\mathrm{\dots},x(t)$. This can be done by maximizing the likelihood function

$$p(x\mid \theta )=\prod _{t=0}^{N-1}p(x(t+1)\mid x(0),\mathrm{\dots},x(t),\theta ).$$ | (2.9) |

To learn this likelihood function, we present a CNN in the form of the WaveNet architecture (van den Oord et al 2016a) augmented with a number of recent architectural improvements for neural networks such that the architecture can be applied successfully to time series prediction.

Time series often display long-term correlations, so to enable the network to learn these long-term dependencies we use stacked layers of dilated convolutions. As introduced in Yu and Koltun (2015), a dilated convolution outputs a stack of ${M}_{l}$ feature maps given by

$$({w}_{h}^{l}{*}_{d}{f}^{l-1})(i)=\sum _{j=-\mathrm{\infty}}^{\mathrm{\infty}}\sum _{m=1}^{{M}_{l-1}}{w}_{h}^{l}(j,m){f}^{l-1}(i-d\cdot j,m),$$ | (2.10) |

where $d$ is the dilation factor and ${M}_{l}$ is the number of channels. In other words, in a dilated convolution, the filter is applied to every $d$th element in the input vector, allowing the model to efficiently learn connections between far-apart data points. We use an architecture similar to those in Yu and Koltun (2015) and van den Oord et al (2016a), with $L$ layers of dilated convolutions $l=1,\mathrm{\dots},L$ and with the dilations increasing by a factor of two: $d\in [{2}^{0},{2}^{1},\mathrm{\dots},{2}^{L-1}]$. The filters $w$ are chosen to be of size $1\times k:=1\times 2$. An example of a three-layer dilated convolutional network is shown in Figure 2. Using the dilated convolutions instead of regular ones allows each output value to be influenced by more nodes in the input. The input of the network is given by the time series $x={({x}_{t})}_{t=0}^{N-1}$. In each subsequent layer, we apply the dilated convolution, followed by a nonlinearity, giving the output feature maps ${f}^{l}$, $l=1,\mathrm{\dots},L$. These $L$ layers of dilated convolutions are then followed by a $1\times 1$ convolution, which reduces the number of channels back to one, so that the model outputs a one-dimensional vector. Since we are interested in forecasting the subsequent values of the time series, we will train the model so that this output is the forecasted time series $\widehat{x}={({\widehat{x}}_{t})}_{t=0}^{N-1}$.
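As an illustration of (2.10), the following NumPy sketch evaluates a dilated convolution over a multichannel feature map. The array layout, the function name and the left zero padding (so that the output keeps the input length) are assumptions of this example rather than details taken from the paper's code.

```python
import numpy as np

def dilated_causal_conv(f_prev, w, d):
    """Dilated convolution (2.10) with left zero padding.

    f_prev: (N, M_prev) input feature map; w: (M_out, k, M_prev) filters.
    Returns out with out[i, h] = sum_{j, m} w[h, j, m] * f_prev[i - d*j, m].
    """
    n = f_prev.shape[0]
    m_out, k, _ = w.shape
    padded = np.vstack([np.zeros(((k - 1) * d, f_prev.shape[1])), f_prev])
    out = np.zeros((n, m_out))
    for j in range(k):
        shift = (k - 1 - j) * d  # row i of this slice holds f_prev[i - d*j]
        out += padded[shift:shift + n] @ w[:, j, :].T
    return out

x = np.arange(8.0).reshape(8, 1)  # one input channel
w = np.ones((1, 2, 1))            # one filter of size k = 2
y = dilated_causal_conv(x, w, d=2)  # y[i] = x[i] + x[i - 2]
```

With $d=2$ each output combines an input value with the value two steps earlier, skipping the element in between.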

The receptive field of a neuron was defined as the set of elements in its input that modifies the output value of that neuron. Now, we define the receptive field $r$ of the model to be the number of neurons in the input in the first layer, ie, the time series, that can modify the output in the final layer, ie, the forecasted time series. This then depends on the number of layers $L$ and the filter size $k$, and it is given by

$r:={2}^{L-1}k.$ | (2.11) |

In Figure 2, the receptive field is given by $r=8$, meaning that one output value is influenced by eight input neurons.
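The value $r=8$ for $L=3$ and $k=2$ can be verified empirically by perturbing one input at a time and counting how many perturbations reach the last output. The sketch below is a simplified, single-channel check with all-ones filters (an assumption made so that no contribution cancels); it is not the trained network.

```python
import numpy as np

def causal_dilated_stack(x, num_layers, k=2):
    """Apply num_layers causal dilated convolutions (all-ones filters, d = 2**layer)."""
    out = np.asarray(x, dtype=float)
    n = len(out)
    for layer in range(num_layers):
        d = 2 ** layer
        padded = np.concatenate([np.zeros((k - 1) * d), out])
        out = sum(padded[(k - 1 - j) * d:(k - 1 - j) * d + n] for j in range(k))
    return out

def measured_receptive_field(num_layers, k=2, n=64):
    """Count the input positions that change the last output of the stack."""
    base = causal_dilated_stack(np.zeros(n), num_layers, k)[-1]
    hits = 0
    for i in range(n):
        x = np.zeros(n)
        x[i] = 1.0  # unit impulse at position i
        hits += int(causal_dilated_stack(x, num_layers, k)[-1] != base)
    return hits

r3 = measured_receptive_field(3)  # compare with 2**(3 - 1) * 2
r4 = measured_receptive_field(4)  # compare with 2**(4 - 1) * 2
```

For the filter size $k=2$ used here, the measured counts agree with (2.11).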

As mentioned before, sometimes it is convenient to pad the input with zeros around the border. The size of this zero padding then controls the size of the output. In our case, so as not to violate the adaptability constraint on $x$, we want to make sure that the receptive field of the network when predicting $x(t+1)$ contains only $x(0),\mathrm{\dots},x(t)$. To do this, we use causal convolutions, where the word causal indicates that the convolution output should not depend on future inputs. In time series, this is equivalent to padding the input with a vector of zeros of the size of the receptive field, so that the input is given by

$[0,\mathrm{\dots},0,x(0),\mathrm{\dots},x(N-1)]\in {\mathbb{R}}^{N+r}$ | (2.12) |

and the output of the $L$-layer WaveNet is

$[\widehat{x}(0),\mathrm{\dots},\widehat{x}(N)]\in {\mathbb{R}}^{N+1}.$ | (2.13) |

At training time, the prediction of $x(1),\mathrm{\dots},x(N)$ is thus computed by convolving the input with the kernels in each layer $l=1,\mathrm{\dots},L$, followed by the $1\times 1$ convolution. At testing time, a one-step-ahead prediction $\widehat{x}(t+1)$ for $(t+1)\ge r$ is given by inputting $[x(t+1-r),\mathrm{\dots},x(t)]$ in the trained model. An $n$-step-ahead forecast is made sequentially by feeding each prediction back into the network at the next time step, eg, a two-step-ahead out-of-sample forecast $\widehat{x}(t+2)$ is made using $[x(t+2-r),\mathrm{\dots},\widehat{x}(t+1)]$.
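The sequential $n$-step-ahead procedure can be sketched as a loop that slides a window of length $r$ and appends each forecast to it. The `model` argument is a hypothetical callable standing in for the trained network; here it is replaced by a simple window mean purely for illustration.

```python
import numpy as np

def forecast_n_steps(model, history, r, n_steps):
    """Sequential n-step-ahead forecast: each prediction is fed back as an input.

    model: hypothetical callable mapping the last r values to a one-step forecast.
    """
    window = list(history[-r:])
    forecasts = []
    for _ in range(n_steps):
        x_hat = float(model(np.array(window)))
        forecasts.append(x_hat)
        window = window[1:] + [x_hat]  # drop the oldest value, append the forecast
    return forecasts

# Stand-in for a trained network (assumption): the mean of the window.
preds = forecast_n_steps(lambda w: w.mean(),
                         history=np.array([1.0, 2.0, 3.0, 4.0]), r=2, n_steps=2)
```

The second forecast is computed from one observed value and one predicted value, which is why errors can accumulate over longer horizons.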

The idea of the network is thus to use the capabilities of CNNs as autoregressive forecasting models. In a simple autoregressive model of order $p$, the forecasted value for $x(t+1)$ is given by

$$\widehat{x}(t+1)=\sum _{i=1}^{p}{\alpha}_{i}{x}_{t-i}+\epsilon (t),$$

where ${\alpha}_{i}$, $i=1,\mathrm{\dots},p$, are learnable weights and $\epsilon (t)$ is white noise. With the WaveNet model as defined above, the forecasted conditional expectation for every $t\in \{0,\mathrm{\dots},N\}$ is

$\mathbb{E}[x(t+1)\mid x(t),\mathrm{\dots},x(t-r)]={\beta}_{1}(x(t-r))+\mathrm{\cdots}+{\beta}_{r}(x(t)),$ | (2.14) |

where the functions ${\beta}_{i}$, $i=1,\mathrm{\dots},r$, are data-dependent and optimized through the convolutional network. We remark that even though the weights depend on the underlying data, due to the convolutional structure of the network, the weights are shared across the outputted filter map, resulting in a weight matrix that is translation invariant.

##### Objective function.

The network weights, the filters ${w}_{h}^{l}$, are trained to minimize the mean absolute error (MAE); to avoid overfitting, ie, too large weights, we use an ${L}^{2}$ regularization with regularization term $\gamma $, so that the cost function is given by

$$E(w)=\frac{1}{N}\sum _{t=0}^{N-1}|\widehat{x}(t+1)-x(t+1)|+\frac{\gamma}{2}\sum _{l=0}^{L}\sum _{h=1}^{{M}_{l+1}}{({w}_{h}^{l})}^{2},$$ | (2.15) |

where $\widehat{x}(t+1)$ denotes the forecast of $x(t+1)$ using $x(0),\mathrm{\dots},x(t)$. Minimizing $E(w)$ results in a choice of weights that make a trade-off between fitting the training data and being small. Too large weights often result in the network being overfitted on the training data, so the ${L}^{2}$ regularization, by forcing the weights not to become too big, enables the model to generalize better on unseen data.
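A direct transcription of (2.15) is given below as a minimal sketch. The function name and the toy numbers are assumptions; `filters` is taken to be a list containing every weight array in the network.

```python
import numpy as np

def cost(x_hat, x_true, filters, gamma):
    """MAE plus L2 penalty on the weights, as in (2.15)."""
    mae = np.mean(np.abs(np.asarray(x_hat) - np.asarray(x_true)))
    l2 = sum(np.sum(w ** 2) for w in filters)  # sum of squared weights over all filters
    return float(mae + 0.5 * gamma * l2)

# Toy values (assumption): MAE = 1, squared weight norm = 9, gamma = 0.1.
value = cost([1.0, 2.0, 3.0], [1.0, 1.0, 5.0],
             filters=[np.array([1.0, 2.0]), np.array([[2.0]])], gamma=0.1)
```

Increasing `gamma` shifts the trade-off toward smaller weights at the expense of the fit to the training data.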

###### Remark 2.1 (Relation to the Bayesian framework).

In a Bayesian framework, minimizing this cost function is equivalent to maximizing the posterior distribution under a Laplace distributed likelihood function centered at the value outputted by the model $\widehat{x}(t+1)$ with a fixed scale parameter $\beta =\frac{1}{2}$,

$p(x(t+1)\mid x(0),\mathrm{\dots},x(t),\theta )\sim \mathrm{Laplace}(\widehat{x}(t+1),\beta ),$ | (2.16) |

and with a Gaussian prior on the model parameters.

The output is obtained by running a forward pass through the network with the optimal weights being a point estimate from the posterior distribution. Since the MAE is a scale-dependent accuracy measure, one should normalize the input data to make the error comparable for different time series.

##### Weight optimization.

The aim of training the model is to find the weights that minimize the cost function in (2.15). A standard weight optimization is based on gradient descent, in which one incrementally updates the weights based on the gradient of the error function,

${w}_{h}^{l}(\tau +1)={w}_{h}^{l}(\tau )-\eta \nabla E(w(\tau )),$ | (2.17) |

for $\tau =1,\mathrm{\dots},T$, where $T$ is the number of training iterations and $\eta $ is the learning rate. Each iteration $\tau $ thus consists of a forward run in which one computes the forecasted vector $\widehat{x}$ and the corresponding error $E(w(\tau ))$, and a backward pass in which the gradient vector $\nabla E(w(\tau ))$, containing derivatives with respect to each weight ${w}_{h}^{l}$, is computed and the weights are updated according to (2.17). The gradient vector is computed through backpropagation, which amounts to applying the chain rule iteratively from the error function computed in the final layer until the gradient with respect to the required layer weight ${w}_{h}^{l}$ is obtained:

${\displaystyle \frac{\partial E(w(\tau ))}{\partial {w}_{h}^{l}(j,m)}}={\displaystyle \sum _{i=1}^{{N}_{l}}}{\displaystyle \frac{\partial E(w(\tau ))}{\partial {f}^{l}(i,h)}}{\displaystyle \frac{\partial {f}^{l}(i,h)}{\partial {a}^{l}(i,h)}}{\displaystyle \frac{\partial {a}^{l}(i,h)}{\partial {w}_{h}^{l}(j,m)}},$ | (2.18) |

where we sum over all the nodes in which the weight of interest occurs. The number of training iterations $T$ is chosen so as to achieve convergence in the error. Here, we employ a slightly modified weight update by using the Adam gradient descent (Kingma and Ba 2014). This method computes adaptive learning rates for each parameter by keeping an exponentially decaying average of past gradients and squared gradients and using these to update the parameters. The adaptive learning rate allows the gradient descent to find the minimum more accurately.
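The Adam update described above can be sketched as follows, using the default decay rates from Kingma and Ba (2014). The toy objective $E(w)=|w-3|$, the learning rate and the iteration count are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def adam_step(w, grad, m, v, tau, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration tau (tau starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # decaying average of past gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** tau)           # bias corrections
    v_hat = v / (1 - beta2 ** tau)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy problem (assumption): minimise E(w) = |w - 3|, with gradient sign(w - 3).
w, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for tau in range(1, 2001):
    w, m, v = adam_step(w, np.sign(w - 3.0), m, v, tau, eta=0.01)
```

Compared with the plain update (2.17), the step for each weight is rescaled by its own gradient history rather than by a single global learning rate.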

##### Activation functions.

In each layer, we use a nonlinearity, or activation function, to transform the output from the convolution, thereby allowing the model to learn nonlinear representations of the data. In our model, the nonlinearity takes the form of the rectified linear unit (ReLU), which is defined as $\mathrm{ReLU}(x):=\mathrm{max}(x,0)$, so that the output from layer $l$ is

$${f}^{l}=[\mathrm{ReLU}({w}_{1}^{l}{*}_{d}{f}^{l-1}+b),\mathrm{\dots},\mathrm{ReLU}({w}_{{M}_{l}}^{l}{*}_{d}{f}^{l-1}+b)],$$ | (2.19) |

where $b\in \mathbb{R}$ denotes the bias that shifts the input to the nonlinearity, ${*}_{d}$ denotes the convolution with dilation $d$ as usual, and ${f}^{l}\in {\mathbb{R}}^{1\times {N}_{l}\times {M}_{l}}$ denotes the output of the convolution with filters ${w}_{h}^{l}$, $h=1,\mathrm{\dots},{M}_{l}$, in layer $l$. Unlike the gated activation function used in van den Oord et al (2016a) for audio generation, here we propose to use the ReLU, as it was found to be the most efficient activation function when applied to the forecasting of the nonstationary, noisy time series. At the same time, using the ReLU reduces the training time and thus simplifies the model. The final layer, $l=L$, has a linear activation function, which when followed by the $1\times 1$ convolution outputs the forecasted value of the time series $\widehat{x}=[\widehat{x}(0),\mathrm{\dots},\widehat{x}(N)]$.

When training a deep neural network, one of the problems that keeps the network from learning the optimal weights is that of the vanishing/exploding gradient (Bengio et al 1994; Glorot and Bengio 2010). As backpropagation computes the gradients by the chain rule, when the derivative of the activation function takes on either small or large values, multiplication of these numbers can result in the gradients for the weights in the initial layers vanishing or exploding, respectively. This results in the weights either being updated too slowly, due to the too small gradient, or not being able to converge to the minimum, due to the gradient descent step being too large. One solution to this problem is to initialize the weights of the convolutional layers in such a way that neither in the forward propagation nor in the backward propagation of the network do the weights reduce or magnify the magnitudes of the input signal and gradients, respectively. A proper initialization of the weights would keep the signal and the gradients in a reasonable range of values throughout the layers so that no information would be lost while training the network. As derived in He et al (2015), to ensure that the variance of the input is similar to the variance of the output, a sufficient condition is

$\frac{1}{2}z\mathrm{Var}[{w}_{h}^{l}]=1\mathit{\hspace{1em}}\text{for } h=1,\mathrm{\dots},{M}_{l+1},\forall l,$ | (2.20) |

which leads to a zero-mean Gaussian distribution whose standard deviation is $\sqrt{2/z}$, where $z$ is the total number of trainable parameters in the layer. In other words, the weights of the ReLU units are initialized (for $\tau =0$) as

$${w}_{h}^{l}\sim \mathcal{N}(0,\sqrt{2/z}),$$ | (2.21) |

with $z={M}_{l}\cdot 1\cdot k$: the number of filters in layer $l$ times the filter size $1\times k$.
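The initialization in (2.21) amounts to drawing each filter entry from a zero-mean Gaussian with standard deviation $\sqrt{2/z}$. The sketch below follows the definition of $z$ stated in the text; the function name, the array layout and the example sizes are assumptions.

```python
import numpy as np

def he_init(num_filters, k, in_channels, rng):
    """Sample filters with standard deviation sqrt(2 / z), z = num_filters * 1 * k."""
    z = num_filters * 1 * k  # z as defined in the text for a 1 x k filter
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / z),
                      size=(num_filters, k, in_channels))

rng = np.random.default_rng(0)
w = he_init(num_filters=500, k=2, in_channels=1, rng=rng)
target_std = np.sqrt(2.0 / (500 * 2))
```

With enough filters, the sample standard deviation `w.std()` should be close to `target_std`.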

##### Residual learning.

When adding more layers to the network, standard backpropagation becomes unable to find the optimal weights, resulting in a higher training error. This problem, called the degradation problem (He et al 2016), is therefore not caused by overfitting. Consider a shallow network with a small number of layers alongside its deeper counterpart. The deeper model should not result in a higher training error, since there exists a solution by construction: set all the weights in the added layers to identity mappings. However, in practice, gradient descent algorithms tend to have problems learning the identity mappings. The proposed way around this problem is to use residual connections (He et al 2016) that force the network to approximate $\mathscr{H}(x)-x$ instead of $\mathscr{H}(x)$, the desired mapping, so that the identity mapping can be learned by driving all weights to zero. Optimizing the residual mapping by driving the weights to zero tends to be easier than learning the identity. The way residual connections are implemented is by using shortcut connections, which skip one or more layers and thus get added unmodified to the output from the skipped layers. While in reality the optimal weights are unlikely to exactly match the identity mappings, if the optimal function is closer to the identity than a zero mapping, the proposed residual connections will still aid the network in learning the better optimal weights.

As in van den Oord et al (2016a), in our network we add a residual connection after each dilated convolution, from the input of the convolution to its output. The output from the nonlinearity is passed through a $1\times 1$ convolution prior to adding the residual connection. This is done to make sure that the residual connection and the output from the dilated convolution both have the same number of channels. This allows us to stack multiple layers while retaining the ability of the network to correctly map dependencies learned in the initial layers.
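One residual layer of this kind can be sketched as a dilated convolution, a ReLU, a $1\times 1$ convolution and the shortcut addition. Biases are omitted, and the array layout and names are assumptions of this illustration rather than the authors' code.

```python
import numpy as np

def residual_block(f_prev, w_dilated, w_1x1, d):
    """f_prev: (N, C); w_dilated: (k, C, M); w_1x1: (M, C). Returns an (N, C) array."""
    n, k = f_prev.shape[0], w_dilated.shape[0]
    padded = np.vstack([np.zeros(((k - 1) * d, f_prev.shape[1])), f_prev])
    # Causal dilated convolution: a[i] = sum_j f_prev[i - d*j] @ w_dilated[j]
    a = sum(padded[(k - 1 - j) * d:(k - 1 - j) * d + n] @ w_dilated[j] for j in range(k))
    hidden = np.maximum(a, 0.0)     # ReLU
    return f_prev + hidden @ w_1x1  # 1x1 convolution back to C channels, then shortcut

x = np.arange(12.0).reshape(6, 2)
identity_out = residual_block(x, np.zeros((2, 2, 3)), np.zeros((3, 2)), d=2)
rng = np.random.default_rng(0)
out = residual_block(x, rng.normal(size=(2, 2, 3)), rng.normal(size=(3, 2)), d=2)
```

With all weights set to zero the block returns its input unchanged, which is the identity mapping that the residual formulation makes easy to represent.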

### 2.3 Relation to the discrete wavelet transform

The structure of the network is closely related to the discrete wavelet transform (DWT). Wavelet analysis can be used to understand how a given function changes from one period to the next by matching a wavelet function, of varying scales (widths) and positions, to the function. The DWT is a linear transform of $x={({x}_{t})}_{t=0}^{N-1}$ with $N={2}^{J}$, which decomposes the signal into its high- and low-frequency components by convolving it with high- and low-pass filters. In particular, at each level $j$ of the transform the input signal is decomposed into ${N}_{j}=N/{2}^{j}$ wavelet and scaling coefficients $\langle x,{\psi}_{j,k}\rangle$ and $\langle x,{\varphi}_{j,k}\rangle$ (also called details and approximations, respectively) for $k=0,\dots,{N}_{j}-1$, by convolving the input $x$ simultaneously with filters $h$ and $g$ given by

$$h(t)={2}^{-j/2}\psi (-{2}^{-j}t), \tag{2.22}$$

$$g(t)={2}^{-j/2}\varphi (-{2}^{-j}t), \tag{2.23}$$

where $\psi (\cdot )$ is the wavelet and $\varphi (\cdot )$ is the scaling function. At every subsequent level we apply the transform to the approximation coefficients, thereby discarding the high-frequency components (the details) and ending up with a smoothed version of the input signal. This is very similar to the structure of the CNN, where in each subsequent layer we convolve the input from the previous layer with a learnable filter. In each layer, the filter is used to recognize local dependencies in the data, which are subsequently combined to represent more global features, until in the final layer we compute the output of interest. By allowing the filter to be learnable as opposed to fixed a priori, as is the case in the DWT, we aim to find the filter weights that minimize the objective function (2.15) by recognizing certain patterns in the data, and in this way we obtain an accurate forecast of the time series.
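To make the multilevel decomposition concrete, the sketch below applies the transform with the Haar wavelet, chosen here only because it is the simplest case; the text does not commit to a particular wavelet, and the function names and the sign convention of the detail coefficients are assumptions. At each level the current approximation of length $N/2^{j-1}$ is split into $N/2^{j}$ approximation and $N/2^{j}$ detail coefficients, and only the approximation is passed on.

```python
import math


def haar_step(x):
    """One DWT level with the Haar wavelet: pairwise scaled sums
    (approximations, low-pass) and differences (details, high-pass)."""
    s = math.sqrt(2.0)
    half = len(x) // 2
    approx = [(x[2 * k] + x[2 * k + 1]) / s for k in range(half)]
    detail = [(x[2 * k] - x[2 * k + 1]) / s for k in range(half)]
    return approx, detail


def haar_dwt(x):
    """Full decomposition of a signal of length N = 2**J.
    Returns the final approximation and the details of levels 1..J."""
    approx = list(x)
    details = []
    while len(approx) > 1:
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details


signal = [4.0, 2.0, 6.0, 8.0]          # N = 4, J = 2
approx, details = haar_dwt(signal)
```

Here the filters are fixed in advance; in the CNN the analogous filter weights are learned from the data.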

### 2.4 Conditioning

When forecasting a time series $x={({x}_{t})}_{t=0}^{N-1}$ conditional on another series $y={({y}_{t})}_{t=0}^{N-1}$, we aim to maximize the conditional likelihood:

$$p(x\mid y,\theta )=\prod _{t=0}^{N-1}p(x(t+1)\mid x(0),\dots,x(t),y(0),\dots,y(t),\theta ). \tag{2.24}$$

The conditioning on the time series $y$ is done by computing the activation function of the convolution with filters ${w}_{h}^{1}$ and ${v}_{h}^{1}$ in the first layer as

$$\mathrm{ReLU}({w}_{h}^{1}{*}_{d}x+b)+\mathrm{ReLU}({v}_{h}^{1}{*}_{d}y+b) \tag{2.25}$$

for each of the filters $h=1,\mathrm{\dots},{M}_{1}$. When predicting $x(t+1)$, the receptive field of the network must contain only $x(0),\mathrm{\dots},x(t)$ and $y(0),\mathrm{\dots},y(t)$. Therefore, similar to the input, to preserve causality the condition is appended with a vector of zeros the size of the receptive field. In van den Oord et al (2016a), the authors propose to take ${v}_{h}^{1}$ as a $1\times 1$ filter. Given the short input window, this type of conditioning is not always able to capture all dependencies between the time series. Therefore, we use a $1\times k$ convolution, increasing the probability of the correct dependencies being learned with fewer layers. The receptive field of the network thus contains ${2}^{L-1}k$ elements of both the input and the condition(s).
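As an illustration of (2.25), the following plain-Python sketch computes the first-layer activation for one filter pair and the size of the receptive field. It is a simplified single-filter example under stated assumptions: the helper names are hypothetical, the convolution is indexed as $b+\sum_i w_i x_{t-id}$, and zeros are assumed to the left of both series so that the output at time $t$ (used to forecast $x(t+1)$) depends only on values up to time $t$.

```python
def causal_conv(series, w, d=1, b=0.0):
    """Causal dilated convolution with implicit zero-padding on the left
    (assumed convention): out[t] = b + sum_i w[i] * series[t - i*d]."""
    out = []
    for t in range(len(series)):
        acc = b
        for i, wi in enumerate(w):
            s = t - i * d
            if s >= 0:
                acc += wi * series[s]
        out.append(acc)
    return out


def conditioned_first_layer(x, y, w, v, b=0.0, d=1):
    """ReLU(w *_d x + b) + ReLU(v *_d y + b) for one filter pair (w, v)."""
    cx = causal_conv(x, w, d, b)
    cy = causal_conv(y, v, d, b)
    return [max(0.0, a) + max(0.0, c) for a, c in zip(cx, cy)]


def receptive_field(n_layers, k):
    """Number of past elements seen by the network: 2**(L-1) * k."""
    return 2 ** (n_layers - 1) * k


x = [1.0, -2.0, 3.0]                    # input series
y = [0.0, 1.0, -1.0]                    # condition
out = conditioned_first_layer(x, y, w=[1.0, 1.0], v=[2.0, 0.0])
r = receptive_field(4, 2)               # L = 4 layers, filter width k = 2
```

Setting the second entry of `v` to zero mimics a $1\times 1$ filter on the condition, while a nonzero entry gives the $1\times k$ conditioning used here.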

Instead of the residual connection in the first layer, we add skip connections parameterized by $1\times 1$ convolutions from both the input and the condition to the result of the dilated convolution. The conditioning can easily be extended to a multivariate $M\times N$ time series by using $M$ dilated convolutions from each separate condition and adding them to the convolution with the input. The parameterization of the skip connections makes sure that our model is able to correctly extract the necessary relations between the forecast and both the input and condition(s). Specifically, if a particular condition does not improve the forecast, the model can simply learn to discard this condition by setting the weights in the parameterized skip connection (ie, in the $1\times 1$ convolution) to zero. This enables the conditioning to boost predictions in a discriminative way. If the number of filters ${M}_{l}$ is larger than one, the parameterized skip connection uses a $1\times 1$ convolution with ${M}_{l}$ filters, so that the summation of the skip connection and the original convolution is valid. The network structure is shown in Figure 3.

###### Remark 2.2 (Ability to learn nonlinear dependencies).

We remark here on the ability of the model to learn nonlinear dependencies in and between time series. A feedforward neural network requires at least a single hidden layer with a sufficiently large number of hidden units in order to approximate a nonlinear function (Hornik 1991). If we set the filter width to one in the CNN, a necessary requirement for the model to learn nonlinear dependencies will be to have ${M}_{l}>1$, since in this case the role of the filters is similar to that of the hidden units. Alternatively, learning nonlinearities in a CNN requires the use of both a filter width and number of layers larger than one. Each layer essentially computes a dot product and a summation of a nonlinear transformation of several outputs in the previous layer. This output is in turn a combination of the input and condition(s), and the role of the hidden units is played by the summation over the filter width, thereby allowing nonlinear relations to be learned in and between the time series.

## 3 Experiments

Here, we evaluate the performance of the proposed WaveNet architecture versus current state-of-the-art models (RNNs and autoregressive models) when applied to learning dependencies in chaotic, nonlinear time series. The parameters in the model, unless otherwise stated, are set to $k=2$, $L=4$, ${M}_{l}=1$ for $l=0,\mathrm{\dots},L-1$; the Adam learning rate is set to 0.001; and the number of training iterations is 20 000. The regularization rate is chosen to be 0.001. We train networks with different random seeds, discard any network that already underperforms on the training set, and report the average results on the test set over three selected networks.

### 3.1 An artificial example

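Table 1: One-step-ahead forecasting RMSE for the Lorenz coordinates with the uWN and the cWN (standard deviation in parentheses).

Coordinate | RMSE uWN | RMSE cWN |
---|---|---|
$X$ | 0.00577 (0.00242) | 0.00174 (0.00133) |
$Y$ | 0.00864 (0.00487) | 0.00583 (0.00350) |
$Z$ | 0.00496 (0.00363) | 0.00536 (0.00158) |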

In order to demonstrate the ability of the model to learn both linear and nonlinear dependencies in and between time series, we train and test the model on the chaotic Lorenz system. The Lorenz map is defined as the solution $(X,Y,Z)$ to a system of ordinary differential equations given by

$$\dot{X}=\sigma (Y-X), \tag{3.1}$$

$$\dot{Y}=X(\rho -Z)-Y, \tag{3.2}$$

$$\dot{Z}=XY-\beta Z \tag{3.3}$$

with initial values $({X}_{0},{Y}_{0},{Z}_{0})$. We approximate the solution using an Euler method. In Table 1, we present the one-step-ahead forecasting results for each of the three coordinates $(X,Y,Z)$ with the unconditional WaveNet (uWN) and the conditional WaveNet (cWN). In the cWN, the forecast of, for example, ${\widehat{X}}_{t}$ is based on ${X}_{t-1},\dots,{X}_{t-1-r}$, ${Y}_{t-1},\dots,{Y}_{t-1-r}$ and ${Z}_{t-1},\dots,{Z}_{t-1-r}$. We use a training time series of length 1000, ie, ${({X}_{t})}_{t=1}^{1000}$, ${({Y}_{t})}_{t=1}^{1000}$ and ${({Z}_{t})}_{t=1}^{1000}$. Then, we perform a one-step-ahead forecast of ${X}_{t}$, ${Y}_{t}$ and ${Z}_{t}$ for $t=1000,\dots,1500$ and compare the forecasted series ${\widehat{X}}_{t}$, ${\widehat{Y}}_{t}$ and ${\widehat{Z}}_{t}$ with the true series. The root mean square error (RMSE) is computed over this test set. Comparing the results of the uWN with the RMSE benchmark of 0.00675 from Hsu (2017) obtained with an augmented LSTM, we conclude that the network is very capable of extracting both linear and nonlinear relationships in and between time series. At the same time, conditioning on other related time series makes the forecast error less variable: the standard deviation of the RMSE is smaller for the cWN than for the uWN for all three coordinates.
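The data generation and the error measure can be sketched as follows. This is an illustrative example only: the step size `dt` and the parameter values $\sigma=10$, $\rho=28$, $\beta=8/3$ are the classical choices for the Lorenz system and are assumptions here, since the text does not state the values used. The code uses the standard form of the system, with $-\beta Z$ in the third equation.

```python
import math


def lorenz_euler(x0, y0, z0, n_steps, dt=0.01,
                 sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Explicit Euler discretization of the Lorenz system.
    dt, sigma, rho and beta are assumed values, not taken from the paper."""
    xs, ys, zs = [x0], [y0], [z0]
    for _ in range(n_steps):
        x, y, z = xs[-1], ys[-1], zs[-1]
        xs.append(x + dt * sigma * (y - x))
        ys.append(y + dt * (x * (rho - z) - y))
        zs.append(z + dt * (x * y - beta * z))
    return xs, ys, zs


def rmse(forecast, actual):
    """Root mean square error between two equally long series."""
    n = len(actual)
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / n)


# 1500 steps: the first 1000 points could serve as training data and the
# remainder as the one-step-ahead test set, mirroring the split in the text.
xs, ys, zs = lorenz_euler(1.0, 1.0, 1.0, n_steps=1500)
train_x, test_x = xs[:1000], xs[1000:]
```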

In Figure 4, we show the forecast of the $X$-coordinate in more detail. As can be seen from both the forecast and the histogram of the error, the cWN results in a more precise forecast. Further, the learning rate of 0.001, while resulting in slower initial convergence, is much more effective at obtaining the minimum training error, both unconditionally and conditionally. Figure 5 shows the out-of-sample forecast of the uWN and the cWN for $X$ and $Y$. Conditioning allows the network to learn the true underlying dynamics of the system, resulting in a much better out-of-sample forecast. From the RMSE in Table 1, we see that the conditioning does not improve the accuracy of the one-step-ahead forecast in the case of $Z$ (as the forecast might already unconditionally be very accurate); however, from the out-of-sample forecast plots in Figure 5 and from Figure 6(b), we can conclude that conditioning is necessary in order to learn the underlying nonlinear and linear dynamics in between the series. Further, with Figure 6 we verify Remark 2.2 numerically. Using only one filter and a filter width of one does not allow the nonlinear dependencies to be learned correctly, while using either a filter width or the number of filters larger than one significantly improves the out-of-sample forecast. Unfortunately, using both $k=2$ and ${M}_{l}=3$ results in the forecast performance worsening, since the combination of the wide receptive field and a large number of parameters results in the model being unable to find the optimal weights.

### 3.2 Financial data

We analyze the performance of the network on the S&P 500 data in combination with the volatility index and the CBOE ten-year interest rate to analyze the ability of the model to extract – both unconditionally and conditionally – meaningful trends and relationships in and between the noisy data sets. Further, we test the performance on several exchange rates to showcase the ability of the model to efficiently learn long-term dependencies.

#### 3.2.1 Data preparation

We define a training period of 750 days (approximately three years) and a testing period of 250 days (approximately one year), on which we perform the one-day-ahead forecasting. The data from January 1, 2005 until December 31, 2016 is split into nine of these periods with nonoverlapping testing periods. Let ${P}_{t}^{s}$ be the value of time series $s$ at time $t$. We define the return for $s$ at time $t$ over a one-day period as

$${R}_{t}^{s}=\frac{{P}_{t}^{s}-{P}_{t-1}^{s}}{{P}_{t-1}^{s}}. \tag{3.4}$$

Then, we normalize the returns by subtracting the mean, ${\mu}_{\mathrm{train}}$, and dividing by the standard deviation, ${\sigma}_{\mathrm{train}}$, obtained over all the time series that we will condition on in the training period (note that using the mean and standard deviation over the train and test set would result in look-ahead biases). The normalized return is then given by

$${\tilde{R}}_{t}^{s}=\frac{{R}_{t}^{s}-{\mu}_{\mathrm{train}}}{{\sigma}_{\mathrm{train}}}. \tag{3.5}$$
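The preparation steps (3.4) and (3.5) can be sketched in plain Python as below. The function names and the dictionary layout are hypothetical, the population standard deviation is an assumption (the text does not say whether the sample or population form is used), and `rolling_splits` is one possible way of producing 750/250 windows with nonoverlapping test periods.

```python
import statistics


def simple_returns(prices):
    """One-period returns R_t = (P_t - P_{t-1}) / P_{t-1}, as in (3.4)."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices[:-1], prices[1:])]


def normalize(returns_by_series, n_train):
    """Normalize every series with the mean and standard deviation pooled
    over the training part of all series, as in (3.5). Using only training
    data avoids look-ahead bias."""
    pooled = [r for rs in returns_by_series.values() for r in rs[:n_train]]
    mu = statistics.fmean(pooled)
    sigma = statistics.pstdev(pooled)   # assumption: population std
    scaled = {name: [(r - mu) / sigma for r in rs]
              for name, rs in returns_by_series.items()}
    return scaled, mu, sigma


def rolling_splits(n_obs, n_train=750, n_test=250):
    """(train, test) index ranges whose test windows do not overlap."""
    splits = []
    for start in range(0, n_obs - (n_train + n_test) + 1, n_test):
        train = (start, start + n_train)
        test = (start + n_train, start + n_train + n_test)
        splits.append((train, test))
    return splits


returns = simple_returns([100.0, 110.0, 99.0])
scaled, mu, sigma = normalize({"a": [1.0, 3.0, 100.0]}, n_train=2)
```

Note that the third value in the usage example lies outside the training window and therefore does not influence `mu` or `sigma`.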

We then divide the testing periods into three main study periods: period A from 2008 until 2010, period B from 2011 until 2013, and period C from 2014 until 2016. The performance is then evaluated by performing one-step-ahead forecasts over these testing periods and comparing the mean absolute scaled error (MASE) scaled by a naive forecast and the hit rate. A MASE smaller than one means that the absolute size of the forecasted return is more accurate than that of a naive forecast, while a high hit rate shows that the model is able to correctly forecast the direction of the returns.
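The two evaluation measures can be written down as in the sketch below. The exact definitions are not spelled out in the text, so the following are assumptions: the MASE is taken as the mean absolute error of the model divided by that of a naive forecast supplied by the caller, and a hit is counted when forecast and realized return have the same strict sign. In the usage example the naive forecast is the previous observation, which is also an assumption.

```python
def mase(forecast, actual, naive):
    """Mean absolute error of the model scaled by that of a naive forecast.
    A value below one means the model beats the naive forecast."""
    n = len(actual)
    mae_model = sum(abs(f - a) for f, a in zip(forecast, actual)) / n
    mae_naive = sum(abs(b - a) for b, a in zip(naive, actual)) / n
    return mae_model / mae_naive


def hit_rate(forecast, actual):
    """Fraction of time steps where forecast and actual share the same sign
    (zero values are counted as misses; this tie rule is an assumption)."""
    hits = sum(1 for f, a in zip(forecast, actual) if f * a > 0)
    return hits / len(actual)


actual = [1.0, -1.0, 2.0, -2.0]
forecast = [0.5, -0.5, 1.0, 1.0]
naive = [0.0, 1.0, -1.0, 2.0]           # previous observation, starting from 0
score = mase(forecast, actual, naive)
hits = hit_rate(forecast, actual)
```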

#### 3.2.2 Benchmark models

We compare the performance of the WaveNet model with two well-known benchmarks: an autoregressive model widely used by econometricians and an LSTM (Hochreiter and Schmidhuber 1997; Fisher and Krauss 2017), which is currently the state-of-the-art when it comes to time series forecasting. As in Fisher and Krauss (2017), the LSTM is implemented using one LSTM layer with twenty-five hidden neurons and a dropout rate of 0.1 followed by a fully connected output layer with one neuron, and we use 500 training epochs. LSTM networks require sequences of input features for training, and we construct the sequences using $r={2}^{L-1}k$ historical time steps so that the receptive field of the WaveNet model is the same as the distance that the LSTM can see into the past. The LSTM is implemented to take as its input a matrix consisting of sequences of all the features (the input and condition(s)), so that its performance can be compared with that of the VAR and the cWN.
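The construction of the LSTM input sequences can be illustrated with a short windowing routine. This is a sketch under assumptions: the function name and the list-of-lists layout are hypothetical, and each row of `features` is taken to hold the input and the condition(s) at one time step. The window length is set to $r={2}^{L-1}k$ so that each sequence covers the same history as the receptive field of the WaveNet model.

```python
def make_windows(features, target, r):
    """Build (sequence, label) pairs: each sequence holds the r feature rows
    preceding time t, and the label is the target value at time t."""
    sequences, labels = [], []
    for t in range(r, len(target)):
        sequences.append(features[t - r:t])
        labels.append(target[t])
    return sequences, labels


L, k = 4, 2
r = 2 ** (L - 1) * k                    # 16 historical time steps

# Toy data: two features per time step (e.g. the input and one condition).
features = [[float(i), 10.0 * i] for i in range(20)]
target = [float(i) for i in range(20)]
sequences, labels = make_windows(features, target, r)
```

Each sequence has shape $r\times$ (number of features), which is the layout typically expected by recurrent layers.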

#### 3.2.3 Results

##### Index forecasting.

We compare the performance of the uWN and the cWN in forecasting the S&P 500, in the cWN case conditioned on both the volatility index and the CBOE ten-year interest rate. Using one filter and multiple layers should enable the model to learn nonlinear trends and dependencies in and between the time series, and in this example we try to verify this numerically. From Table 2, we see that the uWN performs best in terms of MASE. The cWN exploits the correlation between the three time series, resulting in a higher hit rate but a slightly worse MASE compared with the uWN as it is fitted on multiple noisy series. The LSTM also performs similarly to the cWN in terms of the hit rate but results in a higher MASE, meaning that both networks are able to forecast the direction of the returns but that the LSTM is worse at predicting the size of the return. The WaveNet model outperforms the VAR model conditionally in period A in terms of the hit rate, showcasing the ability of the model to learn relationships that are more complex than simple linear dependencies, if these are present. After 2010, the dependencies between the S&P 500 and the interest rate and volatility index seem to have weakened (due to, for example, the lower interest rate or higher spreads) as the improvement of the cWN over the uWN is smaller. This suggests that the WaveNet can be used to recognize these switches in the underlying financial regimes. Overall, in terms of the hit rate, the WaveNet performs similarly to the state-of-the-art LSTM, particularly in period A, when strong dependencies were still present between the index, the interest rate and the volatility. In the other two periods, the performance of the cWN in terms of the hit rate is similar to those of the naive model and the autoregressive forecasting model, from which we infer that there are no longer strong dependencies between the time series. 
Further, the good performance of the naive model in periods B and C can be explained by the fact that it implicitly uses the knowledge that the period after the financial crisis was a bull market with a rising price trend. From these results, we can conclude that the WaveNet is indeed able to recognize patterns in the underlying data sets, if they are present. If they are not, then the WaveNet model does not overfit on the noise in the series, as can be seen by the consistently lower MASE compared with the other models.

Table 2: MASE and hit rate (HITS) for the one-day-ahead forecast of the S&P 500 in periods A, B and C.

Model | MASE (A) | HITS (A) | MASE (B) | HITS (B) | MASE (C) | HITS (C) |
---|---|---|---|---|---|---|
Naive | 1 | 0.513 | 1 | 0.504 | 1 | 0.555 |
VAR | 0.698 | 0.507 | 0.701 | 0.505 | 0.696 | 0.551 |
LSTM | 0.873 (0.026) | 0.525 (0.006) | 1.067 (0.021) | 0.496 (0.016) | 0.929 (0.021) | 0.531 (0.008) |
uWN | 0.685 (0.025) | 0.515 (0.007) | 0.681 (0.002) | 0.484 (0.007) | 0.684 (0.006) | 0.537 (0.011) |
cWN | 0.699 (0.042) | 0.524 (0.009) | 0.693 (0.014) | 0.500 (0.009) | 0.701 (0.015) | 0.536 (0.016) |

Table 3: Statistical analysis of the exchange rates in periods A, B and C.

Exchange rate | Mean return (A) | Mean return (B) | Mean return (C) | Standard deviation (A) | Standard deviation (B) | Standard deviation (C) | Skewness (A) | Skewness (B) | Skewness (C) | Kurtosis (A) | Kurtosis (B) | Kurtosis (C) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
EURUSD | $-$0.022 | 0.017 | $-$0.061 | 1.751 | 0.572 | 0.942 | 0.864 | 0.165 | 0.090 | 25.91 | 1.706 | 1.910 |
EURJPY | $-$0.050 | 0.053 | $-$0.049 | 2.867 | 0.806 | 1.049 | 1.448 | 0.080 | $-$0.685 | 43.96 | 1.246 | 5.556 |
GBPJPY | $-$0.110 | 0.048 | $-$0.044 | 2.067 | 0.686 | 1.209 | $-$0.009 | 0.387 | $-$1.129 | 17.23 | 2.743 | 13.45 |
EURGBP | 0.045 | 0.018 | $-$0.022 | 1.623 | 0.436 | 0.915 | 1.092 | $-$0.229 | 0.444 | 26.60 | 0.606 | 4.919 |
GBPUSD | $-$0.073 | 0.012 | $-$0.058 | 0.975 | 0.453 | 0.895 | $-$0.257 | 0.039 | $-$1.289 | 1.616 | 0.591 | 15.06 |

Table 4: MASE of the one-day-ahead exchange rate forecasts, conditional on the other exchange rates.

Model | Period | EURUSD | EURJPY | GBPJPY | EURGBP | GBPUSD |
---|---|---|---|---|---|---|
VAR | A | 1.105 | 1.176 | 1.446 | 1.348 | 1.832 |
VAR | B | 0.758 | 0.782 | 0.756 | 0.768 | 0.731 |
VAR | C | 0.716 | 0.738 | 0.737 | 0.709 | 0.713 |
LSTM | A | 0.829 (0.012) | 0.863 (0.005) | 0.880 (0.004) | 0.868 (0.005) | 0.893 (0.007) |
LSTM | B | 0.925 (0.024) | 0.911 (0.029) | 0.974 (0.029) | 0.948 (0.023) | 0.934 (0.014) |
LSTM | C | 0.950 (0.016) | 1.031 (0.022) | 0.980 (0.034) | 0.839 (0.034) | 0.898 (0.017) |
cWN | A | 0.693 (0.016) | 0.667 (0.021) | 0.759 (0.064) | 0.728 (0.014) | 0.834 (0.089) |
cWN | B | 0.690 (0.006) | 0.693 (0.006) | 0.699 (0.005) | 0.717 (0.015) | 0.710 (0.009) |
cWN | C | 0.702 (0.009) | 0.716 (0.029) | 0.721 (0.014) | 0.709 (0.004) | 0.716 (0.004) |

##### Exchange rate data.

Next, we analyze the performance of the cWN on several exchange rates in order to compare its ability to discriminate between multiple inputs as well as learn long-term dependencies with that of the VAR and LSTM models. We present a statistical analysis of the exchange rates in Table 3. Of particular relevance to the performance of the model are the standard deviation, skewness and kurtosis. A high standard deviation means that there is a lot of variance in the data. This could cause models to underperform as they become unable to accurately forecast rapid movements. A high positive or negative skewness (that is, the asymmetry of the returns around their mean value) indicates the existence of a long right or left tail, respectively. We train the neural network to fit a symmetric distribution centered at the mean of the data set. The existence of this tail could result in the trained model performing worse in cases of high absolute skewness. Kurtosis is a measure of the tails of the data set compared with those of a normal distribution. High kurtosis is the result of infrequent extreme deviations. If a model tends to overfit the data set, and in particular to overfit these extreme deviations, high kurtosis would result in a worse performance. Figure 7 shows the correlations between the exchange rates in the three periods. As expected, the exchange rates that contain the same currencies exhibit stronger correlations than those with different currencies.

In Table 4, we present the results of the cWN forecast over the exchange rate data, conditional on the other exchange rates. Exchange rate data tends to contain long-term dependencies, so we expect the WaveNet model, with its ability to learn long-term relationships, to perform well. As we see from the table, the WaveNet model consistently outperforms the vector-autoregressive model and the LSTM in terms of the MASE. In period A, the data has very high kurtosis, probably due to the global financial crisis that was happening in 2008. Remarkably, we note that while the autoregressive model during this period of very high kurtosis performs worse than a naive forecast, the WaveNet model does not overfit the extremes, resulting in good performance in terms of the MASE. In periods of high absolute skewness and high standard deviation, but relatively low kurtosis, eg, period C, the WaveNet model and the autoregressive model seem to be performing more or less equally. In period B, we observe a relatively low standard deviation, low kurtosis and a small absolute skewness. In this period, the WaveNet model is better able to extract the underlying dynamics compared with both the autoregressive model and the LSTM. We conclude that the WaveNet model is indeed able to extract long-term relationships, if present. In periods of high kurtosis, it is still able to generalize well, while when the data has a high standard deviation and a high absolute skewness, ie, in situations with many outliers, the model is unable to correctly forecast these outliers, causing the performance to be similar to that of a linear autoregressive model. Further, as we see from Figure 7, some pairs of exchange rates have lower correlations than others. 
While the autoregressive model, when it has both correlated and uncorrelated time series as input, tends to overfit, the WaveNet model is better able to discriminate between the conditions by simply discarding those that do not improve the forecast, as can be seen from the consistently lower MASE.

## 4 Discussion and conclusion

In this paper, we presented and analyzed the performance of a method for conditional time series forecasting based on a CNN known as the WaveNet architecture (van den Oord et al 2016a). The network makes use of layers of dilated convolutions applied to the input and multiple conditions, thereby learning the trends and relations in and between the data. We analyzed the performance of the WaveNet model on various time series and compared its performance with the current state-of-the-art method in time series forecasting, the LSTM model, as well as a linear autoregressive model. We conclude that even though time series forecasting remains a complex task and finding one model that fits all is hard, we have shown that the WaveNet model is a simple, efficient and easily interpretable network that can act as a strong baseline for forecasting. Nevertheless, there is still room for improvement. One way of improving the ability of the CNN to learn nonlinear dependencies is to use a large number of layers and filters. As we saw from Figure 6, one encounters the problem of a trade-off between the ability to learn nonlinearities, which requires a large number of layers and filters, and that of overfitting, since a large number of layers results in a large receptive field and many parameters. This problem of the imbalance between the need for memory and the nonlinearities was also addressed in Bińkowski et al (2017) by using a combination of an autoregressive model and a CNN. An alternative solution to this problem might be to use the parameterized skip connections in combination with an adaptive filter; this will be studied in our future work. Further, the WaveNet model proved to be a strong competitor to LSTM models, particularly when taking into consideration the training time. 
While on relatively short time series the prediction time is negligible compared with the training time, for longer time series the prediction of the autoregressive model may be sped up by implementing a recent variation that exploits the memorization structure of the network (see Ramachandran et al 2017) or by speeding up the convolutions by working in the frequency domain employing Fourier transforms, as in Mathieu et al (2013) and Rippel et al (2015). Finally, it is well known that correlations between data points are stronger on an intraday basis. It might therefore be interesting to test the model on intraday data to see if the ability of the model to learn long-term dependencies is even more valuable in that case.

## Declaration of interest

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

## Acknowledgements

This research was supported by the European Union in the context of the H2020 EU Marie Curie Initial Training Network project named WAKEUPCALL. We thank an anonymous referee for constructive comments that improved the quality of the paper.

## References

- Aussem, A., and Murtagh, F. (1997). Combining neural network forecasts on wavelet-transformed time series. Connection Science 9(1), 113–122 (https://doi.org/10.1080/095400997116766).
- Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166 (https://doi.org/10.1109/72.279181).
- Bińkowski, M., Marti, G., and Donnat, P. (2017). Autoregressive convolutional neural networks for asynchronous time series. Lecture, August 11, International Conference on Machine Learning 2017, Time Series Workshop.
- Chakraborty, K., Mehrotra, K., Mohan, C. K., and Ranka, S. (1992). Forecasting the behavior of multivariate time series using neural networks. Neural Networks 5(6), 961–970 (https://doi.org/10.1016/S0893-6080(05)80092-9).
- Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint (arXiv:1412.3555).
- Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance 1, 223–236 (https://doi.org/10.1080/713665670).
- Fisher, T., and Krauss, C. (2017). Deep learning with long short-term memory networks for financial market predictions. FAU Discussion Paper in Economics 11/2017, Friedrich-Alexander University Erlangen-Nuremberg.
- Glorot, X., and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 249–256. Proceedings of Machine Learning Research, Volume 9. PMLR Press, Cambridge, MA.
- Hamilton, J. D. (1994). Time Series Analysis, Volume 2. Princeton University Press, Princeton, NJ.
- He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 1026–1034. IEEE Press, Piscataway, NJ (https://doi.org/10.1109/ICCV.2015.123).
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE Press, Piscataway, NJ (https://doi.org/10.1109/CVPR.2016.90).
- Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8), 1735–1780 (https://doi.org/10.1162/neco.1997.9.8.1735).
- Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251–257 (https://doi.org/10.1016/0893-6080(91)90009-T).
- Hsu, D. (2017). Time series forecasting based on augmented long short-term memory. Preprint (arXiv:1707.00666).
- Kingma, D. P., and Ba, J. (2014). Adam: a method for stochastic optimization. Preprint (arXiv:1412.6980).
- Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25(2), 1097–1105.
- Lahmiri, S. (2014). Wavelet low- and high-frequency components as features for predicting stock prices with backpropagation neural networks. Journal of King Saud University – Computer and Information Sciences 26(2), 218–227 (https://doi.org/10.1016/j.jksuci.2013.12.001).
- LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (https://doi.org/10.1109/5.726791).
- Mathieu, M., Henaff, M., and LeCun, Y. (2013). Fast training of convolutional networks through FFTs. Preprint (arXiv:1312.5851).
- Mittelman, R. (2015). Time-series modeling with undecimated fully convolutional neural networks. Preprint (arXiv:1508.00317).
- Ramachandran, P., Paine, T. L., Khorrami, P., Babaeizadeh, M., Chang, S., Zhang, Y., Hasegawa-Johnson, M. A., Campbell, R. H., and Huang, T. S. (2017). Fast generation for convolutional autoregressive models. Preprint (arXiv:1704.06001).
- Rippel, O., Snoek, J., and Adams, R. P. (2015). Spectral representations for convolutional neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pp. 2449–2457. MIT Press, Cambridge, MA.
- van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016a). WaveNet: a generative model for raw audio. Preprint (arXiv:1609.03499).
- van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016b). Pixel recurrent neural networks. Preprint (arXiv:1601.06759).
- van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. (2016c). Conditional image generation with PixelCNN decoders. Preprint (arXiv:1606.05328).
- Wang, Z., Yan, W., and Oates, T. (2016). Time series classification from scratch with deep neural networks: a strong baseline. Preprint (arXiv:1611.06455).
- Yu, F., and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. Preprint (arXiv:1511.07122).
- Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50, 159–175 (https://doi.org/10.1016/S0925-2312(01)00702-0).
- Zhang, G. P., Patuwo, B. E., and Hu, M. Y. (1998). Forecasting with artificial neural networks: the state of the art. International Journal of Forecasting 14(1), 35–62 (https://doi.org/10.1016/S0169-2070(97)00044-7).
- Zheng, Y., Liu, Q., Chen, E., Ge, Y., and Zhao, J. (2016). Exploiting multi-channels deep convolutional neural networks for multivariate time series classification. Frontiers of Computer Science 10(1), 96–112 (https://doi.org/10.1007/s11704-015-4478-2).


Copyright Infopro Digital Limited. All rights reserved.
