Journal of Computational Finance
ISSN: 1460-1559 (print), 1755-2850 (online)
Editor-in-chief: Christoph Reisinger
Deep learning for discrete-time hedging in incomplete markets
Need to know
 Several machine learning algorithms are presented for solving hedging problems in incomplete markets with limited availability of the hedging products. Under an MSE criterion, the hedging strategies induced by the algorithms introduced in this paper are shown to be very accurate compared with classical stochastic control techniques on several payoff functions.
 One of the presented algorithms uses a forward pass and manages to deal with problems in dimension up to 4 in a very reasonable computation time.
 Some of the proposed algorithms are flexible enough to deal with downside risk criteria, and the resulting P&L distributions are compared with those obtained under the classical MSE criterion.
 The most efficient algorithm is tested on a case with nonzero transaction costs and we show how to obtain a whole Pareto frontier in a single training phase by randomly combining the criteria of average cost and variance during the learning phase.
Abstract
This paper presents several algorithms based on machine learning to solve hedging problems in incomplete markets. The sources of incompleteness considered here are illiquidity, nontradable risk factors, discrete hedging dates and proportional transaction costs. Hedging strategies suggested by the algorithms introduced in this paper are compared with classical stochastic-control techniques on several payoffs using a mean squared error (MSE) criterion. Some of the proposed algorithms are flexible enough to deal with innovative loss criteria, and the profit and loss (P&L) distributions of the hedging strategies obtained with these new criteria are compared to the P&L distributions obtained with the classical MSE criterion. The most efficient algorithm is tested on a case with nonzero transaction costs, and we show how to obtain a whole Pareto frontier in a single training phase by randomly combining the criteria of average cost and variance during the learning phase.
1 Introduction
Despite its desirable properties, the complete-market assumption is invalidated as soon as we consider factors such as transaction costs, discrete-time hedging dates, illiquidity and nontradable risk factors (eg, volume risk). These factors make the completeness assumption unrealistic in most financial markets, especially when trading in commodity markets. In an incomplete market, the set of nonattainable contingent claims (ie, contingent claims that cannot be replicated by a self-financing strategy) is nonempty, and for these a criterion is needed to decide how to share risks between the seller and the buyer. The literature deals with three families of criteria: quantile hedging, utility functions and moment-based criteria.
Quantile hedging (see Föllmer and Leukert 1999; Bouchard et al 2017) aims to construct hedging strategies that maximize the probability of a successful hedge given a constraint on the required cost. Another possibility offered by quantile hedging is to set a shortfall probability $\epsilon$ and minimize the cost over the class of hedging strategies such that the probability of covering the claim is at least $1-\epsilon$.
Utility-based criteria – more precisely, those based on utility indifference (see Carmona 2008) – are viewed favorably by academics, as they sometimes allow for the generation of analytic prices and hedging strategies. However, this approach is not used in practice, as the associated risk-aversion coefficient is hard to define.
The last family, which we use in this paper, is based on moments of the distribution of the hedged portfolio. The simplest moment-based criterion is the variance criterion, which minimizes the global or local variance of the hedged portfolio (see Schweizer (1999) for an overview in the context of continuous time). However, quadratic criteria penalize losses and gains in the same way. This might be seen as a drawback, but it offers the advantage of giving the same price to both buyers and sellers. Gobet et al (2018) extend the local mean squared criterion by introducing an asymmetry in the loss function that penalizes losses more severely than gains. In the case of a variance or local variance criterion, continuous-time hedging strategies when assets are modeled by Lévy processes are given, for example, in Tankov (2003).
Once the criterion has been chosen, the trading strategy that minimizes it must be computed. Specific methods must be developed to deal with the sources of incompleteness (whether they be illiquidity, transaction costs, nontradable risk factors, etc).
A limited availability of hedging products can be dealt with in two ways. The first approach, taken by Potters and Bouchaud (2003), Gatheral (2010) and Lehalle and Laruelle (2013), is to study the price impact of selling or buying an underlying asset in a given market. Because the size of the impact increases with the volume exchanged, a seller will tend to limit the volume sold at any one time. A second approach consists in assuming that, in practice, risk managers are aware of the liquidity constraints of the markets and try to implement strategies that take these into account. In the case of a global-variance minimization for a hedged portfolio, Warin (2019) developed a number of regression-based algorithms to calculate the hedging strategy while taking all of these liquidity constraints into account.
In the literature, the treatment of transaction costs is connected with discrete hedging. The pioneering work of Leland (1985) proposes the use of the Black–Scholes formula with a modified volatility. Kabanov and Safarian (2009) extend the Leland (1985) model with bounds on the replication error. Toft (1996) uses a mean–variance criterion to analyze the tradeoff between the costs and risks of discretely rebalanced option hedges in the presence of transaction costs.
In general, when no closed-form formula for the optimal hedging strategy is available, we use stochastic dynamic-programming algorithms, which suffer from the curse of dimensionality. To the best of our knowledge, no algorithm yet exists that derives the optimal strategy under arbitrary criteria together with liquidity constraints and transaction costs while remaining robust in high dimensions.
In this article we propose a number of machine learning algorithms to derive optimal hedging strategies.
 •
The first set of algorithms tries to calculate hedging positions by solving a global risk minimization problem. The hedging strategies are calculated using different types of architectures. The most efficient architecture is easy to implement and can be used with liquidity constraints, general risk criteria and transaction costs. The algorithm that uses this architecture is fast enough to be used in high dimensions. This approach is directly linked to the algorithm proposed by E et al (2017) that uses a global optimization problem to solve semilinear partial differential equations (PDEs) by controlling the $z$ term in the backward stochastic differential equation approach.
 •
The second and third algorithms are machine learning versions of the two algorithms described in Warin (2019), and can only be used with a variance criterion. In these algorithms, a dynamic-programming method is used, and some minimization problems are solved at each time step in order to calculate the optimal hedging strategy. This approach is based on a succession of local optimizations, and this type of approach has proven to be more effective than global optimization in the resolution of nonlinear PDEs (Huré et al 2019; Beck et al 2017; Germain et al 2020).
We first describe the hedging problem, define the price model and present several loss functions used for the experiments. After detailing the different algorithms used, we focus on the variance criterion and compare the results obtained by the different algorithms for options involving a variable number of risk factors, be they tradable or not. As a reference, we use calculations performed on high-performance computers by the StOpt library (Gevret et al 2016) using Algorithm 2 in Warin (2019). We then train the first algorithm with the different risk criteria already mentioned, and discuss the impact of these criteria on the distribution of the hedged portfolio. Finally, we introduce transaction costs and show how to estimate a Pareto frontier by training the algorithm with random combinations of mean and variance targets.
The main results of the paper are the following.
 •
We show that it is possible to use deep neural network algorithms for reasonably realistic option-hedging problems (eg, discrete time, unhedgeable factors, limited liquidity, transaction costs and general risk criteria).
 •
We compare global and local neural network architectures for the case of a variance criterion and find that, unlike in the case of PDE resolution, here the global approach appears to be more effective than local minimization, as it often provides more accurate results in far less computation time. The local minimization is often trapped in local minimums: it only provides near-optimal solutions, even if the algorithm is run several times. This difference in behavior compared with the context of semilinear PDE resolution is certainly related to the fact that, in PDE resolution, the direct process used as the state is not controlled. In our case, part of the state is controlled, and we need to randomly sample it using an a priori law. This seems to be the key point in explaining the better performance of the global approach. The reference calculations, obtained by dynamic programming and regressions, can only be performed in lower dimensions, but we expect the results would still be very good in higher dimensions.
 •
Using the global algorithm, we are able to efficiently solve some hedging problems that are out of reach for classical dynamic-programming methods due to the structure of the risk function used. For example, this global approach allows the computation of a Pareto-efficient frontier.
2 Problem description
In the numerical tests, we retain the price modeling used in Warin (2019). A short description is given in Section 2.1, and we refer the reader to the original paper for further details.
2.1 Risk factor modeling
We are given a financial market operating in continuous time. We begin with a probability space $(\mathrm{\Omega},\mathcal{F},\mathbb{P})$, a time horizon $T>0$ and a filtration $\mathcal{F}=({\mathcal{F}}_{t})$, $0\le t\le T$, with ${\mathcal{F}}_{t}$ representing the information available at time $t$. We consider $d+1$ assets, ${\widehat{F}}^{0},\mathrm{\dots},{\widehat{F}}^{d}$, available for trade. For simplicity, we suppose an interest rate of zero and we assume that there exists a risk-free asset ${\widehat{F}}^{0}$ having a strictly positive price. We then use ${\widehat{F}}^{0}$ as a numeraire and immediately move on to quantities discounted with ${\widehat{F}}^{0}$. We denote these discounted quantities by ${F}^{i}={\widehat{F}}^{i}/{\widehat{F}}^{0}$, $i=1,\mathrm{\dots},d$, and denote by $F$ the vector having the ${({F}^{i})}_{i=1,\mathrm{\dots},d}$ as coordinates. We consider another, nontradable risk factor (the volume risk), denoted by $D$.
The evolutions of the ${({F}^{i})}_{i=1,\mathrm{\dots},d}$ and $D$ are described by diffusion processes having values in ${\mathbb{R}}^{d}$ and $\mathbb{R}$, respectively.
More precisely, the volume risk ${D}_{t}$ is stochastic and, for $t\ge u\ge 0$, follows the dynamics
$${D}_{t}={\widehat{D}}_{t}+({D}_{u}-{\widehat{D}}_{u}){\mathrm{e}}^{-{a}_{D}(t-u)}+{\int}_{u}^{t}{\sigma}_{D}{\mathrm{e}}^{-{a}_{D}(t-s)}\,\mathrm{d}{W}_{s}^{D},$$  (2.1) 
where ${a}_{D}$ is the mean-reverting coefficient, ${\sigma}_{D}\ge 0$ is the volatility and ${W}_{t}^{D}$ is a Brownian motion on $(\mathrm{\Omega},\mathcal{F},\mathbb{P})$. ${\widehat{D}}_{u}$ is the average load seen in previous years at the given date $u\ge 0$. This mean-reverting model is generally used to represent the load dynamics of electricity contracts. We suppose that for $i=1,\mathrm{\dots},d$ the prices are martingales and follow the dynamics
${F}_{t}^{i}$  $={F}_{0}^{i}\mathrm{exp}\left(-{({\sigma}_{i,E})}^{2}{\displaystyle \frac{{\mathrm{e}}^{-2{a}_{i,E}(T-t)}-{\mathrm{e}}^{-2{a}_{i,E}T}}{4{a}_{i,E}}}+{\mathrm{e}}^{-{a}_{i,E}(T-t)}{\widehat{W}}_{t}^{i,E}\right),$  
${\widehat{W}}_{t}^{i,E}$  $={\sigma}_{i,E}{\displaystyle {\int}_{0}^{t}}{\mathrm{e}}^{-{a}_{i,E}(t-s)}\,\mathrm{d}{W}_{s}^{i},$  (2.2) 
where ${F}_{t}^{i}$ represents the forward price at time $t$ for delivery at date $T$ ($T$ corresponds to the maturity of the considered contracts and is known a priori), ${a}_{i,E}$ is the mean-reverting parameter for risk factor $i$, ${\sigma}_{i,E}$ is the volatility parameter for risk factor $i$ and ${W}_{s}^{i}$ is a Brownian motion on $(\mathrm{\Omega},\mathcal{F},\mathbb{P})$, such that the ${W}_{t}^{i}$ are correlated with each other and with ${W}_{t}^{D}$. We will denote by ${S}_{t}$ the vector $({F}_{t}^{1},\mathrm{\dots},{F}_{t}^{d},{D}_{t})$.
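For concreteness, both dynamics can be simulated exactly on a discrete grid. The sketch below (function names and parameter values are ours, not the paper's) uses the exact one-step transition of (2.1) and evaluates (2.2) directly; the drift term in `forward_price` equals minus half the variance of the Gaussian exponent, which is what makes the discounted price a martingale.

```python
import math
import random

def simulate_volume(D_hat, D0, a, sigma, dt, rng=None):
    """Exact simulation of the volume factor (2.1) on the grid t_k = k*dt,
    given the seasonal mean profile D_hat (one value per grid date)."""
    rng = rng or random.Random(0)
    # standard deviation of int_u^t sigma e^{-a(t-s)} dW_s over one step
    std = sigma * math.sqrt((1.0 - math.exp(-2.0 * a * dt)) / (2.0 * a))
    D = [D0]
    for k in range(1, len(D_hat)):
        # the deviation from the seasonal mean decays at rate a
        mean = D_hat[k] + (D[-1] - D_hat[k - 1]) * math.exp(-a * dt)
        D.append(mean + std * rng.gauss(0.0, 1.0))
    return D

def forward_price(F0, sigma, a, T, t, W_hat):
    """Forward price (2.2) given the Gaussian factor W_hat at time t."""
    drift = -sigma ** 2 * (math.exp(-2 * a * (T - t))
                           - math.exp(-2 * a * T)) / (4 * a)
    return F0 * math.exp(drift + math.exp(-a * (T - t)) * W_hat)
```

With `sigma = 0` the volume path reduces to the deterministic mean-reverting decay toward the seasonal profile, and at $t=0$ (where the Gaussian factor vanishes) `forward_price` returns `F0`, as the martingale property requires.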
2.2 The hedging problem
We consider the hedging problem of a contingent claim paying $g({S}_{T})$ at time $T$, where ${S}_{T}$ denotes the vector of underlying values of the contingent claim. Without loss of generality, in the following we consider ourselves the derivative seller. We consider a finite set of hedging dates, $0={t}_{0}<{t}_{1}<\cdots<{t}_{N}=T$. The discrete hedging dates introduce the first source of incompleteness. At each date, each of the discounted assets ${F}^{i}$ can only be bought or sold in a finite quantity ${l}^{i}$, giving a second source of incompleteness. The volume risk ${D}_{t}$ cannot be traded and is the third source of incompleteness. A self-financing portfolio is a $d$-dimensional $({\mathcal{F}}_{t})$-adapted process ${\mathrm{\Delta}}_{t}$. Its terminal value at time $T$ is denoted by ${X}_{T}^{\mathrm{\Delta}}$ and satisfies
${X}_{T}^{\mathrm{\Delta}}$  $=p+{\displaystyle \sum _{i=1}^{d}}{\displaystyle \sum _{j=0}^{N-1}}{\mathrm{\Delta}}_{{t}_{j}}^{i}({F}_{{t}_{j+1}}^{i}-{F}_{{t}_{j}}^{i}),$ 
where $p$ denotes the premium. Between two time steps, the change in ${\mathrm{\Delta}}^{i}$, corresponding to the buy or sell command ${C}_{j+1}^{i}:={\mathrm{\Delta}}_{{t}_{j+1}}^{i}-{\mathrm{\Delta}}_{{t}_{j}}^{i}$, should not exceed the liquidity ${l}^{i}$ in absolute value, so that
$|{\mathrm{\Delta}}_{{t}_{0}}^{i}|\le {l}^{i},\quad |{C}_{j}^{i}|\le {l}^{i},\quad j=1,\mathrm{\dots},N-1,\ i=1,\mathrm{\dots},d.$ 
Given a loss function $L$, and defining ${Y}_{T}={X}_{T}^{\mathrm{\Delta}}-g({S}_{T})$, we search for an optimal strategy that verifies
$$({p}^{\mathrm{Opt}},{\mathrm{\Delta}}^{\mathrm{Opt}})=\underset{p,\mathrm{\Delta}}{\mathrm{arg}\,\mathrm{min}}\,L({X}_{T}^{\mathrm{\Delta}}-g({S}_{T}))=\underset{p,\mathrm{\Delta}}{\mathrm{arg}\,\mathrm{min}}\,L({Y}_{T}).$$  (2.3) 
We will focus on the following loss functions.
 Mean squared error.

This is defined by
$$L(Y)=\mathbb{E}[{Y}^{2}].$$ (2.4) This function has been intensively studied (see, for example, Schweizer 1999). It has the drawback of penalizing losses and gains in the same way. This can also be seen as an advantage, as it gives the same value and strategy for both buyer and seller.
 Asymmetrical loss.

This is defined by
$${L}^{\alpha}(Y)=\mathbb{E}[(1+\alpha ){Y}^{2}{\mathbb{1}}_{Y\le 0}+{Y}^{2}{\mathbb{1}}_{Y\ge 0}].$$ (2.5) Under this function, losses are penalized when $\alpha >0$ and gains are penalized when $\alpha <0$. This function has been studied, for example, in Gobet et al (2018).
 Loss moment 2/moment 4 function.

This is defined by
$${L}^{\alpha}(Y)=\mathbb{E}[{Y}^{2}{\mathbb{1}}_{Y\ge 0}]+\alpha \mathbb{E}[{Y}^{4}{\mathbb{1}}_{Y\le 0}],\quad \alpha \ge 1.$$ (2.6) This criterion is designed to penalize heavy tails on the loss side.
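As a minimal sketch, each criterion can be estimated on a sample of terminal P&L values $Y$ by replacing the expectations with sample means (the function names below are ours):

```python
def mse_loss(Y):
    """MSE criterion (2.4): E[Y^2], estimated by a sample mean."""
    return sum(y * y for y in Y) / len(Y)

def asymmetric_loss(Y, alpha):
    """Asymmetric loss (2.5): losses (Y <= 0) weighted by 1 + alpha."""
    return sum(((1.0 + alpha) if y <= 0 else 1.0) * y * y for y in Y) / len(Y)

def moment24_loss(Y, alpha):
    """Moment 2 / moment 4 loss (2.6): quartic penalty on the loss tail."""
    gains = sum(y * y for y in Y if y >= 0) / len(Y)
    losses = sum(y ** 4 for y in Y if y <= 0) / len(Y)
    return gains + alpha * losses
```

With `alpha = 0` the asymmetric loss coincides with the MSE, while increasing `alpha` in (2.6) penalizes large losses at the fourth power.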
3 Some classical neural networks and stochastic gradient algorithms
Deep neural networks are state-of-the-art tools for approximating functions (see Liang 2017). This section presents two common network architectures used in machine learning. The feedforward architecture is the simplest and the most common; it is used, for example, in E et al (2017), Huré et al (2019), Beck et al (2017) and Germain et al (2020) to solve nonlinear PDE problems. The second is a recurrent architecture with memory, which allows for solving non-Markovian problems: it is able to keep track of relevant past variables and use them as input data. As the neural approximation of a function is highly nonlinear, it often leads to a nonconvex optimization problem that needs a specific resolution algorithm, which we will detail later.
3.1 Feedforward neural networks as function approximators
For the local algorithms, we use feedforward neural networks composed of dense layers. Such neural networks are characterized by their depth, ie, the number of layers, which we denote by $L$ ($L\in {\mathbb{N}}^{*}$). If $L>1$, the neural network is considered deep.
Each layer is composed of ${m}_{\mathrm{\ell}}$ neurons (also called units), $\mathrm{\ell}\in \{0,\mathrm{\dots},L-1\}$; ${m}_{\mathrm{\ell}}$ is also the output dimension of the layer. In particular, the width of the output layer corresponds to the output dimension. Each unit applies an affine transformation and an activation function to the output of the previous layer. As a result, the output of a layer is as follows:
$\begin{array}{cc}\hfill {y}_{0}& ={f}_{0}(x)={a}_{0}({\omega}_{0}\cdot x+{\beta}_{0}),\hfill \\ \hfill {y}_{\mathrm{\ell}}& ={f}_{\mathrm{\ell}}({y}_{\mathrm{\ell}-1})={a}_{\mathrm{\ell}}({\omega}_{\mathrm{\ell}}\cdot {y}_{\mathrm{\ell}-1}+{\beta}_{\mathrm{\ell}})\hspace{1em}\forall \mathrm{\ell}\ge 1,\hfill \end{array}\}$  (3.1) 
with ${\omega}_{\mathrm{\ell}}$ being a matrix (referred to as the weights) of ${m}_{\mathrm{\ell}}$ rows and ${m}_{\mathrm{\ell}-1}$ columns, and ${\beta}_{\mathrm{\ell}}$ being a column vector (referred to as the bias) of size ${m}_{\mathrm{\ell}}$. The most common choice of activation function ${a}_{\mathrm{\ell}}$ in intermediate layers is the rectified linear unit (ReLU) function.
Let us denote by ${d}_{0}$ the input dimension and by ${d}_{1}$ the output dimension. The whole neural network is then the composition of its layers:
$\begin{array}{cc}\hfill {\mathbb{R}}^{{d}_{0}}& \mapsto {\mathbb{R}}^{{d}_{1}},\hfill \\ \hfill x& \mapsto {f}_{L-1}\circ \mathrm{\dots}\circ {f}_{0}(x).\hfill \end{array}\}$  (3.2) 
The ${\omega}_{\mathrm{\ell}}$ matrixes and ${\beta}_{\mathrm{\ell}}$ vectors are the parameters of the neural network. From now on, we will denote them by $\theta $, $\theta \in {\mathbb{R}}^{N}$, where
$$N={m}_{0}(1+{d}_{0})+\sum _{\mathrm{\ell}=1}^{L-1}{m}_{\mathrm{\ell}}(1+{m}_{\mathrm{\ell}-1}).$$ 
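The count $N$ can be checked against a direct enumeration of the weight and bias shapes; the two helper functions below are illustrative, not from the paper:

```python
def param_count(d0, widths):
    """Parameter count N from the formula above, for input dimension d0
    and layer widths m_0, ..., m_{L-1}."""
    N = widths[0] * (1 + d0)
    for l in range(1, len(widths)):
        N += widths[l] * (1 + widths[l - 1])
    return N

def param_count_direct(d0, widths):
    """Same count by summing each layer's weight matrix and bias sizes."""
    dims = [d0] + list(widths)
    return sum(dims[l + 1] * dims[l] + dims[l + 1] for l in range(len(widths)))
```

For instance, a network with input dimension 3 and widths [10, 10, 1] has $10\times 4+10\times 11+1\times 11=161$ parameters.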
The universal approximation theorem of Hornik et al (1990) states that the set of single fully connected layers with varying width $m$ is dense in ${L}^{2}(\nu )$ for any finite measure $\nu $ on ${\mathbb{R}}^{d}$, $d>0$, whenever the activation function $a$ is continuous and nonconstant. Deep feedforward networks are thus a class of universal approximators. Consequently, assuming that the optimal control of (2.3) is sufficiently smooth, it can be approximated by a dense feedforward neural network to any desired accuracy, provided we use a sufficient number of units. The universal approximation theorem does not tell us the minimum required depth and width, so empirical tests have to be done to determine the best architecture for the neural network. Nor does the theorem say how to optimize the neural network parameters, but stochastic gradient descent shows good results in many cases.
3.2 Recurrent and classical long short-term memory neural networks as time-dependent-function approximators
A recurrent neural network (RNN) is a neural network in which connections follow a directed graph along a temporal sequence given as an input. This effectively models a dynamic temporal behavior. An RNN takes a sequence ${S}_{t}$, $t\in \{0,\mathrm{\dots},T\}$, as input data and outputs two vectors: a memory-state sequence $M$ and an output-state sequence $C$. This is done by recursively calling a neural network, referred to as the recurrent cell, through all time steps: for each $t$, the recurrent cell takes ${S}_{t}$, ${M}_{t-1}$ and ${C}_{t-1}$ as input data and outputs ${M}_{t}$ and ${C}_{t}$ (see Figure 3).
When a basic RNN cell is called multiple times, a vanishing-gradient phenomenon occurs in the memory state. This prevents this type of cell from capturing long-term dependencies in input sequences. Long short-term memory (LSTM) cells (Hochreiter and Schmidhuber 1997) implement a refined memory mechanism that overcomes this limitation. In an LSTM cell, structures called gates regulate the flow of information contained in the memory state ${M}_{t}$ by adding or removing information. Gates behave like Boolean masks: they are composed of a sigmoid layer and a pointwise multiplication operation. The implementation of an LSTM cell is as follows:
$\begin{array}{cc}\hfill {\mathrm{\Gamma}}_{t}^{\mathrm{f}}& =\sigma ({A}_{\mathrm{f}}{S}_{t}+{U}_{\mathrm{f}}{C}_{t-1}+{b}_{\mathrm{f}}),\hfill \\ \hfill {\mathrm{\Gamma}}_{t}^{\mathrm{i}}& =\sigma ({A}_{\mathrm{i}}{S}_{t}+{U}_{\mathrm{i}}{C}_{t-1}+{b}_{\mathrm{i}}),\hfill \\ \hfill {\mathrm{\Gamma}}_{t}^{\mathrm{o}}& =\sigma ({A}_{\mathrm{o}}{S}_{t}+{U}_{\mathrm{o}}{C}_{t-1}+{b}_{\mathrm{o}}),\hfill \\ \hfill {M}_{t}& ={\mathrm{\Gamma}}_{t}^{\mathrm{f}}\odot {M}_{t-1}+{\mathrm{\Gamma}}_{t}^{\mathrm{i}}\odot \mathrm{tanh}({A}_{M}{S}_{t}+{U}_{M}{C}_{t-1}+{b}_{M}),\quad {M}_{0}=0,\hfill \\ \hfill {C}_{t}& ={\mathrm{\Gamma}}_{t}^{\mathrm{o}}\odot \mathrm{tanh}({M}_{t}),\quad {C}_{0}=0,\hfill \end{array}\}$  (3.3) 
where $\odot $ is the Hadamard product, $\sigma $ is the sigmoid activation function ($\sigma (x)=1/(1+{\mathrm{e}}^{-x})$) and ${A}_{\cdot}\in {\mathbb{R}}^{h\times d}$, ${U}_{\cdot}\in {\mathbb{R}}^{h\times h}$ and ${b}_{\cdot}\in {\mathbb{R}}^{h}$, with $h$ being the cell state size. ${\mathrm{\Gamma}}_{t}^{\mathrm{f}}$ is the forget-gate layer, which masks irrelevant information from ${M}_{t-1}$ to ${M}_{t}$. ${\mathrm{\Gamma}}_{t}^{\mathrm{i}}$ is the input-gate layer, which controls the new information added to ${M}_{t}$. This new information is the output of a layer taking ${S}_{t}$ and ${C}_{t-1}$ as inputs. The output-gate layer ${\mathrm{\Gamma}}_{t}^{\mathrm{o}}$ controls which parts of the memory state ${M}_{t}$ are output in ${C}_{t}$. We use the output ${C}_{t}$ as an approximation of the unknown optimal control. The weight matrixes and bias vectors (${A}_{\cdot},{U}_{\cdot},{b}_{\cdot}$) are shared across all time steps and are updated in the training process. We also denote by $\theta $ the parameters of the LSTM representation (3.3).
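For concreteness, one step of the recursion (3.3) can be written out for small vectors in plain Python (a sketch only; in practice the cell is provided by a deep learning framework). With all weights and biases at zero, every gate equals 1/2 and the update reduces to ${M}_{t}={M}_{t-1}/2$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(S_t, M_prev, C_prev, params):
    """One application of (3.3). params maps 'f', 'i', 'o', 'M' to a
    triple (A, U, b): A is h x d, U is h x h, b has length h."""
    def affine(name):
        A, U, b = params[name]
        return [sum(a * s for a, s in zip(A[r], S_t))
                + sum(u * c for u, c in zip(U[r], C_prev)) + b[r]
                for r in range(len(b))]
    g_f = [sigmoid(x) for x in affine("f")]      # forget gate
    g_i = [sigmoid(x) for x in affine("i")]      # input gate
    g_o = [sigmoid(x) for x in affine("o")]      # output gate
    cand = [math.tanh(x) for x in affine("M")]   # candidate memory
    M_t = [f * m + i * c for f, m, i, c in zip(g_f, M_prev, g_i, cand)]
    C_t = [o * math.tanh(m) for o, m in zip(g_o, M_t)]
    return M_t, C_t
```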
3.3 General optimization algorithm
The training of the neural network is potentially a nonconvex and nonlinear optimization problem. To help prevent halting on a suboptimal solution to the training problem, we use a minibatch gradient descent rather than a standard gradient descent, ie, the gradients are computed on random samples of the training data at each iteration. We use adaptive moment estimation (Adam) (Kingma and Ba 2014) as the update rule for the neural network parameters. In addition to storing an exponentially decaying average of past squared gradients, ${v}_{t}$, as done by AdaDelta (Zeiler 2012) and RMSprop (Tieleman and Hinton 2012), Adam also keeps an exponentially decaying average of past gradients, ${m}_{t}$.
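The update rule can be sketched for a scalar parameter as follows (a minimal sketch with the default decay rates of Kingma and Ba (2014); the function name is ours):

```python
import math

def adam_minimize(grad, theta0, lr=0.001, b1=0.9, b2=0.999, eps=1e-8,
                  steps=1000):
    """Adam: keep decaying averages m (gradients) and v (squared
    gradients), correct their initialization bias, and rescale the
    step by 1/sqrt(v_hat)."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1.0 - b1) * g
        v = b2 * v + (1.0 - b2) * g * g
        m_hat = m / (1.0 - b1 ** t)      # bias correction
        v_hat = v / (1.0 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

On a one-dimensional quadratic such as $(\theta -3)^{2}$, the iteration drives $\theta $ toward 3.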
4 Optimal network for the global hedging problem
In this section, we compare the use of two feedforward networks with the LSTM network and an extension of the LSTM network that we propose. We first explain how to use the previously proposed feedforward network in the context of hedging a call option in the Black–Scholes model. We then detail our LSTM extension and compare the results obtained for the hedging problem without any hedging constraints. We show that the modified LSTM network gives the best results. Finally, we explain how to adapt our modified LSTM network to deal with liquidity constraints. In the following,
$${\stackrel{~}{S}}_{t}=\frac{{S}_{t}-\mathbb{E}[{S}_{t}]}{\sqrt{\mathbb{E}[{({S}_{t}-\mathbb{E}[{S}_{t}])}^{2}]}}$$ 
denotes a normalized version of ${S}_{t}$.
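A componentwise sketch of this normalization, with the mean and standard deviation estimated on a sample (the function name is ours):

```python
def normalize(samples):
    """Center and rescale a sample to zero mean and unit variance,
    mirroring the definition of S_tilde componentwise."""
    n = len(samples)
    mean = sum(samples) / n
    std = (sum((x - mean) ** 2 for x in samples) / n) ** 0.5
    return [(x - mean) / std for x in samples]
```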
Algorithm 1 describes the training procedure for the global resolution of the hedging problem.
4.1 Feedforward neural network architectures on the global hedging problem
A possible approach to solving the hedging problem described in Section 2.2 consists in training $N$ different feedforward neural networks (one per time step), as done in Han et al (2018) for the PDE case. This approach is illustrated in Figure 1 and described in Section 4.1.1. This architecture (hereafter referred to as feedforward control) generates a potentially high number of weights and biases to be estimated ($N\times \mathrm{depth}\times \mathrm{width}$). Another possibility is to train a single feedforward neural network fed with the prices and the time to maturity, as proposed in Chan-Wai-Nam et al (2019). This approach is illustrated in Figure 2 and described in Section 4.1.2. This architecture is hereafter referred to as feedforward merged control.
4.1.1 Feedforward control structure
In the feedforward control network, $N-1$ networks are fed successively with ${({\stackrel{~}{S}}_{{t}_{i}})}_{i=1,\mathrm{\dots},N-1}$. The feedforward networks are parameterized by $\theta $ (the weights and biases to be estimated). The $i$th feedforward neural network provides a $d$-dimensional control, ${\mathrm{\Delta}}_{{t}_{i}}({\stackrel{~}{S}}_{{t}_{i}},\theta )$. The first control, ${\mathrm{\Delta}}_{{t}_{0}}({\stackrel{~}{S}}_{{t}_{0}},\theta )$, and the premium, $p(\theta )$, are trainable variables. The final payoff is given by
${X}_{T}(\theta )$  $=p(\theta )+{\displaystyle \sum _{i=1}^{d}}{\displaystyle \sum _{j=0}^{N-1}}{\mathrm{\Delta}}_{{t}_{j}}^{i}({\stackrel{~}{S}}_{{t}_{j}},\theta )({F}_{{t}_{j+1}}^{i}-{F}_{{t}_{j}}^{i}).$ 
The problem (2.3) leads to the following optimization problem:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\,\mathrm{min}}\,L({X}_{T}(\theta )-g({S}_{T})).$$ 
4.1.2 Feedforward merged control structure
In the feedforward merged control structure, a single neural network is fed successively with ${({\stackrel{~}{S}}_{{t}_{i}})}_{i=1,\mathrm{\dots},N-1}$. For each pair $({t}_{i},{\stackrel{~}{S}}_{{t}_{i}})$, the network provides a control $\mathrm{\Delta}({t}_{i},{\stackrel{~}{S}}_{{t}_{i}},\theta )$, where $\theta $ represents the weights and biases to be estimated. Again, the first control, $\mathrm{\Delta}({t}_{0},{\stackrel{~}{S}}_{{t}_{0}},\theta )$, and the premium, $p(\theta )$, are trainable variables. The final payoff is given by
${X}_{T}(\theta )$  $=p(\theta )+{\displaystyle \sum _{i=1}^{d}}{\displaystyle \sum _{j=0}^{N-1}}{\mathrm{\Delta}}^{i}({t}_{j},{\stackrel{~}{S}}_{{t}_{j}},\theta )({F}_{{t}_{j+1}}^{i}-{F}_{{t}_{j}}^{i}).$ 
The problem (2.3) leads to the following optimization problem:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\,\mathrm{min}}\,L({X}_{T}(\theta )-g({S}_{T})).$$  (4.1) 
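The terminal wealth and the empirical counterpart of (4.1) can be sketched as follows; `strategy` stands for any candidate control map (for instance, the output of one of the networks above), and the function names are ours:

```python
def terminal_wealth(p, deltas, F_path):
    """Terminal wealth: premium plus, over assets and rebalancing dates,
    position times price increment. deltas[j][i] is the position in asset
    i over [t_j, t_{j+1}]; F_path[j][i] is the price of asset i at t_j."""
    X = p
    for j in range(len(F_path) - 1):
        for i in range(len(F_path[0])):
            X += deltas[j][i] * (F_path[j + 1][i] - F_path[j][i])
    return X

def empirical_mse(p, strategy, paths, payoff):
    """Empirical version of (4.1) under the MSE criterion (2.4)."""
    errs = [(terminal_wealth(p, strategy(path), path) - payoff(path)) ** 2
            for path in paths]
    return sum(errs) / len(errs)
```

With a zero strategy the wealth is the premium alone, so the problem then reduces to choosing the $p$ that minimizes $L(p-g({S}_{T}))$.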
4.2 Recurrent networks
Given the sequential nature of the hedging problem, the use of RNNs is relevant. This type of network is used, for example, in Chan-Wai-Nam et al (2019) for the PDE numerical resolution problem. As mentioned in Chung et al (2014), among all RNN architectures, LSTM neural networks (see Hochreiter and Schmidhuber 1997) present several advantages, including convergence speed and memory management. They allow, for example, for the network to be used with non-Markovian underlying models. This architecture will hereafter be referred to as classical LSTM. As more layers may represent more complex functions of the inputs, we propose to test whether adding a feedforward network to the LSTM cell output (as shown in Figure 4) helps improve the solution obtained. This composition of an LSTM cell and a feedforward network is hereafter referred to as an augmented LSTM cell.
The recurrent cell is fed with ${\stackrel{~}{S}}_{t}$. Its recursive calls on a sequence of inputs provide a sequence of underlying position changes (see Figure 3). At each date ${t}_{j}$, the recurrent cell produces a $d$-dimensional output depending on historical events and controls, ${\widehat{C}}_{j}(\theta ,{({\stackrel{~}{S}}_{{t}_{s}})}_{s\le j},{({\mathrm{\Delta}}_{{t}_{s}})}_{s\le j})$ (denoted simply by ${C}_{t}$ in Figure 4), which is not bounded. The values of $\mathrm{\Delta}$ for the strategy are calculated for $j=0,\mathrm{\dots},N-1$ and $i=1,\mathrm{\dots},d$ as
${\mathrm{\Delta}}^{i}({t}_{j},{({\stackrel{~}{S}}_{{t}_{i}})}_{i\le j},\theta )$  $={\displaystyle \sum _{k=0}^{j}}{\widehat{C}}_{k}^{i}({({\stackrel{~}{S}}_{{t}_{s}})}_{s\le j},\mathrm{\Delta}({t}_{s},{({\stackrel{~}{S}}_{{t}_{s}})}_{s\le j},\theta )).$  (4.2) 
The final payoff is then given by
${X}_{T}(\theta )$  $=p(\theta )+{\displaystyle \sum _{i=1}^{d}}{\displaystyle \sum _{j=0}^{N-1}}{\mathrm{\Delta}}^{i}({t}_{j},{({\stackrel{~}{S}}_{{t}_{k}})}_{k\le j},\theta )({F}_{{t}_{j+1}}^{i}-{F}_{{t}_{j}}^{i}).$ 
The problem (2.3) leads to the following optimization problem:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\,\mathrm{min}}\,L({X}_{T}(\theta )-g({S}_{T})).$$  (4.3) 
4.3 Extra parameters of the neural networks
The results of the neural networks depend on some extra parameters listed below. Unless otherwise specified, these parameters are shared for all the test cases.
 •
The batch size (the number of simulations we use at each iteration of the Adam optimizer) is equal to 50.
 •
The initial learning rate for Adam is equal to 0.001 (which is the default value).
 •
The number of units (ie, the dimension of ${M}_{t}$) in the LSTM cell is equal to 50.
 •
We use three ReLU layers of width 10 for the augmented LSTM cell.
 •
The data is normalized before being used by the neural networks. The mean and variance used for the normalization are computed once and for all over a subset of 100 000 simulations.
 •
Unless otherwise specified, the number of iterations in the gradient descent algorithm is equal to 20 000. Every 1000 iterations, we keep the neural network state if it gives a better loss on the test set than the previous state.
The following tests are done using TensorFlow (Abadi et al 2015), available from http://www.tensorflow.org.
4.3.1 Numerical comparison of global neural network architectures
Table 1 compares the mean squared hedging error of (2.4) obtained with the two feedforward architectures and the augmented LSTM architecture for a Black–Scholes call option (with trend $\mu $ and volatility $\sigma $) with no liquidity constraints. Results obtained with the Black–Scholes $\mathrm{\Delta}$ are also shown. After 20 000 iterations, the results obtained with the augmented LSTM architecture are better than those obtained with the two feedforward networks. Of course, in this almost complete market setting, the Black–Scholes $\mathrm{\Delta}$ is close to the optimal solution, and the Black–Scholes error would be zero in a continuoustime implementation. We can see that the replication error given by the augmented LSTM is close to the Black–Scholes error.
Mean squared error  
Black–Scholes $\mathrm{\Delta}$ ($N({d}_{1})$)  1.61e−05 
Feedforward delta [10, 10, 10]  1.32e−04 
Feedforward delta [10, 15, 30]  1.31e−04 
Feedforward merged [10, 10, 10]  1.37e−04 
Feedforward merged [10, 15, 30]  1.30e−04 
Augmented LSTM with 50 units [10, 10, 10]  1.73e−05 
In Table 2 we show the mean squared error (2.4) obtained with a classical LSTM cell on a vanilla call option free from liquidity constraints and on a two-market spread call option (having payoff ${({S}_{T}^{1}-{S}_{T}^{2}-K)}^{+}$). We compare this with the loss obtained with the augmented LSTM cell. We can see that, for the more complex payoff represented by the two-market spread, the augmented LSTM cell gives better results.
  Black–Scholes  Two-market spread 
Classical LSTM cell  5.73e−05  3.64e−04 
Augmented LSTM cell  3.97e−05  1.11e−04 
4.4 Adaptation of the recurrent architecture to deal with liquidity constraints
As the control in the case of liquidity constraints is bounded, the output of the network has to be transformed. To achieve this, we propose to use a $\mathrm{tanh}$ activation function as follows:
$$\Delta^i\bigl(t_j,(\tilde S_{t_s})_{s\le j},\theta\bigr) = l^i \sum_{k=0}^{j} \tanh\Bigl(\hat C_k^i\bigl((\tilde S_{t_s})_{s\le j},\Delta(t_s,(\tilde S_{t_s})_{s\le j},\theta)\bigr)\Bigr). \quad (4.4)$$
With this technique, the control difference between two time steps belongs to $[-l^i, l^i]$.
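As an illustration of this bounding technique, the following sketch (names are ours, not the paper's) squashes raw recurrent-cell outputs with $\tanh$ into increments of size at most $l$ and cumulates them over the hedging dates:

```python
import numpy as np

def bounded_controls(raw_cell_outputs, l):
    """Map unbounded network outputs to controls whose successive increments
    lie in [-l, l]: each increment is l * tanh(raw), then cumulated over time
    (the idea behind (4.4)). `raw_cell_outputs` is a hypothetical 1D array of
    recurrent-cell outputs, one per hedging date."""
    increments = l * np.tanh(np.asarray(raw_cell_outputs, dtype=float))
    return np.cumsum(increments)
```

Because $\tanh$ maps into $(-1, 1)$, every control increment is automatically feasible regardless of how large the raw network outputs are.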
5 Local algorithms for the hedging problem with constraints
The other two algorithms used to solve the hedging problem with constraints are local algorithms based on the dynamic-programming principle proposed in Warin (2019). The objective function to minimize is given by (2.3)–(2.4), which corresponds to a global variance-hedging problem. In Warin (2019), grids are used to discretize the asset level and regressions are used to calculate conditional expectations. As previously stated, these two algorithms can only be used to optimize variance problems.
We note that the two local machine learning algorithms proposed can be related to some recent work in Huré et al (2018, 2019) and Bachouch et al (2018).
We introduce the spaces for $\stackrel{~}{\mathrm{\Delta}}$ in ${\mathbb{R}}^{d}$:
$$W_i(\tilde\Delta) = \bigl\{(V,\Delta)\in\mathbb{R}\times\mathbb{R}^d,\ \mathcal{F}_{t_i}\text{-adapted, with } |\Delta^k-\tilde\Delta^k|\le l^k \text{ for } k=1,\dots,d\bigr\},$$
$$\Theta_i(\tilde\Delta) = \bigl\{(\Delta_i,\dots,\Delta_{N-1}),\ \text{where for } j\ge i \text{ the } \Delta_j \text{ are } \mathbb{R}^d\text{-valued, } \mathcal{F}_{t_j}\text{-adapted, with } |\Delta_i^k-\tilde\Delta^k|\le l^k \text{ and } |\Delta_{j+1}^k-\Delta_j^k|\le l^k\bigr\},$$
$$\hat W_i(\tilde\Delta) = \bigl\{(V,\Delta),\ \text{where } V \text{ is } \mathbb{R}\text{-valued, } \mathcal{F}_{t_i}\text{-adapted, and } \Delta\in\Theta_i(\tilde\Delta)\bigr\}.$$
As shown in Proposition 3.1 in Warin (2019), the problem (2.3), (2.4) can be written as
$$(\hat p,\hat\Delta) = \underset{p\in\mathbb{R},\,\Delta\in\Theta_0(0)}{\arg\min}\ \sum_{i=2}^{N} \mathbb{E}\left[\left(V_i - \sum_{k=1}^{d}\Delta_{i-1}^k(F_{t_i}^k-F_{t_{i-1}}^k) - V_{i-1}\right)^2\right] + \mathbb{E}\left[\left(V_1 - \sum_{k=1}^{d}\Delta_0^k(F_{t_1}^k-F_{t_0}^k) - p\right)^2\right], \quad (5.1)$$
where ${V}_{i}$ satisfies
$$V_N = g(S_T),$$
$$V_i = \mathbb{E}\left[g(S_T) - \sum_{k=1}^{d}\sum_{j=i}^{N-1}\Delta_j^k(F_{t_{j+1}}^k-F_{t_j}^k)\,\middle|\,\mathcal{F}_{t_i}\right] \quad \forall i=1,\dots,N-1. \quad (5.2)$$
5.1 First local algorithm
Equation (5.1) leads to a dynamicprogramming algorithm (shown in Algorithm 2). The optimal residual $R$ at date ${t}_{i}$, for current state ${S}_{{t}_{i}}$ and having in the portfolio an investment in ${\mathrm{\Delta}}_{i1}$ assets, can be written as
$$R(t_i,S_{t_i},\Delta_{i-1}) = \underset{(V,\Delta)\in\hat W_i(\Delta_{i-1})}{\min}\ \mathbb{E}\left[\left(g(S_T) - \sum_{k=1}^{d}\sum_{j=i}^{N-1}\Delta_j^k(F_{t_{j+1}}^k-F_{t_j}^k) - V\right)^2\,\middle|\,\mathcal{F}_{t_i}\right]. \quad (5.3)$$
Equation (5.1) then gives
$$R(t_i,S_{t_i},\Delta_{i-1}) = \underset{(V,\Delta)\in W_i(\Delta_{i-1})}{\min}\ \mathbb{E}\left[\left(\tilde V - \sum_{k=1}^{d}\Delta_i^k(F_{t_{i+1}}^k-F_{t_i}^k) - V\right)^2 + R(t_{i+1},S_{t_{i+1}},\Delta_i)\,\middle|\,\mathcal{F}_{t_i}\right], \quad (5.4)$$
where $\tilde V$ is the first component of the minimizer of (5.3) used to compute $R(t_{i+1},S_{t_{i+1}},\Delta_i)$.
In the special case where the prices are martingales, the pair $(\tilde V, V)$ in (5.4) is independent of the hedging strategy and is given by $(\mathbb{E}[g(S_T)\mid\mathcal{F}_{t_{i+1}}],\ \mathbb{E}[g(S_T)\mid\mathcal{F}_{t_i}])$. This gives us
$$R(t_0,S_{t_0},0) = \underset{\Delta\in\Theta_0(0)}{\min}\ \mathbb{E}\left[\sum_{i=0}^{N-1}\left(\mathbb{E}[g(S_T)\mid\mathcal{F}_{t_{i+1}}] - \sum_{k=1}^{d}\Delta_i^k(F_{t_{i+1}}^k-F_{t_i}^k) - \mathbb{E}[g(S_T)\mid\mathcal{F}_{t_i}]\right)^2\right]. \quad (5.5)$$
Only the hedging strategy remains to be calculated, which is achieved by solving the classical local minimum variance problem, leading to minimization at each time step:
$$\underset{\Delta\in\mathbb{R}^d}{\min}\ \mathbb{E}\left[\left(\tilde V - \sum_{k=1}^{d}\Delta^k(F_{t_{i+1}}^k-F_{t_i}^k) - V\right)^2\,\middle|\,\mathcal{F}_{t_i}\right]. \quad (5.6)$$
Our goal, then, is to use a neural network to approximate both the $V_i$ functions (the conditional expectations) and the optimal control $\Delta_i$ as functions of $S_{t_i}$ at each date $t_i$, by minimizing (5.6) at each time step via a backward recursion. Unlike $V$, the values of $\Delta$ are bounded due to liquidity constraints, and $\Delta_j\in[\underline\Delta_j,\overline\Delta_j]$, where the minimum constraints $\underline\Delta_j$ and maximum constraints $\overline\Delta_j$ are in $\mathbb{R}^d$.
To normalize the position in hedging products, we introduce
$$\hat\Delta_j = \psi_j(\Delta_j) := (\Delta_j - \underline\Delta_j)/(\overline\Delta_j - \underline\Delta_j)$$
such that ${\widehat{\mathrm{\Delta}}}_{j}\in [0,1]$.
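The normalization $\psi_j$ and its inverse can be written directly; a minimal sketch with scalar bounds, where `lo` and `hi` stand for $\underline\Delta_j$ and $\overline\Delta_j$:

```python
def normalize(delta, lo, hi):
    """psi_j: map a position delta in [lo, hi] to [0, 1]."""
    return (delta - lo) / (hi - lo)

def denormalize(x, lo, hi):
    """psi_j^{-1}: map a normalized position x in [0, 1] back to [lo, hi]."""
    return lo + x * (hi - lo)
```

In the multidimensional case the same formulas apply componentwise.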
At each time step, a feedforward neural network is used to estimate the portfolio value and the normalized command as functions of the normalized uncertainties and positions: $\hat V_j(\theta_j;\hat S_{t_j},\hat\Delta_j)$ and $\hat C(\theta_j;\hat S_{t_j},\hat\Delta_j)$. Algorithm 2 solves (5.6) via a backward recursion: at each time step, (5.7) is solved by a classical stochastic gradient descent, each function depending on normalized variables to ease the convergence of the method.
$$\theta_j^* = \underset{\theta}{\arg\min}\ \mathbb{E}\bigl[\bigl(U_{j+1}\bigl(\hat S_{t_{j+1}},\psi_{j+1}(\varphi_j(\theta;\hat S_{t_j},x))\bigr) - \varphi_j(\theta;\hat S_{t_j},x)\,(F_{t_{j+1}}-F_{t_j}) - \hat V_j(\theta;\hat S_{t_j},x)\bigr)^2 \,\big|\, \mathcal{F}_{t_j}\bigr], \quad (5.7)$$
and, at the initial date,
$$\underset{p\in\mathbb{R},\,\Delta_0\in[-l,l]}{\arg\min}\ \mathbb{E}\bigl[\bigl(U_1(\hat S_{t_1},\psi_1(\Delta_0)) - \Delta_0\,(F_{t_1}-F_{t_0}) - p\bigr)^2\bigr].$$
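In the paper, $V$ and the control are represented by neural networks at each date. As a much simplified illustration of the one-step problem (5.6), if we take a single asset and drop the conditioning (an unconditional minimization on Monte Carlo samples, not the paper's method), the minimizer is the least-squares coefficient of the value increment on the price increment:

```python
import numpy as np

def one_step_min_variance(dF, dV):
    """Unconditional, one-dimensional version of (5.6): with dV = V~ - V,
    argmin over delta of E[(dV - delta * dF)^2] = E[dV dF] / E[dF^2]."""
    dF = np.asarray(dF, dtype=float)
    dV = np.asarray(dV, dtype=float)
    return float(np.dot(dF, dV) / np.dot(dF, dF))
```

The neural networks in Algorithm 2 play the role of making this regression coefficient a function of the current state and position, which the closed-form sketch above ignores.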
Remark 5.1.
We create a single network for $\hat V_j$ and $\hat\Delta_j$, letting $\hat V_j$ depend on $\hat\Delta_{t_{j-1}}$ (the hedging position at the previous date). In this martingale case, it would have been possible to use two networks, the second being used to represent $V$ as a function of $\hat S$ only.
Remark 5.2.
The hedging position $x$ (normalized in ${[0,1]}^{d}$) is sampled uniformly in the algorithm. The ${\widehat{S}}_{{t}_{j}}$ are sampled according to their own empirical laws, and the ${\widehat{S}}_{{t}_{j+1}}$ are sampled conditionally on the ${\widehat{S}}_{{t}_{j}}$.
Remark 5.3.
The output of the neural network has unbounded values. In order to satisfy the constraints on the hedging positions, a $\tanh$ activation on the output layer, $\tanh(\hat C(\theta_j;\hat S_{t_j},\hat\Delta_j))$, leads to an output in $[-1,1]^d$.
5.2 Second local algorithm
The second local algorithm can be seen as a path generalization of the first, in which an optimization is performed at each time step in order to calculate the value function and the command for that step using the previously calculated commands. In the second local algorithm, the gain functional $\overline{R}$ is updated for each $\omega $. Consequently, $\overline{R}$ at date ${t}_{i}$, with an asset value ${S}_{{t}_{i}}$ for an investment ${\mathrm{\Delta}}_{i1}$ chosen at date ${t}_{i1}$, satisfies
$$\overline R(t_i,S_{t_i},\Delta_{i-1}) = g(S_T) - \sum_{k=1}^{d}\sum_{j=i}^{N-1}\Delta_j^k(F_{t_{j+1}}^k-F_{t_j}^k) = \overline R(t_{i+1},S_{t_{i+1}},\Delta_i) - \sum_{k=1}^{d}\Delta_i^k(F_{t_{i+1}}^k-F_{t_i}^k),$$
and, as shown in Warin (2019), at date ${t}_{i}$ the optimal control $\mathrm{\Delta}$ is associated with the minimization problem:
$$\underset{(V,\Delta)\in\mathbb{R}\times\mathbb{R}^d}{\min}\ \mathbb{E}\left[\left(\overline R(t_{i+1},S_{t_{i+1}},\Delta) - \sum_{k=1}^{d}\Delta^k(F_{t_{i+1}}^k-F_{t_i}^k) - V\right)^2\,\middle|\,\mathcal{F}_{t_i}\right].$$
This leads to the second local algorithm (Algorithm 3).
$$\theta_j^* = \underset{\theta}{\arg\min}\ \mathbb{E}\left[\left(g(S_T) - \sum_{k=j}^{N-1}\Delta_k(F_{t_{k+1}}-F_{t_k}) - \hat V_j(\theta;\hat S_{t_j},x)\right)^2\,\middle|\,S_{t_j}\right], \quad (5.8)$$
with
$$\Delta_j = \varphi_j(\theta;\hat S_{t_j},x), \qquad \Delta_{k+1} = \varphi_{k+1}(\theta_{k+1}^*;\hat S_{t_{k+1}},\psi_{k+1}(\Delta_k)) \quad \text{for } k\in[j,N-2],$$
and
$$\varphi_k(\theta;\hat S_{t_k},x) = \psi_k^{-1}(x) + l\tanh(\hat C_k(\theta;\hat S_{t_k},x)) \quad \text{for } k\in[j,N-1].$$
At the initial date,
$$\underset{p\in\mathbb{R},\,\Delta_0\in[-l,l]}{\arg\min}\ \mathbb{E}\left[\left(g(S_T) - \sum_{k=0}^{N-1}\Delta_k(F_{t_{k+1}}-F_{t_k}) - p\right)^2\right].$$
Each optimization is achieved using a stochastic gradient descent. Notice that the second local algorithm is far more costly than the first, as, at each time step, command values have to be evaluated from the current time to the maturity of the asset to hedge.
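A rough way to see this extra cost: per simulated path and per date optimization, the first local algorithm evaluates a single control, while the second re-evaluates the controls from the current date to maturity, so its total work across all dates grows quadratically. A small counting sketch (our simplification, ignoring constants):

```python
def control_evaluations(n_dates):
    """Total control evaluations per path across all per-date optimizations:
    the first local algorithm evaluates one control per date; the second
    resimulates the strategy from each date t_j up to maturity."""
    first = n_dates
    second = sum(n_dates - j for j in range(n_dates))  # = n(n+1)/2
    return first, second
```

Doubling the number of hedging dates therefore roughly quadruples the work of the second local algorithm, consistent with the running-time blow-up reported in Table 4.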
5.3 Parameters for the local algorithm
The parameters used in the optimization process are given below.
 •
At each time step, we use a classical feedforward network of four layers (one input layer, two hidden layers and one output layer) with 12 neurons. The first three layers use an exponential linear unit (ELU) activation function, while the output layer uses an identity activation function.
 •
The batch size (ie, the number of simulations we use at each iteration to produce an Adam gradient update) is 2000.
 •
At each time step, the number of iterations is limited to a value that increases with the dimension of the problem, from 5000 for the four-dimensional problem to 25 000 for problems whose dimension strictly exceeds four.
 •
The initial learning rate is ${10}^{-3}$.
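The per-date feedforward network described in the first bullet (two hidden layers of 12 neurons, ELU activations, identity output) can be sketched as a plain NumPy forward pass; the He-style initialization scale is our choice, not specified in the paper:

```python
import numpy as np

def elu(x):
    """Exponential linear unit activation."""
    return np.where(x > 0.0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def init_params(n_in, n_out, width=12, n_hidden=2, seed=0):
    """Random weights and zero biases for each affine layer
    (the He-style scale sqrt(2/fan_in) is an assumption on our part)."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [width] * n_hidden + [n_out]
    return [(rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Hidden layers use ELU; the output layer uses the identity activation."""
    *hidden, (W_out, b_out) = params
    h = np.asarray(x, dtype=float)
    for W, b in hidden:
        h = elu(h @ W + b)
    return h @ W_out + b_out
```

In the actual experiments the weights are of course trained with Adam minibatch updates rather than left at their initial values.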
6 Numerical results in the case of zero transaction costs using mean squared error
In this section, we compare the three machine-learning-based algorithms against StOpt (Gevret et al 2016), a tool based on stochastic control, using a fine discretization to evaluate the optimal variance.
6.1 Spread option payoff description
We use spread option problems to compare the three algorithms. The payoff in this section is defined for $d\ge 2$ by
$$g(S_T) = \ell_T\left(F_T^1 - \frac{1}{d-1}\sum_{i=2}^{d}F_T^i - K\right)^{+}. \quad (6.1)$$
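A sketch of the payoff (6.1) on simulated terminal futures prices, writing the load at maturity as a scalar `load` (the function and variable names are ours):

```python
import numpy as np

def spread_payoff(load, F_T, K):
    """Payoff (6.1): load * (F^1 - average of the other futures - K)^+.
    F_T has shape (n_paths, d) with d >= 2; column 0 holds F^1."""
    F_T = np.asarray(F_T, dtype=float)
    spread = F_T[:, 0] - F_T[:, 1:].mean(axis=1) - K
    return load * np.maximum(spread, 0.0)
```

For $d=2$ this reduces to the usual two-market spread option $(F_T^1 - F_T^2 - K)^+$ scaled by the load.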
The following parameters are used for all three algorithms.
 •
The maturity is $T=90$ days.
 •
$K$ is equal to $10$.
 •
The number of hedging dates $N$ is equal to $14$ (but the control on the last hedging date is trivial).
 •
The liquidity ${l}^{i}$ (ie, the maximum quantity we can buy or sell) at each date is equal to $0.2$ for all underlying assets.
 •
$F_0^1 = 40$, $\sigma_{1,E} = 0.004136$ days and $a_{1,E} = 0.0002$ days.
 •
The initial load associated with the option satisfies $\ell_0 = 1$.
The three cases take the following specific parameters.
Case 1 ($d=2$, $\sigma_\ell=0$) This is a four-dimensional case (having two assets, referred to as futures 1 and 2 in Figures 6–8, 13 and 14, and two hedging positions) with
 •
$F_0^2 = 30$, $\sigma_{2,E} = 0.003137$ days and $a_{2,E} = 0.0001$ days; and
 •
${\rho}_{1,2}=0.7$ is the correlation between the two assets.
Case 2 ($d=2$) This is a five-dimensional case, with the same parameters as the first case but with a varying load whose parameters are $\sigma_\ell = 0.02$ days and $a_\ell = 0.02$ days. The correlation between each of the tradable assets and the nontradable load $\ell$ is equal to $0.2$. Note that this is the only case where $\sigma_\ell > 0$.
Case 3 ($d=3$, $\sigma_\ell=0$) This is a six-dimensional case (having three assets, referred to as futures 1, 2 and 3 in the figures below) with
 •
$F_0^2 = 35$, $\sigma_{2,E} = 0.003137$ days and $a_{2,E} = 0.0001$ days;
 •
$F_0^3 = 25$, $\sigma_{3,E} = 0.005136$ days and $a_{3,E} = 0.0001$ days; and
 •
the correlation between assets $i$ and $j$ is denoted by ${\rho}_{i,j}$ and satisfies ${\rho}_{1,2}=0.7$, ${\rho}_{1,3}=0.3$, ${\rho}_{2,3}=0.5$.
6.1.1 Numerical results
In Table 3, the variance obtained for 100 000 common simulations is given for the three algorithms and compared with the variance obtained by the StOpt library. Notice that, because of the size of the problem, complete convergence for case 3 could not be obtained with the StOpt library.
For the two local algorithms (Algorithms 2 and 3), we run the optimization 10 times and take the best variance obtained.
The global algorithm (Algorithm 1) is far more efficient in terms of computing time than the local algorithms: 10 000 iterations run in 220 seconds on the graphics card of a Core i3 laptop, while the two local algorithms can take several hours to process case 3.
Mean squared error  Case 1  Case 2  Case 3 

Unhedged portfolio  8.3058  8.5250  10.5960 
Hedged with StOpt  0.3931  0.5160  0.4983 
Hedged with global algorithm  0.3920  0.5205  0.4852 
Hedged with first local algorithm  0.3971  0.5168  0.4763 
Hedged with second local algorithm  0.3912  0.5183  0.4943 
Stochastic control  Global  First local  Second local
0.271  0.265  0.259  0.262
In Figure 5, the losses for the market spreads using the global algorithm are plotted.
Figures 6–8 show the deltas for the underlying assets for three random price simulations (simulations numbered 1–3) for all four algorithms. For the two- and three-market spreads, all algorithms converge to deltas that have the same shape.
The numerical results indicate that the global and local algorithms give similar results. In low dimensions, the solutions obtained over 10 runs are quite stable, but as the dimension increases, the results obtained with the local algorithms tend to vary more from one run to another. Additionally, as the dimension increases, the local algorithms require a large increase in the number of iterations at each time step, leading to a noncompetitive running time compared with the global algorithm.
The global algorithm is still very effective in six dimensions, and, being able to solve the problem very quickly, is a candidate for providing a method to solve problems in very high dimensions.
One question that arises is how the three neuralnetworkbased algorithms perform when there is an increase in the number of decisions, ie, the number of hedging dates. To increase the number of hedging dates we can increase the maturity $T$ while keeping the same distance between two hedging dates. Because of the meanreverting nature of the chosen models, a more complex case consists in keeping $T=90$ days while increasing the number of hedging dates. In Table 4 we compute the error of the four algorithms with 28 hedging dates (instead of the previous 14) and a liquidity of 0.15 units per date (instead of 0.20). The three neural network approaches are effective in terms of accuracy. The time taken by the second local algorithm (Algorithm 3) explodes due to the resimulation at each date of the optimal strategy until maturity.
6.2 Testing different risk criteria
One of the advantages of the global neural network approach is its flexibility. There are no particular limitations on the models used (they may be Markovian or not, Gaussian or not, etc), and we can choose different loss functions. In this section, we derive the optimal controls from different loss functions.
In Figures 9–11, we plot the distribution (constructed with a kernel density estimate) of the hedged portfolio value using the loss functions defined in (2.4)–(2.6). In general, the two nonsymmetric loss functions give distribution shapes that differ from those given by the mean squared loss function. On the left-hand side, both the asymmetric loss curve and the moment 2/moment 4 loss curve lie below the mean squared loss curve. In the extreme left-hand tail (shown, for example, in Figure 11), only the mean squared loss function is represented: extreme losses are avoided by the moment 2/moment 4 and asymmetric loss functions. This is compensated for around the average (the middle of the distribution), where the losses for the two nonsymmetric loss functions are smaller. Some of the distribution mass is shifted to the right-hand side (the gain side). This is an attractive side effect: compared with the mean squared error, the moment 2/moment 4 and asymmetric loss functions tend to favor gains.
7 Numerical results for the option hedging problem with transaction costs
In this section, we investigate the effect of transaction costs when implemented in the global algorithm. We consider the cost of selling or buying a volume $k$ of $F^i$ to be equal to $|k|c^i$, with $c^i\ge 0$. As we are selling the derivative, the terminal wealth of the strategy $X_T$ and the associated transaction costs $Y_T$ are equal to
$$X_T = p + \sum_{j=1}^{d}\sum_{i=1}^{N-1}\Delta_{t_i}^j(S_{t_{i+1}}^j - S_{t_i}^j), \qquad Y_T = \sum_{j=1}^{d}\sum_{i=1}^{N-1}\bigl|\Delta_{t_i}^j - \Delta_{t_{i-1}}^j\bigr|\,c^j.$$
We use the criterion defined by
$$d^{\alpha}(X^{\Delta}, g(S_T)) = (1-\alpha)\,\mathbb{E}[Y_T] + \alpha\sqrt{\mathbb{E}[(X_T - g(S_T))^2]}, \qquad \alpha\in[0,1]. \quad (7.1)$$
This criterion describes a tradeoff between risk limitation and hedging costs. If $\alpha=1$, the criterion is equivalent to the variance minimization studied in Section 6.1.1; if $\alpha=0$, we simply minimize transaction costs regardless of risk (which corresponds to doing nothing).
$\alpha\in[0,1]$ is a parameterization of the Pareto frontier of the minimization tradeoff between risk and transaction costs. This problem is a portfolio-management problem, where $p$ is an input (and therefore not optimized) that we take equal to $\mathbb{E}[g(S_T)]$ in our numerical tests.
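A Monte Carlo estimate of the criterion (7.1) on simulated terminal wealths, transaction costs and payoffs might look as follows (a sketch; the names are ours):

```python
import numpy as np

def tradeoff_criterion(X_T, Y_T, payoff, alpha):
    """Monte Carlo estimate of (7.1):
    (1 - alpha) * E[Y_T] + alpha * sqrt(E[(X_T - g(S_T))^2])."""
    X_T, Y_T, payoff = (np.asarray(a, dtype=float) for a in (X_T, Y_T, payoff))
    return (1.0 - alpha) * Y_T.mean() + alpha * np.sqrt(((X_T - payoff) ** 2).mean())
```

At $\alpha=0$ this is just the average transaction cost, and at $\alpha=1$ it is the root-mean-square hedging error, matching the two extreme strategies described above.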
7.1 Training the Pareto frontier
Instead of training $N$ versions of the neural network for $N$ values of $\alpha$, we propose to add $\alpha$ to the input variables of the neural network (see Figure 4) and to randomly pick a value of $\alpha$ from a uniform distribution $\mathcal{U}(0,1)$ at each training iteration. By doing this, we add a dimension to the problem, but we obtain the optimal strategy for all $\alpha\in[0,1]$ at once. This differs from traditional algorithms, in which it is often preferred to evaluate $N$ functions defined on $\mathbb{R}^K$ instead of one function defined on $\mathbb{R}^{K+1}$. Getting the whole Pareto frontier is appealing for many reasons; for example, it allows us to retrieve the $\alpha$ corresponding to an expected transaction-cost target budget.
To obtain the Pareto-frontier estimate, we increase the width of the neural network (three hidden layers of 50 neurons, instead of the previous 10, for the projection part of the LSTM) and then run 100 000 iterations of minibatch gradient descent, instead of the 20 000 iterations that have been sufficient until now. The values of $\alpha$ are generated from a Sobol quasi-random generator.
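The $\alpha$-randomization described above amounts to appending one freshly drawn $\alpha$ per iteration as an extra input feature of every sample in the minibatch. A minimal sketch (here with a plain uniform draw, whereas the text above uses a Sobol quasi-random generator; names are ours):

```python
import numpy as np

def alpha_augmented_batch(batch, rng):
    """Draw one alpha per training iteration and append it as an extra input
    column to every sample of the minibatch, so that a single network learns
    the hedging strategy for the whole Pareto frontier at once."""
    alpha = rng.uniform(0.0, 1.0)
    alpha_col = np.full((batch.shape[0], 1), alpha)
    return np.concatenate([np.asarray(batch, dtype=float), alpha_col], axis=1), alpha
```

At inference time, fixing the appended column to any chosen $\alpha$ recovers the strategy for that point of the frontier without retraining.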
7.2 Numerical results
We consider the market spread options in cases 2 and 3 described in Section 6.1. The transaction cost is the same for all tradable risk factors and is set to 0.02 per unit of traded volume. In Figure 12 we plot the resulting average transaction cost and variance of hedged-portfolio values for different values of $\alpha$. As expected, when $\alpha\sim 1$, the strategy gives results similar to the pure variance minimization of Section 6.1.1; when $\alpha\sim 0$, we obtain results corresponding to an unhedged portfolio. In Figures 13 and 14, the deltas for cases 2 and 3 are plotted for two random price simulations with several values of $\alpha$. For lower values of $\alpha$, the algorithm prefers to reduce the control amplitude in order to reduce transaction costs.
8 Conclusion and perspectives
Three algorithms based on neural networks (one global algorithm and two local algorithms) dedicated to the hedging of contingent claims are proposed. The three algorithms show good results compared with techniques based on stochastic control. In particular, the global algorithm is interesting both in terms of execution speed and flexibility.
The global algorithm is tested with different well-known loss functions, and the use of an LSTM architecture in the global algorithm would allow the use of non-Markovian underlying models. Further, we propose a methodology for drawing a Pareto frontier. We apply this methodology to the tradeoff between maximizing the mean and minimizing the variance in the case involving transaction costs (parameterized by an $\alpha$ combining mean and variance in the objective function). The advantage of getting the whole Pareto frontier is threefold.
 •
It increases inference speed, as we do not need to retrain the algorithm with different parameterizations.
 •
It becomes easy to perform reverse engineering (eg, to get which $\alpha $ corresponds to a target transactioncost budget).
 •
It becomes easier to perform sensitivity analysis.
The drawback of the global algorithm when compared with algorithms based on stochastic control is the lack of convergence proof. However, the global algorithm allows the treatment of cases that are not attainable by any other technique.
Declaration of interest
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.
References
 Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: large-scale machine learning on heterogeneous systems. Preprint (arXiv:1603.04467).
 Bachouch, A., Huré, C., Langrené, N., and Pham, H. (2018). Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications. Preprint (arXiv:1812.05916).
 Beck, C., E, W., and Jentzen, A. (2017). Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Preprint (arXiv:1709.05963).
 Bouchard, B., Tan, X., and Warin, X. (2017). Numerical approximation of general Lipschitz BSDEs with branching processes. Preprint (arXiv:1710.10933).
 Carmona, R. (2008). Indifference Pricing: Theory and Applications. Princeton University Press (https://doi.org/10.1515/9781400833115).
 Chan-Wai-Nam, Q., Mikael, J., and Warin, X. (2019). Machine learning for semi-linear PDEs. Journal of Scientific Computing 79(3), 1667–1712 (https://doi.org/10.1007/s10915-019-00908-3).
 Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint (arXiv:1412.3555).
 E, W., Han, J., and Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics 5(4), 349–380 (https://doi.org/10.1007/s40304-017-0117-6).
 Föllmer, H., and Leukert, P. (1999). Quantile hedging. Finance and Stochastics 3(3), 251–273 (https://doi.org/10.1007/s007800050062).
 Gatheral, J. (2010). No-dynamic-arbitrage and market impact. Quantitative Finance 10(7), 749–759 (https://doi.org/10.1080/14697680903373692).
 Germain, M., Pham, H., and Warin, X. (2020). Deep backward multistep schemes for nonlinear PDEs and approximation error analysis. Preprint (arXiv:2006.01496).
 Gevret, H., Lelong, J., and Warin, X. (2016). StOpt, an open source library for stochastic control. Software library. URL: https://gitlab.com/stochasticcontrol/StOpt.
 Gobet, E., Pimentel, I., and Warin, X. (2018). Option valuation and hedging using asymmetric risk function: asymptotic optimality through fully nonlinear partial differential equations. Working Paper. URL: https://hal.archivesouvertes.fr/hal01761234.
 Han, J., Jentzen, A., and E, W. (2018). Solving highdimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences of the USA 115(34), 8505–8510 (https://doi.org/10.1073/pnas.1718942115).
 Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8), 1735–1780 (https://doi.org/10.1162/neco.1997.9.8.1735).
 Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3(5), 551–560 (https://doi.org/10.1016/08936080(90)900056).
 Huré, C., Pham, H., Bachouch, A., and Langrené, N. (2018). Deep neural networks algorithms for stochastic control problems on finite horizon: convergence analysis. Preprint (arXiv:1812.04300).
 Huré, C., Pham, H., and Warin, X. (2019). Some machine learning schemes for highdimensional nonlinear PDEs. Preprint (arXiv:1902.01599).
 Kabanov, Y. and Safarian, M. (2009). Markets with Transaction Costs: Mathematical Theory. Springer (https://doi.org/10.1007/9783540681212).
 Kingma, D. P., and Ba, J. (2014). Adam: a method for stochastic optimization. Preprint (arXiv:1412.6980).
 Lehalle, C.A., and Laruelle, S. (2013). Market Microstructure in Practice. World Scientific (https://doi.org/10.1142/8967).
 Leland, H. E. (1985). Option pricing and replication with transactions costs. Journal of Finance 40(5), 1283–1301 (https://doi.org/10.1111/j.15406261.1985.tb02383.x).
 Liang, S. (2017). Why deep neural networks for function approximation? Preprint (arXiv:1610.04161).
 Olah, C. (2015). Understanding LSTM networks. Blog post. URL: https://colah.github.io/posts/2015-08-Understanding-LSTMs.
 Potters, M., and Bouchaud, J.P. (2003). More statistical properties of order books and price impact. Physica A 324(1–2), 133–140 (https://doi.org/10.1016/S03784371(02)018964).
 Schweizer, M. (1999). A guided tour through quadratic hedging approaches. Discussion Paper, Interdisciplinary Research Project 373, Humboldt University of Berlin.
 Tankov, P. (2003). Financial Modelling with Jump Processes, Volume 2. CRC Press, Boca Raton, FL (https://doi.org/10.1201/9780203485217).
 Tieleman, T., and Hinton, G. (2012). Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. Lecture, COURSERA: Neural Networks for Machine Learning 4(2), 26–31.
 Toft, K. B. (1996). On the mean–variance tradeoff in option replication with transactions costs. Journal of Financial and Quantitative Analysis, 31(2), 233–263 (https://doi.org/10.2307/2331181).
 Warin, X. (2019). Variance optimal hedging with application to electricity markets. The Journal of Computational Finance, 23(3), 33–59 (https://doi.org/10.21314/JCF.2019.376).
 Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. Preprint (arXiv:1212.5701).