Differential machine learning: the shape of things to come

A derivative pricing approximation method using neural networks and AAD speeds up calculations by orders of magnitude


Brian Huge and Antoine Savine combine automatic adjoint differentiation with modern machine learning. In addition, they introduce general machinery for training fast, accurate pricing and risk approximations, applicable to arbitrary transactions or trading books, and arbitrary stochastic models, effectively resolving the computational bottlenecks of derivatives risk reports and regulations.

Pricing approximation has proved tremendously useful with advanced stochastic models. Many models do not allow fast, closed-form pricing and compute prices with slower numerical methods. Approximate analytics, such as the SABR formula (Hagan et al 2002), provide fast pricing in return for some degree of approximation, making it possible to practically apply sophisticated models on trading desks. Although researchers traditionally derived pricing approximations by hand, automated approximations inspired by machine learning (ML) have emerged in recent years (Horvath et al 2019; McGhee 2018; Ferguson & Green 2018); here, machines learn to encode fast pricing approximations from examples produced by numerical methods.

This article creates a general pricing and risk approximation factory, applicable to arbitrary derivatives instruments and stochastic models, where the simulation of the training set and the learning process are fast and reliable enough for online implementation, in real time, as part of a risk computation. Applications include custom risk reports, backtesting and regulations such as XVA, CCR, FRTB and Simm-MVA.[1]

[1] XVA: counterparty value adjustment (CVA) and friends. CCR: counterparty credit risk. FRTB: Fundamental Review of the Trading Book. Simm-MVA: standardised initial margin model-margin valuation adjustment.

We follow the classic ideas of Longstaff & Schwartz (2001) and Carriere (1996) and train machines with datasets of simulated cashflows. We apply modern ML in place of classic regression, along the lines of Lapeyre & Lelong (2019), for example.

The main breakthrough here is that we are training ML models using not only simulated cashflows but also the differentials of cashflows with respect to initial state variables, also known as pathwise differentials. Automatic adjoint differentiation (AAD) has provided the derivatives industry with differentials that are computed behind the scenes, automatically, accurately to machine precision, and very efficiently, for a computation cost not much higher than that of cashflow simulation (see Giles & Glasserman 2006). The present article discusses how to train ML models with pathwise differentials as well as why this makes a dramatic difference in pricing and risk approximation quality. Our numerical examples illustrate the capability of differential ML to reliably train ML models on small datasets simulated in real time.

A more complete write-up is available as a preprint on arXiv (Huge & Savine 2020). A TensorFlow implementation is available on GitHub,[2] along with the reproduction of some numerical examples from this article as well as practical implementation details not covered in the text. We have also posted appendixes with some important additions, including mathematical justifications and proofs, and extensions to other ML frameworks, such as regression (where differentials provide effective regularisation) and principal component analysis (where differentials allow the extraction of the main risk factors of a transaction or a trading book). The questions of asymptotic control and convergence guarantee, critical for reliable deployment in production, are also addressed in these appendixes.

[2] See https://github.com/differential-machine-learning/notebooks.

Training pricing approximations on simulated datasets

Machines learn pricing approximations from training sets of $m$ examples $(x^{(i)}, y^{(i)})$, $1 \le i \le m$. The training inputs $x^{(i)}$ are the initial state vectors in dimension $n$, which is often in the hundreds or thousands. The training labels $y^{(i)}$ are final payouts, defined as the discounted sum of all the cashflows in the instrument or trading book, simulated with an implementation of the stochastic model, with initial condition $x^{(i)}$. Hence, $y^{(i)} = h(x^{(i)}; \xi^{(i)})$, where $h$ is a Monte Carlo (MC) implementation under the pricing measure, and $\xi^{(i)}$ are the random numbers for path number $i$. Therefore, the $y^{(i)}$ are independent, unbiased (but noisy) estimates of the correct pricing function $f(x) = E[h(x; \xi)]$ evaluated at $x^{(i)}$, which justifies the practice of training pricing approximations on simulated cashflows.

The computation of each training example takes one MC path; hence, the computational cost to simulate the entire training set is similar to one pricing by MC. By contrast, training sets of ground-truth price labels require numerical methods for each training example, so the production of the dataset is too slow for online application.
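The sampling scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy one-dimensional Bachelier model with zero rates, a European call payoff, and a hypothetical function name `simulate_training_set` with illustrative parameter values. Each training example consumes exactly one simulated path.

```python
import numpy as np

def simulate_training_set(m, strike=110.0, sigma=20.0, maturity=1.0, seed=0):
    """Simulate m examples (x_i, y_i), one Monte Carlo path per example.

    Assumptions (illustrative only): 1D Bachelier dynamics, zero rates,
    initial states sampled uniformly over a hypothetical range.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(60.0, 160.0, size=m)        # training inputs x^(i): initial states
    xi = rng.standard_normal(m)                 # random numbers xi^(i), one per path
    s_T = x + sigma * np.sqrt(maturity) * xi    # terminal price on path i
    y = np.maximum(s_T - strike, 0.0)           # training labels y^(i) = h(x^(i); xi^(i))
    return x, y

x, y = simulate_training_set(8192)
```

Each label is a single noisy payoff rather than a price, so producing the whole set costs about as much as one MC pricing with 8,192 paths.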

The purpose of the training set $(x^{(i)}, y^{(i)})$ is to learn an approximation $\hat{f}(x)$ as close as possible to the correct pricing function $f(x) = E[Y \mid X = x]$. One definition of the conditional expectation $E[Y \mid X]$ is the function of $X$ closest to $Y$ in $L^2$, the $L^2$ distance between $\hat{f}(X)$ and $Y$ being estimated by the mean squared error (MSE) between the predictions $\hat{f}(x^{(i)})$ and labels $y^{(i)}$. It follows that approximations trained by minimisation of the MSE converge near the correct pricing functions in the limit of an infinite training set.[3]

[3] How ‘near’ depends on two things. First is the capacity of the approximation, that is, the size of the subspace of functions it is capable of representing. Neural networks are universal approximators (Hornik et al 1989), and deep neural networks, in particular, are notoriously high capacity approximators (Dixon 2019); hence, they are well suited to this context. Second is the ability of the training algorithm to find the global minimum of the MSE. This is critical for unsupervised online operation and is addressed in appendix 4 online.
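The claim that minimising the MSE targets the pricing function rests on a standard decomposition, spelled out here as a reminder. The candidate function $g$ is notation introduced for this note only; it stands for any square-integrable function of the inputs.

```latex
\begin{aligned}
E\big[(Y - g(X))^2\big]
  &= E\big[(Y - f(X))^2\big]
   + 2\,E\big[(Y - f(X))\,(f(X) - g(X))\big]
   + E\big[(f(X) - g(X))^2\big] \\
  &= E\big[(Y - f(X))^2\big] + E\big[(f(X) - g(X))^2\big],
  \qquad f(X) = E[Y \mid X]
\end{aligned}
```

The cross term vanishes by the tower property, since $E[Y - f(X) \mid X] = 0$. The first remaining term does not depend on $g$, so the expected squared error is minimised exactly when $g = f$.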

Overcoming limitations with differentials

Longstaff & Schwartz (2001) and Carriere (1996) recommend linear approximators. However, recent literature has largely illustrated the superiority of modern ML models such as deep neural networks, which are high capacity models, resilient in high dimension. Nevertheless, we found that the performance of modern ML remains insufficient in the context of complex transactions and trading books, where a vast number of training examples (often in the hundreds of thousands or millions) is necessary to learn accurate approximations, so the dataset cannot be simulated in real time. Our conclusion is consistent with the findings of Horvath et al (2019), Ferguson & Green (2018) and McGhee (2018), whose training consumes vast datasets and happens offline. High capacity ML models are prone to overfitting and require large training sets, even with regularisation.

We also observed that risk sensitivities converge considerably more slowly than values and often remain blatantly wrong, even with hundreds of thousands of training examples. Risk sensitivities are defined as the gradients of the (true) pricing function $f(x)$, and they are estimated as the gradients of the approximation $\hat{f}(x)$. Unfortunately, as is well known in numerical analysis, the derivatives of a good approximation are not always a good approximation of the derivatives.

The challenge addressed here is therefore to learn accurate approximations of pricing and risk functions, and to overcome overfitting, with training sets of limited size simulated in realistic time. Our main proposition is to augment training sets with pathwise differentials. Recall that training labels are payouts simulated with training inputs as initial conditions: $y^{(i)} = h(x^{(i)}; \xi^{(i)})$. Pathwise differentials are defined as the gradients of payouts with respect to initial state:

  $$\bar{x}^{(i)} = \frac{\partial h(x^{(i)}; \xi^{(i)})}{\partial x^{(i)}} = \frac{\partial y^{(i)}}{\partial x^{(i)}}$$

Those differentials, denoted $\bar{x}^{(i)} \in \mathbb{R}^n$, are defined path by path, hence the name pathwise differentials. They are also the differentials of training labels with respect to training inputs. We sometimes call them differential labels.
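To make the definition concrete, here is a minimal sketch on a toy case where the pathwise differential is known in closed form: a call payoff on one Bachelier path. The model, the parameter values and the function names `payoff` and `pathwise_differential` are illustrative assumptions; in production the differentials would come from AAD rather than a hand-written formula. The finite-difference line reuses the same random numbers on each path, which is what 'pathwise' means here.

```python
import numpy as np

def payoff(x, xi, strike=110.0, sigma=20.0, maturity=1.0):
    """Call payoff h(x; xi) on one Bachelier path started at x (zero rates assumed)."""
    return np.maximum(x + sigma * np.sqrt(maturity) * xi - strike, 0.0)

def pathwise_differential(x, xi, strike=110.0, sigma=20.0, maturity=1.0):
    """dh/dx on the same path: 1 when the path ends in the money, else 0."""
    return (x + sigma * np.sqrt(maturity) * xi > strike).astype(float)

rng = np.random.default_rng(0)
x = rng.uniform(60.0, 160.0, size=1000)     # initial states
xi = rng.standard_normal(1000)              # one random number per path
y = payoff(x, xi)                           # labels
x_bar = pathwise_differential(x, xi)        # differential labels

# Central finite difference on the SAME random numbers, path by path.
eps = 1e-4
fd = (payoff(x + eps, xi) - payoff(x - eps, xi)) / (2 * eps)
```

The call payoff has only a kink at the strike, so its pathwise derivative exists almost everywhere. Discontinuous payoffs such as digitals are the case that calls for smoothing.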

The most effective way to compute pathwise differentials is AAD. We do not cover AAD in this article, and refer instead to the founding article Giles & Glasserman (2006) and the vast amount of literature that has followed. Pathwise differentials carry more information than is typically used, eg, by averaging for risk reports. They tell us how the initial state affects cashflows in many different scenarios. We can leverage this information in many ways. The algorithms in what follows learn high-quality pricing functions precisely from differential labels. Differentials make a spectacular difference in approximation performance, particularly with smaller datasets. Differentiation by AAD or bumping requires cashflow smoothing to obtain well-defined pathwise derivatives.

Differential machine learning

Notation

To avoid unnecessarily heavy notation, we work in the context of feedforward neural networks, but the findings carry over to other ML models in a straightforward fashion. The mathematical structure of feedforward networks greatly simplifies the exposition. Besides, neural networks are well suited to pricing approximation, due to their high capacity and resilience in high dimension, as previously noted. Finally, trained neural networks are capable of very efficient evaluation and differentiation. Neural nets predict values and derivatives with almost analytic speed, making them suitable for use on complex risk reports, backtesting and regulatory capital.

Let us briefly recall the evaluation and differentiation of feedforward networks to establish the necessary notation for what follows.

Feedforward equations

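Define the input (row) vector $x \in \mathbb{R}^n$ and the predicted value $y \in \mathbb{R}$. For every layer $l = 1, \dots, L$ in the network, define a scalar ‘activation’ function $g_{l-1}: \mathbb{R} \to \mathbb{R}$, applied element-wise over vectors or matrices. We denote by $w_l \in \mathbb{R}^{n_{l-1} \times n_l}$, $b_l \in \mathbb{R}^{n_l}$ the weights and biases of layer $l$.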

The network is defined by its feedforward equations:

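  $$\left.\begin{aligned}
  z_0 &= x \\
  z_l &= g_{l-1}(z_{l-1})\, w_l + b_l, \quad l = 1, \dots, L \\
  y &= z_L
  \end{aligned}\right\} \qquad (1)$$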

where $z_l \in \mathbb{R}^{n_l}$ is the row vector containing the $n_l$ pre-activation values, also called units or neurons, in layer $l$.

Backpropagation

Feedforward networks are efficiently differentiated by backpropagation, which is generally applied to compute the derivatives of some cost function with respect to the weights and biases of the network. For now, we are not interested in those differentials but in the differentials of the predicted value $y = z_L$ with respect to the inputs $x = z_0$. Recall that inputs are initial states and predictions are approximate prices; hence, these differentials predict risk sensitivities (Greeks). Backpropagation of (1) is of the form:

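  $$\left.\begin{aligned}
  \bar{z}_L &= \bar{y} = 1 \\
  \bar{z}_{l-1} &= (\bar{z}_l\, w_l^T) \circ g'_{l-1}(z_{l-1}), \quad l = L, \dots, 1 \\
  \bar{x} &= \bar{z}_0
  \end{aligned}\right\} \qquad (2)$$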

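with the adjoint notation $\bar{x} = \partial y / \partial x$, $\bar{z}_l = \partial y / \partial z_l$, $\bar{y} = \partial y / \partial y = 1$, where $\circ$ is the element-wise (Hadamard) product.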
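Equations (1) and (2) translate almost line by line into array code. The sketch below is a plain NumPy illustration, not the TensorFlow implementation from the authors' notebook. It assumes softplus activations (whose derivative is the sigmoid) for $g_1, \dots, g_{L-1}$ and takes $g_0$ to be the identity; both choices, and the function name `twin_forward`, are assumptions made for the example.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)            # log(1 + exp(z)), numerically stable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # derivative of softplus

def twin_forward(x, weights, biases):
    """Evaluate (1) then (2): return the values y and the differentials x_bar.

    x has shape (m, n). weights[k] holds w_{k+1} with shape (n_k, n_{k+1}).
    Assumption: g_0 is the identity, g_1..g_{L-1} are softplus.
    """
    # Feedforward equations (1): zs[l] stores the pre-activations z_l.
    zs = [x]
    for k, (w, b) in enumerate(zip(weights, biases)):
        a = zs[-1] if k == 0 else softplus(zs[-1])
        zs.append(a @ w + b)
    y = zs[-1]                              # shape (m, 1)

    # Backpropagation equations (2), starting from the boundary y_bar = 1.
    z_bar = np.ones_like(y)
    for k in range(len(weights) - 1, 0, -1):
        z_bar = (z_bar @ weights[k].T) * sigmoid(zs[k])
    x_bar = z_bar @ weights[0].T            # g_0 is the identity, so g_0' = 1
    return y, x_bar

rng = np.random.default_rng(0)
dims = [3, 5, 3, 1]                         # n0, n1, n2, n3, as in figure 1
weights = [rng.standard_normal((a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [rng.standard_normal(b) for b in dims[1:]]
x = rng.standard_normal((4, 3))             # four examples stacked in rows
y, x_bar = twin_forward(x, weights, biases)
```

The backward loop reuses the stored pre-activations and the same weight matrices, transposed, which is the weight sharing between the two halves of the twin network.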

Figure 1: Feedforward neural network and backpropagation, unrolled in a twin network.

Notice the similarity between (1) and (2). Backpropagation defines a second feedforward network with inputs $\bar{y}, z_0, \dots, z_L$ and output $\bar{x} \in \mathbb{R}^n$ that shares weights with the first network, where the units in the second network are the adjoints of the corresponding units in the original network. This realisation leads to a combined twin network for the prediction of values and risks.

Figure 1 illustrates a feedforward network with $L = 3$ and $n = n_0 = 3$, $n_1 = 5$, $n_2 = 3$, $n_3 = 1$, together with backpropagation (left). It also shows how the forward and reverse passes are unrolled in a twin network (right).

Twin networks

We can combine feedforward (1) and backpropagation (2) equations into a single network representation, or twin network, corresponding to the computation of a prediction (approximate price) together with its differentials with respect to inputs (approximate risk sensitivities).

The first half of the twin network (see figure 1, right) is the original network, traversed with feedforward induction to predict a value. The second half is computed with the backpropagation equations to predict risk sensitivities. It is the mirror image of the first half, with shared connection weights.

A mathematical description of the twin network is simply obtained by concatenation of (1) and (2). Evaluating the twin network returns a predicted value $y$ and its differentials $\bar{x}$ with respect to the $n_0 = n$ inputs $x$. The combined computation evaluates a feedforward network of twice the initial depth. Like feedforward induction, backpropagation computes a sequence of matrix-by-vector products. The twin network, therefore, predicts prices and risk sensitivities with twice the computational complexity of value prediction alone, irrespective of the number of Greeks. Hence, a trained twin net approximates prices and risk sensitivities with respect to potentially many states in a particularly efficient manner.

Note from (2) that the units of the second half are activated with the differentials $g'_l$ of the original activations $g_l$. To backpropagate through the twin network, we need continuous activation throughout. Hence, the initial activation must be $C^1$, ruling out, for example, ReLU.

A TensorFlow implementation of the twin network is available on GitHub,[4] along with a detailed discussion of the practical implementation details, which we skip here. That notebook also implements the training of the twin network with a dataset augmented with differential labels, which we will discuss next.

[4] See https://github.com/differential-machine-learning/notebooks/blob/master/DifferentialML.ipynb.

Training with differential labels

The purpose of the twin network is to estimate the correct pricing function $f(x)$ by an approximate function $\hat{f}(x; \{w_l, b_l\}_{l=1,\dots,L})$. It learns optimal weights and biases from an augmented training set $(x^{(i)}, y^{(i)}, \bar{x}^{(i)})$, where $\bar{x}^{(i)} = \partial y^{(i)} / \partial x^{(i)}$ are the differential labels.

Here, we describe the mechanics of differential training and discuss its effectiveness. As is customary with ML, we stack training data in matrices, with examples in rows and units in columns:

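  $$X = \begin{pmatrix} x^{(1)} \\ \vdots \\ x^{(m)} \end{pmatrix} \in \mathbb{R}^{m \times n}, \quad Y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix} \in \mathbb{R}^{m}, \quad \bar{X} = \begin{pmatrix} \bar{x}^{(1)} \\ \vdots \\ \bar{x}^{(m)} \end{pmatrix} \in \mathbb{R}^{m \times n}$$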

Note that (1) and (2) apply to matrices or row vectors identically. Hence, the evaluation of the twin network computes the matrices:

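  $$Z_l = \begin{pmatrix} z_l^{(1)} \\ \vdots \\ z_l^{(m)} \end{pmatrix} \in \mathbb{R}^{m \times n_l} \quad \text{and} \quad \bar{Z}_l = \begin{pmatrix} \bar{z}_l^{(1)} \\ \vdots \\ \bar{z}_l^{(m)} \end{pmatrix} \in \mathbb{R}^{m \times n_l}$$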

respectively in the first and second half of its structure. Training consists in finding weights and biases minimising some cost function C:

  $$\{w_l, b_l\}_{l=1,\dots,L} = \arg\min C(\{w_l, b_l\}_{l=1,\dots,L})$$

Classic training with payouts alone

Recall classic deep learning. We have seen that the approximation obtained by global minimisation of the MSE converges to the correct pricing function (modulo finite capacity bias), hence:

  $$C(\{w_l, b_l\}_{l=1,\dots,L}) = \mathrm{MSE} = \frac{1}{m} (Z_L - Y)^T (Z_L - Y)$$

The second half of the twin network does not affect cost; hence, training is performed by backpropagation through the standard feedforward network. A conventional deep learning algorithm, the practical details of which are covered in the demonstration notebook, trains the network by minimising the MSE.

Differential training with differentials alone

Let us change gears and train with pathwise differentials $\bar{x}^{(i)}$ instead of payouts $y^{(i)}$, by minimisation of the MSE (denoted $\overline{\mathrm{MSE}}$) between the differential labels (pathwise differentials) and the predicted differentials (estimated risk sensitivities):

  $$C(\{w_l, b_l\}_{l=1,\dots,L}) = \overline{\mathrm{MSE}} = \frac{1}{m}\, \mathrm{tr}\big[(\bar{Z}_0 - \bar{X})^T (\bar{Z}_0 - \bar{X})\big]$$

Here, we must evaluate the twin network in full to compute $\bar{Z}_0$, effectively doubling the cost of training. Gradient-based methods minimise $\overline{\mathrm{MSE}}$ by backpropagation through the twin network, effectively accumulating second-order differentials in its second half. A deep learning framework such as TensorFlow performs this computation seamlessly. As we have seen, the second half of the twin network may represent backpropagation; in the end, this is just another sequence of matrix operations, easily differentiated by another round of backpropagation, carried out silently and behind the scenes. The implementation in the demonstration notebook is identical to training with payouts, save for the definition of the cost function. TensorFlow automatically invokes the necessary operations, evaluating the feedforward network when minimising MSE and the twin network when minimising $\overline{\mathrm{MSE}}$.

In practice, we must also assign appropriate weights to the costs of wrong differentials in the definition of the $\overline{\mathrm{MSE}}$, as covered in the implementation notebook and discussed on GitHub.

Let us now discuss what it means to train approximations by minimising the $\overline{\mathrm{MSE}}$ between pathwise differentials $\bar{x}^{(i)} = \partial y^{(i)} / \partial x^{(i)}$ and predicted risks $\partial \hat{f}(x^{(i)}) / \partial x^{(i)}$. Given appropriate smoothing, expectation and differentiation commute so the (true) risk sensitivities are expectations of pathwise differentials:

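  $$\frac{\partial f(x)}{\partial x} = \frac{\partial E[Y \mid X = x]}{\partial x} = E\left[\frac{\partial Y}{\partial X} \,\middle|\, X = x\right]$$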

It follows that pathwise differentials are unbiased estimates of risk sensitivities, and approximations trained by minimising the $\overline{\mathrm{MSE}}$ converge (modulo finite capacity bias) to a function with correct differentials; hence, the right pricing function, modulo an additive constant.

Therefore, we can choose to train by minimisation of value or derivative errors and converge near the correct pricing function all the same. This consideration is, however, an asymptotic one. Training with differentials converges near the same approximation, but it converges much faster, allowing us to train accurate approximations with much smaller datasets, as we will see in the numerical examples. This is because of the following.

  • The effective size of the dataset is much larger: with $m$ training examples, we have $nm$ differentials ($n$ being the dimension of the inputs $x^{(i)}$). With AAD, we effectively simulate a much larger dataset for a minimal additional cost, especially in high dimension (where classical training struggles most).

  • The neural nets pick up the shape of the pricing function, learning from slopes rather than points, which results in a much more stable and potent education, even with few examples.

  • The neural approximation learns to produce correct Greeks by construction, not only correct values. By learning the correct shape, the ML approximation also correctly orders values in different scenarios, which is critical in applications such as value-at-risk and expected loss, including for FRTB.

  • Differentials act as an effective, bias-free regularisation, as we will see next.

Differential training with everything

The best numerical results are obtained by combining values and derivatives errors in the cost function:

  $$C = \mathrm{MSE} + \lambda\, \overline{\mathrm{MSE}}$$

which is implemented in the demonstration notebook with the two previous strategies as particular cases. Note the similarity with classic regularisation of the form $C = \mathrm{MSE} + \lambda \times \text{penalty}$. Ridge (Tikhonov) and Lasso regularisations impose a penalty for large weights (in $L^2$ and $L^1$ metrics, respectively), effectively preventing overfitting in small datasets by stopping attempts to fit noisy labels. In return, classic regularisation reduces the effective capacity of the model and introduces a bias, along with a strong dependency on the hyperparameter λ. This hyperparameter controls regularisation strength and tunes the vastly documented bias-variance trade-off. If one sets λ too high, their trained approximation ends up as a horizontal line.

Differential training also stops attempts to fit noisy labels, with a penalty for wrong differentials. It is, therefore, a form of regularisation, but a very different kind. It does not introduce bias, since we have seen that training on differentials alone converges to the correct approximation too. This breed of regularisation comes without bias-variance trade-off. It reduces variance for free. Increasing λ hardly affects the results in practice.

Differential regularisation is more similar to data augmentation in computer vision, which is, in turn, a more powerful form of regularisation. Differentials are additional training data. Like data augmentation, differential regularisation reduces variance by increasing the size of the dataset for little cost. Differentials are new data of a different type, computed on the same sample paths as existing data, but they reduce variance all the same, and without the introduction of a bias.
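The combined cost of values and derivatives errors is easiest to see on a model that is linear in its parameters, where the minimiser has a closed form. The sketch below swaps the neural network for a one-dimensional polynomial regression, in the spirit of the regression extension mentioned in the appendixes; it is an illustrative assumption, not the notebook's code, and the function name `fit_differential_regression` and its parameters are hypothetical. Setting the normal equations of MSE plus `lam` times the differential MSE to zero gives one linear system.

```python
import numpy as np

def fit_differential_regression(x, y, x_bar, degree=5, lam=1.0):
    """Minimise MSE + lam * MSE_bar over polynomial coefficients (1D inputs).

    phi holds the basis values x^p, dphi the basis derivatives p * x^(p-1).
    Assumption: inputs are already normalised to a moderate range.
    """
    powers = np.arange(degree + 1)
    phi = x[:, None] ** powers
    dphi = powers * x[:, None] ** np.maximum(powers - 1, 0)
    lhs = phi.T @ phi + lam * dphi.T @ dphi      # normal equations, left-hand side
    rhs = phi.T @ y + lam * dphi.T @ x_bar       # normal equations, right-hand side
    return np.linalg.solve(lhs, rhs)

# Sanity check on noise-free data from a known quadratic.
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2                   # exact values
x_bar = 2.0 - 6.0 * x                            # exact differentials
w = fit_differential_regression(x, y, x_bar, degree=2, lam=1.0)
```

On this noise-free data the value and differential targets agree on the same coefficients, so the fit does not depend on `lam`. With noisy payoff labels, the differential term supplies extra information about the slope at every sample point.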

Numerical results

Let us now review some numerical results and compare the performance of differential and conventional ML. We picked three examples from relevant textbooks and real-world situations in which neural networks learned pricing and risk approximations from small datasets.

We kept the neural architecture constant in all the examples, with four hidden layers of 20 softplus-activated units. We trained neural networks on mini-batches of normalised data using the Adam optimiser and a one-cycle learning rate schedule. The demonstration notebook and appendixes discuss all the details. A differential training set takes two to five times longer to simulate with AAD, and it takes twice as long to train twin nets than to train standard ones. In return, we are going to see that differential ML performs up to a thousandfold better on small datasets.

Basket options

The first (textbook) example is a basket option in a correlated Bachelier model for seven assets:

  $$dS_t = \sigma\, dW_t$$

where $S_t \in \mathbb{R}^7$ and $dW_t^j\, dW_t^k = \rho_{jk}$. The task is to learn the pricing function of a one-year call option on a basket, with strike 110 (we normalised asset prices at 100, without loss of generality, and basket weights sum to 1). The basket price is also Gaussian in this model; hence, Bachelier’s formula gives the correct price. This example is also of particular interest because, although the input space is seven dimensional, we know from maths that actual pricing is one dimensional. Can the network learn this property from data?
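For reference, the closed-form benchmark can be sketched as below. This is an illustration under stated assumptions: zero rates, a covariance matrix of asset increments per unit time supplied by the caller, and hypothetical volatilities and correlations in the usage lines (the article does not list the values used). The function name `bachelier_basket_call` is also an assumption.

```python
import math
import numpy as np

def bachelier_basket_call(spots, weights, cov, strike, maturity):
    """Price and deltas of a call on the basket weights @ spots.

    cov[j, k] = sigma_j * sigma_k * rho_jk per unit time; zero rates assumed.
    """
    basket = float(weights @ spots)
    vol = math.sqrt(float(weights @ cov @ weights) * maturity)  # basket std dev at expiry
    d = (basket - strike) / vol
    cdf = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))            # standard normal CDF
    pdf = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)     # standard normal density
    price = (basket - strike) * cdf + vol * pdf                 # Bachelier's formula
    deltas = cdf * weights                                      # d price / d spots
    return price, deltas

n = 7
spots = np.full(n, 100.0)
weights = np.full(n, 1.0 / n)
vols = np.full(n, 20.0)                          # hypothetical absolute volatilities
corr = np.full((n, n), 0.5) + 0.5 * np.eye(n)    # hypothetical flat 50% correlation
cov = np.outer(vols, vols) * corr
price, deltas = bachelier_basket_call(spots, weights, cov, strike=110.0, maturity=1.0)
```

Price and deltas depend on the seven spots only through the basket value, which is the one-dimensional structure the network is asked to discover.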

Figure 2: Basket option in Bachelier model, dimension seven.

We trained neural networks and predicted values and derivatives in 1,024 independent test scenarios (see figure 2), with initial basket values on the horizontal axis and option prices/deltas on the vertical axis (we show one of the seven derivatives). These are compared with the correct results computed with Bachelier’s formula. We also trained networks on 1,024 (1k) and 65,536 (64k) paths, with cross-validation and early stopping. The twin network with 1k examples performed better for values than the classical net with 64k examples, and a lot better for derivatives. In particular, it learned that the option price and deltas are a fixed function of the basket, as evidenced by the thinness of the approximation curve. The classical network did not learn this property well, even with 64k examples. It overfitted the training data and predicted different values or deltas for various scenarios on the seven assets with virtually identical baskets.

We also compared test errors with standard MC errors (also with 1k and 64k paths). The main point of pricing approximation is to avoid nested simulations with similar accuracy. We see that the error of the twin network is, indeed, close to MC; classical deep learning error is an order of magnitude larger. Finally, we trained with eight million samples and verified that both networks converge to similarly low errors (not zero, due to finite capacity), while the MC error does converge to zero. The twin network gets there hundreds of times faster.

Worst-of autocallables

Our second example is an exotic instrument, a four-underlying version of the popular worst-of autocallable trade, in a more complicated model, a collection of four correlated local volatility models à la Dupire:

  $$dS_t^j = \sigma_j(t, S_t^j)\, dW_t^j, \quad j = 1, \dots, 4$$

where $dW_t^j\, dW_t^k = \rho_{jk}$. The example is relevant not only due to its popularity, but also because of its path dependence, barriers and massive final digitals, which are notoriously hard for numerical algorithms. Smoothing was applied so the pathwise differentials are well defined.

We trained the classical network on 32k samples with cross-validation and early stopping, and the twin network on 8k paths with AAD pathwise derivatives. We generated both sets in around half a second using Superfly, Danske Bank’s proprietary derivatives pricing and risk management system.

In figure 3 we show the results for the value on the leftmost chart and the delta with respect to the second underlying in the middle chart for 128 independent examples, with correct numbers on the horizontal axis. We did not have a closed-form solution for reference; instead, we computed ‘correct’ numbers with nested MC simulations. The predicted prices and deltas are displayed on the vertical axis. Performance is measured by distance to the 45° line.

The twin network with only 8k pieces of training data produces a virtually perfect approximation for prices and a decent approximation for deltas. The classical net also approximates values correctly with 32k paths, although not in a straight line; this may cause problems in applications where correct ordering is critical, such as expected loss or FRTB. Its deltas are essentially random, which disqualifies it from the approximation of risk, eg, from Simm-MVA. Once again, the test error with the differential net is of a similar order to that of a nested MC. The root MSE for the classical net, with four times the training size, is three times larger for values and an order of magnitude larger for differentials.

Derivatives trading books

For our final example, we picked a real netting set from Danske Bank’s portfolio, including single and cross-currency swaps and swaptions in 10 different currencies, eligible for XVA, CCR or other regulated capital computations. We simulated paths in Danske Bank’s model of everything (‘the beast’), with interest rates simulated using four-factor term structure models correlated with both one another and forex rates.

This example is representative of how we want to apply twin nets in the real world. It is also a stress test for neural networks. The Markov dimension of the four-factor non-Gaussian Cheyette model is 16 per currency: this amounts to 160 inputs, 169 including forexes, and over 1k with all the path dependencies. Of course, the value effectively only depends on a small number of factors, which is something a neural net is supposed to identify. In reality, the extraction of risk factors is considerably more effective in the presence of differential labels (see appendix 2 online).

The rightmost chart of figure 3 shows the values predicted by a twin network trained on 8k samples with AAD pathwise derivatives being compared with a vanilla network trained on 64k paths. Performance improvement is evident. The differential net produces a virtually perfect approximation. As with the previous example, the vertical axis displays the predicted values for an independent set of 128 samples, with correct values, obtained by nested MC, on the horizontal axis. The entire training process for the twin network (on an entry-level GPU), including the generation of the 8k examples (on a multi-threaded CPU), took a few seconds on a standard workstation.

Figure 3: Worst-of-four autocallable (left) and real-world netting set (right).

We also verified that the differential approximation error is, again, similar to the nested MC error, while the classical deep learning error is an order of magnitude larger.

Extensions

We have presented algorithms in the context of single value prediction to avoid confusion and too much notation. To conclude, we discuss two advanced extensions allowing the network to predict multiple values and higher-order derivatives simultaneously.

Multiple outputs

Horvath et al (2019) predict call prices of multiple strikes and expiries in a single network, exploiting correlation and shared factors, and encouraging the network to learn global features like no-arbitrage conditions. We can combine our approach with this idea by an extension of the twin network to compute multiple predictions, meaning $n_L > 1$ and $y = z_L \in \mathbb{R}^{n_L}$. The adjoints are no longer well defined as vectors. So, we define them as directional differentials with respect to some specified linear combination of the outputs $y c^T$, where $c \in \mathbb{R}^{n_L}$ has the coordinates of the desired direction in a row vector:

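  $$\bar{x} = \frac{\partial\, y c^T}{\partial x}, \quad \bar{z}_l = \frac{\partial\, y c^T}{\partial z_l}, \quad \bar{y} = \frac{\partial\, y c^T}{\partial y} = c$$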

Given a direction $c$, all the previous equations apply identically, except that the boundary condition for $\bar{y}$ in the backpropagation equations is no longer 1 but rather the row vector $c$. For example, $c = e_1$ means that adjoints are defined as derivatives of the first output $y_1$. We can repeat this for $c = e_1, \dots, e_{n_L}$ to compute the derivatives of all the outputs with respect to all the inputs, $\partial y / \partial x \in \mathbb{R}^{n_L \times n}$, ie, the Jacobian matrix. Written in matrix terms, the boundary is the identity matrix $I \in \mathbb{R}^{n_L \times n_L}$, and the backpropagation equations are written as follows:

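  $$\begin{aligned}
  \bar{z}_L &= \bar{y} = I \\
  \bar{z}_{l-1} &= (\bar{z}_l\, w_l^T) \circ g'_{l-1}(z_{l-1}), \quad l = L, \dots, 1 \\
  \bar{x} &= \bar{z}_0
  \end{aligned}$$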

where $\bar{z}_l \in \mathbb{R}^{n_L \times n_l}$. In particular, $\bar{x} \in \mathbb{R}^{n_L \times n}$ is (indeed) the Jacobian matrix $\partial y / \partial x$. To compute a full Jacobian, the theoretical order of calculations is $n_L$ times the vanilla network. However, the implementation of the multiple backpropagation in the matrix form above on a system like TensorFlow automatically benefits from CPU or GPU parallelism. Therefore, the additional computation complexity will be experienced as sublinear.
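The matrix form of the backpropagation equations can be checked on a small example. The NumPy sketch below handles a single input row vector and assumes, for illustration only, softplus activations with an identity $g_0$; the function name `jacobian` is hypothetical. The only change from the single-output case is the boundary condition, which becomes the identity matrix, so that each row of the adjoint carries one output direction.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)            # log(1 + exp(z)), numerically stable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # derivative of softplus

def jacobian(x, weights, biases):
    """Return y of shape (n_L,) and dy/dx of shape (n_L, n) for one input x of shape (n,).

    Assumption: g_0 is the identity, g_1..g_{L-1} are softplus.
    """
    zs = [x]
    for k, (w, b) in enumerate(zip(weights, biases)):
        a = zs[-1] if k == 0 else softplus(zs[-1])
        zs.append(a @ w + b)
    y = zs[-1]
    z_bar = np.eye(y.shape[0])              # boundary condition y_bar = I
    for k in range(len(weights) - 1, 0, -1):
        # Row c of z_bar is the adjoint for direction e_c; rows propagate together.
        z_bar = (z_bar @ weights[k].T) * sigmoid(zs[k])
    return y, z_bar @ weights[0].T          # g_0 is the identity, so g_0' = 1

rng = np.random.default_rng(1)
dims = [3, 5, 2]                            # n = 3 inputs, n_L = 2 outputs
weights = [rng.standard_normal((a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [rng.standard_normal(b) for b in dims[1:]]
x = rng.standard_normal(3)
y, jac = jacobian(x, weights, biases)
```

All output directions are pushed through the same matrix products at once, which is where a parallel backend recovers most of the theoretical $n_L$-fold cost.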

Higher-order derivatives

The twin network can also predict higher-order derivatives. For simplicity, we revert to the single prediction case where $n_L = 1$. The twin network predicts $\bar{x}$ as a function of the input $x$. The neural network, however, does not know anything about derivatives. It just computes numbers by a sequence of equations. Hence, we might as well consider the prediction of differentials as multiple outputs.

As previously, in what is now considered a multiple prediction network, we can compute the adjoints of the outputs $\bar{x}$ in the twin network. These are now the adjoints of the adjoints:

  $$\bar{\bar{x}} = \frac{\partial\, \bar{x} c^T}{\partial x} \in \mathbb{R}^n$$

in other words, the Hessian matrix of the value prediction $y$. The original activation functions must be $C^2$ for this computation. The computation of the full Hessian is of order $n$ times the original network. These additional calculations generate a lot more data, one value, $n$ derivatives and $\tfrac{1}{2} n(n+1)$ second-order derivatives for the cost of $2n$ times the value prediction alone. In a parallel system like TensorFlow, the experience also remains sublinear. We can extend this argument to arbitrary order $q$, with the only restriction that the (original) activation functions are $C^q$.

Conclusion

Throughout our analysis, we have seen that learning the correct shape from pathwise differentials makes a critical difference in the performance of regression models, including neural networks, in computational tasks such as the pricing and risk approximation of complex derivatives trading books. The unreasonable effectiveness of what we call differential ML allows us to reliably train accurate ML models on a small number of simulated examples in real time so they are suitable for online learning. Differential networks apply to real-world problems, including regulations and risk reports with multiple scenarios. Twin networks predict prices and Greeks with almost analytic speed, and their empirical test errors remain of comparable magnitude to those of nested MCs even though nested MCs are orders of magnitude slower to calculate.

Differential training also appears to stabilise the calibration of neural networks. We consistently observed a greatly improved resilience to hyperparameters such as the network architecture, the seeding of weights or the learning-rate schedule. Explaining why is a topic for further research.

Differential machine learning is similar to data augmentation in computer vision, which creates multiple labelled images out of a single image through cropping, zooming, rotating or recolouring. In addition to extending the training set for negligible cost, data augmentation encourages the ML model to learn important invariances. Data augmentation has been immensely successful in computer vision applications. Similarly, derivatives labels not only increase the amount of information in the training set, but also teach the model the shape of the pricing function.

Brian Huge and Antoine Savine are quantitative researchers with Superfly Analytics at Danske Bank.

Email: brian.huge@danskebank.dk,
Email: antoine.savine@danskebank.dk.
