Gaussian mixture model dynamically controlled kernel estimation (GMM-DCKE), a purely data-driven and model-agnostic method to compute conditional expectations, is introduced. Joerg Kienitz applies it to the pricing and hedging of (multi-dimensional) exotic Bermudan options and to calibration and pricing within stochastic local volatility models

Fast and accurate approximations of conditional expectations and their respective distributions are essential for many problems arising in quantitative finance. The least squares Monte Carlo (LSM) method (see Longstaff & Schwartz 2001) is the market standard for simulation-based estimation. It is well known that LSM requires a large number of paths and faces problems in estimating the tails. Tail estimates are crucial, eg, for risk management or pricing deep out-of-the-money options. Convergence depends on the choice of basis functions, but if LSM converges to the price, its derivative might not converge. Variance reduction, basis function selection and local regression have been considered for mitigation.

In this work, we built upon previous work by Geng et al (2022), but the proposed method is different from dynamically controlled kernel estimation (DCKE) since it does not rely on numerical kernel density estimation including (local) bandwidth selection and it does not need to apply Gaussian process regression (GPR) for interpolation or extrapolation and smoothing. The key difference is that it replaces the numerical methods with analytic expressions that lead simultaneously to smooth approximations.

GMM-DCKE allows us to apply further state variables in the sense of control variates (CVs). This improves upon the basic method and allows for analytical computation of proxy hedge sensitivities. Combining the GMM, CVs and properties of multivariate Gaussians, the contributions of this paper to the literature are the following:

• the calculation of model-agnostic, data-driven conditional expectations on observed realisations at given time points even for multi-modal distributions with semi-analytic methods;

• the GMM is fitted numerically, but all subsequent calculations are analytic, and thus they are fast; and

• the method implies closed-form solutions for smooth proxy hedge sensitivities.

For illustration, the method is applied to pricing of (multi-dimensional) Bermudan derivatives and for pricing and calibration with stochastic local volatility models.11 1 A longer exposition and pedagogical illustrations can be found in Kienitz (2021) or at https://github.com/Lapsilago/GMM_DCKE.

To fix notation we take a discrete time grid $\mathcal{T}=\{T_{0},T_{1},\dots,T_{N}\}$ with realisations of $d$ stochastic risk factors $S(T)=(S_{1}(T),\dots,S_{d}(T))^{\mathrm{T}}$, $\smash{T\in\mathcal{T}}$.

### GMM

Let $K\in\mathbb{N}$ be a positive integer and let $Z_{k}$, $k=1,\dots,K$, be $d$-dimensional random variables with $Z_{k}\sim\mathcal{N}_{d}(\mu_{k},\varSigma_{k})$ (denoting a $d$-dimensional Gaussian distribution, with $\mu_{k}\in\mathbb{R}^{d}$ and $\varSigma_{k}\in\mathbb{R}^{d\times d}$ denoting the corresponding mean vectors and covariance matrices, respectively). Let $n_{d}(\cdot;\mu_{k},\varSigma_{k})$ be the corresponding probability density. We consider the $d$-variate $K$-component GMM $Z$ determined by the probability distribution $p_{Z}$:

 $p_{Z}(x)=\sum_{k=1}^{K}\pi_{k}n_{d}(x;\mu_{k},\varSigma_{k}),\quad\pi_{k}>0,~{% }\sum_{k=1}^{K}\pi_{k}=1$

$\pi_{k}$ are called mixing weights and $\operatorname{GMM}(K)$ denotes a GMM with $K$ components. Once $\pi_{k}$, $\mu_{k}$ and $\varSigma_{k}$, $k=1,\dots,K$, are fixed, the conditional distribution can be calculated analytically. Let:

 $Z=(Z_{1},\dots,Z_{m},Z_{m+1},\dots,Z_{m+n=d})^{\mathrm{T}},\qquad Z\sim% \mathcal{N}_{d}(\mu_{Z},\varSigma_{Z})$

We denote the realisations by $X$ and the labels by $Y$. We set $Z=(Y,X)^{\mathrm{T}}$ with $m$-dimensional and $n$-dimensional Gaussians $Y=(Z_{1},\dots,Z_{m})^{\mathrm{T}}$ and $X=(Z_{m+1},\dots,Z_{m+n})^{\mathrm{T}}$, respectively; the conditionals, $Y\mid X$, and marginals for $Y$ and $X$ are available in closed form, with:

 $\mu_{Z}=\begin{pmatrix}\mu_{Y}\\ \mu_{X}\end{pmatrix}\varSigma=\begin{pmatrix}\varSigma_{YY}&\varSigma_{YX}\\ \varSigma_{XY}&\varSigma_{XX}\end{pmatrix}$

Taking $Y$, for instance, and denoting by $p_{z}(y,x)$ the joint density with regard to $Z$, we have:

 $\displaystyle p_{Y}(y)$ $\displaystyle=\int p_{Z}(y,x)\mathrm{d}x\sim\mathcal{N}_{m}(\mu_{Y},\varSigma_% {YY})$ $\displaystyle p_{Y\mid X=x}(y)$ $\displaystyle=\frac{p_{Z}(y,x)}{\int p_{Z}(y,x)\mathrm{d}x}\sim\mathcal{N}_{n}% (\mu_{Y\mid X},\varSigma_{Y\mid X})$

and the means and variances for the conditional distribution are given by:

 $\displaystyle\mu_{Y\mid X}$ $\displaystyle=\mu_{Y}-\varSigma_{YX}\varSigma_{XX}^{-1}(X-\mu_{X})$ (1) $\displaystyle\varSigma_{Y\mid X}$ $\displaystyle=\varSigma_{YY}-\varSigma_{YX}\varSigma_{XX}^{-1}\varSigma_{XY}$ (2)

respectively. Using these results for each single component $k=1,\dots,K$ of $\operatorname{GMM}(K)$ we have:

 $p_{k,Y\mid X=x}(y)=\frac{n_{k}(y,x)}{\int n_{k}(y,x)\mathrm{d}x}\sim\mathcal{N% }_{n}(\mu_{k,Y\mid X},\varSigma_{k,Y\mid X})$

and for the conditional distribution of $\operatorname{GMM}(K)$:

 $\displaystyle p_{Y\mid X}$ $\displaystyle=\sum_{k=1}^{K}\tilde{\pi}_{k}n_{k,Y\mid X}$ (3) $\displaystyle\tilde{\pi}_{k}$ $\displaystyle\sim\frac{\pi_{k}\mathcal{N}_{m}(\mu_{k,X},\varSigma_{k,XX})}{% \sum_{l}\pi_{l}\mathcal{N}_{m}(\mu_{l,X},\varSigma_{l,XX})}$ (4)

For a given $\operatorname{GMM}(K)$, the conditional model is determined by (3) and (4). The key observation is that all conditional expectations, variances and covariances can be calculated analytically using (1) and (2).

### The parameters

Let the set $\mathcal{X}$ be of size $N$, and for $x\in\mathcal{X}$ we assume $x\in\mathbb{R}^{d}$. Then, $\mathcal{X}$ is represented by an $N\times d$ matrix. Assume the number of components $K$ for $\operatorname{GMM}(K)$ is given. If $\theta$ denotes the collection of all parameters $\pi_{k}$, $\mu_{k}$ and $\varSigma_{k}$, $k=1,\dots,K$, then $\theta$ is obtained by maximising the (log)likelihood:

 $l(\theta)=\sum_{n=1}^{N}\log\bigg{(}\sum_{k=1}^{K}\pi_{k}n_{d}(x_{n},\mu_{k},% \varSigma_{k})\bigg{)}$

by expectation maximisation (EM) (Dempster et al 1977).

### GMM-DCKE algorithm

#### Input

##### Training set:

$\mathcal{X}=(x_{1},\dots,x_{N})$, $x_{n}\in\mathbb{R}^{d}$, with labels $\mathcal{Y}=(y_{1},\dots,y_{N})$, $y_{n}\in\mathbb{R}$. $(\mathcal{X},\mathcal{Y})$ are joint realisations of some random variables $X$$Y$. The underlying risk factors are denoted by $X$, and $Y$ is a function of $X$, eg, a payoff at maturity $T$ for Bermudan derivatives or a variance value for a stochastic local volatility model.

##### Test set:

Let $\mathcal{X}^{*}=(x^{*}_{1},\dots,x^{*}_{M})$, $x^{*}_{n}\in\mathbb{R}^{d}$, be a test set.

#### Output

The outputs are predictions $y^{*}=(y^{*}_{1},\dots,y^{*}_{M})$, $y^{*}_{j}\in\mathbb{R}$, $y^{*}_{j}\approx\mathbb{E}[Y\mid X=x^{*}_{j}]$. For our examples, these are the conditional expected values of a Bermudan derivative, or the expected variance values given the value of the risk factor in a stochastic volatility model. For in-sample performance we take $\smash{\mathcal{X}^{*}=\mathcal{X}}$.

#### Parameters

The number of components is given by $K$ and, if we use a control variate, its realisations $\mathcal{Z}=(Z_{1},\dots,Z_{N})$. Including $\mathcal{Z}$ into the fitting of $\operatorname{GMM}(K)$, quantities such as the conditional mean, given by $\mu_{Z}(x):=\mathbb{E}[Z\mid X=x]$, are calculated analytically.

#### Calculation

The first (numeric) step is to fit $\operatorname{GMM}(K)$ to approximate the joint distribution of $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ using EM. The next steps are analytic, using (1) and (2). To apply the CVs in order to stabilise the model we consider:

 $\displaystyle Y^{*}:=Y\mid X+\beta_{X}(Z\mid X-\mu_{Z\mid X})$ (5)

with $\beta_{X}$ minimising the variance, (5):

 $\displaystyle\beta_{X=x}:=-\frac{\operatorname{Cov}[Y,Z\mid X=x]}{% \operatorname{Var}[Z\mid X=x]}$ (6)

For $x\in\mathcal{X}$, with $\mu_{Z\mid X=x}$ a conditional mean, the $\beta_{X=x}\in\mathbb{R}^{d}$ are calculated analytically as a quotient of conditional covariance and conditional variance, respectively, for Gaussian distributions.

The quotient $\beta_{X}$ can be seen as a (conditional) trading strategy with respect to $Z$, given the values $X$ that minimise the variance for $Y^{*}$. If $Z$ is one of the risk factors, it is the (time-discrete) conditional minimal-variance delta, and thus a time-discrete, model-agnostic, data-driven hedge can be calculated analytically. Other choices, depending on the payoff, may be used to improve the estimate, eg, the maximum or minimum of the assets, when considering ‘best-of’ or ‘rainbow’ options.

### Components and enhancements

The number $K$ of components can be determined by empirical results or by statistical methods, eg, a Bayesian Gaussian mixture, silhouette scores or minimising information criteria such as the Akaike and Bayesian information criteria:

 $\mathrm{AIC}:=2K-2\ln(\hat{L});\quad\mathrm{BIC}=K\ln(N)-2\ln(\hat{L})$

respectively, where $\smash{\hat{L}:=p(x\mid\hat{\theta},M)}$, with $\smash{\hat{\theta}}$ the parameters maximising the likelihood, $x\in\mathcal{X}$ and $N$ the number data points. Taking the model where the AIC (or BIC) is smallest may lead to overfitting. Discrete gradients, ie, differences in AIC and BIC, for $\operatorname{GMM}(i)$ and $\operatorname{GMM}(i+1)$, respectively, can be considered. The number of components is chosen once this difference does not significantly increase further.

Silhouette scores exploit the compactness and separability of the clusters implied by the GMM. The Jensen-Shannon metric (JS) for $\operatorname{GMM}(i)$ and $\operatorname{GMM}(i+1)$ is calculated, and the number $i$ for which we observe a sudden jump in the JS value is chosen.

Known limitations are that the approximation is only valid for sample sizes $N$ that are much larger than the number of parameters in the model.

A method for enhancement is ‘bagging’, an ensemble method that consists of bootstrapping and aggregation (see Bishop 2006). In our setting we perform GMM-DCKE $k_{\mathrm{bag}}$ times with $N_{\mathrm{bag}}$ observations randomly sampled from $\mathcal{X}$ with replacement and averaged over the number of experiments. Transformations can also be used. For instance, keeping values, especially of option prices or implied volatilities, positive is essential. (When applying Gaussian distributions, negative values are not ruled out.) If we have a set of strictly positive values $\mathcal{A}$, we instead work with the set $\mathcal{A}_{\varepsilon,\mathrm{log}}$ obtained by applying the transform $\log(a+\varepsilon)$, $\varepsilon>0$, for each $a\in\mathcal{A}$. Applying the GMM to this new training set and using the inverse transform $\exp(a-\varepsilon)$ leads to strictly positive values.

## Comparison with other approaches

Our work is related to that of Geng et al (2022), which introduced a kernel density estimator with a (local) bandwidth used to approximate the conditional expected values and Gaussian process regression used for smoothing and interpolation or extrapolation. Selecting a local bandwidth may be slow and depends strongly on the dimensionality. In contrast, our method is semi-analytic, stable and only depends on one parameter: the number of Gaussian regressions, which needs to be determined.

The work is also related to that of Halperin (2017), which presents a discrete-time option pricing model that is rooted in reinforcement learning. Our method essentially gives more direct access to this approach, circumventing the reinforcement learning part. Huge & Savine (2020) train a neural network, taking the hedge, ie, the sensitivities, into account to stabilise the calculations. Techniques such as kernel density estimation in multiple dimensions, reinforcement learning and neural networks are data and computationally intense when required to perform with high accuracy. GMM-DCKE derives the option price, conditional expectations and hedges accurately while applying sparse parameter sets and has significant advantages with regard to both the amount of data and the computation time, especially when considering high-dimensional problems.

## Bermudan derivatives

For pricing Bermudan derivatives, for each consecutive pair $T_{i}$, $T_{i+1}\in\mathcal{T}$, we fit a GMM to the observed realisations and consider a discrete proxy hedging approach. Recall that if $S_{t}$ is the risky underlying, $r$ the risk-free rate and $B_{t}$ the bank account, then the self-financing condition of any hedging strategy $u_{t}$ implies that, for any time interval $[t,t+\Delta]$:

 $u_{t}S_{t+\Delta}+\mathrm{e}^{r\Delta}B_{t}=u_{t+\Delta}S_{t+\Delta}+B_{t+\Delta}$

Denote by $\varPi_{t}$ the replication portfolio of an option using a hedging strategy $u_{t}$ with respect to the available information, ie, the filtration $\mathcal{F}_{t}$. The hedging strategy $u_{t}^{*}$ that minimises the variance $\operatorname{Var}[\varPi_{T}\mid\mathcal{F}_{t}]$ is given by:

 $\frac{\operatorname{Cov}[\varPi_{t+1},\Delta S_{t}\mid\mathcal{F}_{t}]}{% \operatorname{Var}[\Delta S_{t}\mid\mathcal{F}_{t}]}$

The computation of the latter can be computationally intense. Notice that the minimal-variance hedging strategy $u_{t}^{*}$ is merely the minimal-variance delta. Taking this as a proxy, we include it in the calculation of conditional values to reduce the conditional variances. With GMM-DCKE the variance minimiser is available in closed form.

For pricing, we need to approximate the continuation value at each possible exercise date. This involves computing a conditional expectation. To illustrate our results we focus on a given time point $t$. For the full valuation this has to be applied in a backward algorithm. We choose the rough Bergomi and Bates models for illustration (Bates 1996; Bayer et al 2016).22 2 We have actually applied the method to diffusion, jump-diffusion and pure jump processes. Let $W_{1}(t)$, $W_{2}(t)$ be Brownian motions with correlation coefficient $\rho\in[-1,1]$. Take $S_{0}=s_{0}$, $V_{0}=v_{0}$ as the initial values for the asset price and the variance, respectively. The stochastic differential equations (SDEs) for the evolution of the Bates model are:

 \left.\begin{aligned} \displaystyle\frac{\mathrm{d}S_{t}}{S_{t}}&\displaystyle% =r\mathrm{d}t+\sqrt{V_{t}}\mathrm{d}W_{t}+(Y-1)\mathrm{d}N(t)\\ \displaystyle\mathrm{d}V_{t}&\displaystyle=\kappa(\theta-V_{t})\mathrm{d}t+\nu% \sqrt{V_{t}}\mathrm{d}Z_{t}\end{aligned}\right\} (7)

The parameters in (7) are $r$, the risk-neutral rate; $\kappa$, the mean reversion of the variance; $\theta$, the long-term variance; and $\nu$, the volatility of variance. For a Poisson process $N$ with intensity $\lambda\geq 0$, $Y\sim\mathcal{N}_{1}(\mu_{j},\sigma_{j})$, $\mu_{j}\in\mathbb{R}$ and $\sigma_{j}\in\mathbb{R}^{+}$.

For the rough Bergomi model we take the logarithmic price process $X_{t}:=\log S_{t}$, $X_{0}=\log(s_{0})=:x_{0}$. To model the instantaneous variance we consider the instantaneous forward variance $\xi_{t}^{u}$, $u\geq t$, for date $u$ observed at $t$. Let $\alpha=H-\frac{1}{2}\in(-\frac{1}{2},0]$ be a negative exponent depending on the Hurst exponent $H\in(0,\frac{1}{2}]$. For $\eta>0$, we have $X_{0}=x_{0}$ and $\xi_{0}^{t}=0$. The rough Bergomi model is given by:

 \left.\begin{aligned} \displaystyle\mathrm{d}X_{t}&\displaystyle=-\tfrac{1}{2}% V_{t}\mathrm{d}t+\sqrt{V_{t}}\mathrm{d}W_{1}(t)\\ \displaystyle\mathrm{d}\xi_{t}^{u}&\displaystyle=\xi_{t}^{u}\eta\sqrt{2\alpha+% 1}(u-t)^{\alpha}\mathrm{d}W_{2}(t)\end{aligned}\right\} (8)

with the variance obtained from (8) by setting:

 $V_{t}=\xi\exp(\eta Y-\tfrac{1}{2}\eta^{2}t^{2\alpha+1})$

with $Y$ being the corresponding realisation of a Volterra process. We assume that a payoff $h$ is given and approximate the corresponding option price at time $t$.

We set the parameters as follows: $s_{0}=100$, $r=0.01$, $v_{0}=0.3^{2}$, $\theta=0.32^{2}$, $\kappa=0.2$, $\nu=0.3$ and $\rho=-0.5$, $\lambda=0.2$, $\mu_{j}=0$ and $\sigma_{j}=0.25$ for the Bates model. For the rough Bergomi model we take $a=-0.43$, $\xi=0.235^{2}$, $\eta=0.9$ and $\smash{\rho=-0.5}$.

### Single underlying

We consider a call option and its delta for the Bates and rough Bergomi models. There is no obvious choice of $K$, since the number of components depends on the shape of the distribution, eg, skewness or kurtosis. Our numerical experiments have shown that the chosen number of Gaussian regressions ($3$ and $5$, respectively) lead to accurate results. In practice we recommend validating this assumption using cross validation or information criteria.

For numerical illustration we consider time $t=0.5$ and plot the conditional expectations for the calls and their deltas, together with quantiles $q\%$ and $(1-q)\%$, respectively ($q=0.5,1,5,10$), and compare them with both semi-analytical and nested Monte Carlo results (see figure 1). The absolute difference to these values, which serves as a benchmark, is only a few basis points and is much smaller than for the LSM. Some discrepancy is observed with respect to the proxy hedge, but this is due to the size of the time interval. The hedge still compares well to the instantaneously balanced hedge. Values of $t$ closer to $T$ lead to better results and improve upon the results stated in Geng et al (2022). Since we choose the underlying as the control variate, we derive the time-discrete minimal-variance delta analytically. All the simulations were done using 10,000 paths.

Let us summarise some of the findings. First, we observe that analytic smooth option prices and hedge sensitivities can be obtained using GMM-DCKE in a model-agnostic and data-driven way. Second, we observe the impact of the control variate; essentially, by including the proxy, we control the tail behaviour of the estimator. Thus, depending on the choice of underlying, we add linear tail behaviour, which means a slope of $0$ for the left tail and $1$ for the right tail, since our example is a call option. Compared with those in Geng et al (2022), the resulting hedges are smoother and more accurate, even for large time steps, eg, half-yearly ($t=0.5$). The difference from the displayed hedge sensitivities calculated using the model specifics is due to the numerical fit of the $\operatorname{GMM}(K)$ and because we consider time-discrete settings and not a continuous setting.

We are able to compute hedge sensitivities for the rough Bergomi model without nested Monte Carlo simulation.

For interest rate derivatives we consider the pricing of an at-the-money Bermudan swaption with the Cheyette model with and without stochastic volatility. Taking $x_{0}=y_{0}=0$ and $z_{0}=1$, the model is given by:

 \left.\begin{aligned} \displaystyle\mathrm{d}x_{t}&\displaystyle=(y_{t}-\kappa x% _{t})\mathrm{d}t+\eta(t)\mathrm{d}Z_{1}(t)\\ \displaystyle\mathrm{d}y_{t}&\displaystyle=(\eta(t)^{2}-2\kappa y_{t})\mathrm{% d}t\\ \displaystyle\mathrm{d}z_{t}&\displaystyle=\beta(\theta-z_{t})\mathrm{d}t+% \varepsilon\sqrt{z_{t}}\mathrm{d}Z_{2}(t)\\ \displaystyle\eta(t)&\displaystyle=\sigma_{0}\sqrt{z_{t}}(mx_{t}-(1-m)s_{0})% \end{aligned}\right\} (9)

with $Z_{i}$, $i=1,2$, being uncorrelated Brownian motions. We take $\kappa=0.03$, $\beta=0.4$, $\theta=1$, $\varepsilon=0.8$, $\sigma_{0}=0.2$, $m=0.15$ and $s_{0}=0.043011$. Using an alternating direction implicit (ADI) implementation with a grid of 200 points for the time component and 100 points for each space component. For the prices (yearly exercise), 5-year option maturity and 15-years underlying maturity, we obtain 572.4052 basis points (respectively, 552.2217bp) for the two-dimensional (2D) (respectively, three-dimensional (3D)) case. GMM-DCKE prices the options accordingly, with a maximum absolute difference of 1.284bp using 10,000 simulations. Both Python-based implementations were done on a Windows i9 machine with 16GB RAM and four cores using Python 3.6 with non-vectorised code. The runtimes are summarised in table A.

GMM-DCKE can be considered as an alternative to other numerical methods for pricing interest rate derivatives, especially for more than two dimensions.

### Multiple underlyings

We illustrate the performance of GMM-DCKE using a five-dimensional Heston model with asset correlation matrix $C_{\mathrm{aa}}$ and asset-variance correlation matrix $C_{\mathrm{av}}$ (compared to a full positive definite matrix $C$) defined, respectively, by:

 $\displaystyle C_{\mathrm{aa}}$ $\displaystyle=\begin{pmatrix}1.0&0.2&0.0&0.5&0.7\\ 0.2&1.0&0.4&0.0&0.1\\ 0.0&0.4&1.0&0.3&0.2\\ 0.5&0.0&0.3&1.0&0.25\\ 0.7&0.1&0.2&0.25&1.0\end{pmatrix}$ $\displaystyle C_{\mathrm{av}}$ $\displaystyle=\operatorname{diag}(-0.7,-0.8,-0.9,-0.8,-0.3)^{\mathrm{T}}$

For the numerical experiments we show the performance of GMM-DCKE on the training set with 10,000 paths as well as using 500 paths for validation and we compare it to the DCKE method (Geng et al 2022). As an example, we consider a rainbow option with payoff given by:

 $\displaystyle h_{t}^{\mathrm{rainbow}}$ $\displaystyle=\bigg{(}\max\bigg{(}\frac{S_{1,t}}{S_{1,t_{0}}},\dots,\frac{S_{d% ,t}}{S_{d,t_{0}}}\bigg{)}-K\bigg{)}^{\!+}$ (10)

We take $S_{u,0}=100$, $r=0.01$, $K=0.9$, $u=1,\dots,5$. For the $d$-dimensional setting we first derive $d$ $\operatorname{GMM}(K)$ representations for the random vectors $(S_{i,T},O_{T},S_{t})$, $i=1,\dots,d$, with $O_{T}$ being the option values at time $T$, $S_{t}\in\mathbb{R}^{d}$ being the asset prices at time $t$ and $S_{i,T}$ being the $i$th asset value, which serves as a control variate. We form the conditional distribution for each dimension on every $\operatorname{GMM}(K)$ component and calculate the conditional delta $\delta_{k,i}$ as well as conditional mean estimates $\mu_{k,i}$ for each $x_{i}$. The control variate is the weighted sum of the individual deltas estimated. For all samples $x_{i}$ we calculate the control variate using $\smash{\sum_{k=1}^{K}\delta_{k,i}(x_{i}-m_{k,i})}$. As can be seen from figure 2, the results of GMM-DCKE and DCKE are close, and this is also observed for other payoffs, such as for basket options.

Unlike DCKE, GMM-DCKE is based fully on the initial training set. To apply DCKE in multiple dimensions, the hedge sensitivities are computed by differentiating the payoff. Thus, DCKE depends on knowing the payoff function, and it is assumed that the corresponding derivatives can be calculated. With the hedge sensitivities calculated in this way, DCKE is not fully model-agnostic. With this additional adjustment it leads to results that are a little more accurate than those of GMM-DCKE. Fitting the full $d$-dimensional distribution can also be done, but it is computationally more intense.

The effect of using different control variates (ie, single assets, the max/min of all assets) is shown in figure 3. As expected, taking the $\max$ as a control variate leads to the best results.

## Stochastic local volatility

For pricing and calibrating stochastic local volatility models, several techniques (ie, particle methods, binning and logarithmic normal approximations) have been considered (see Guyon & Henry-Labordère 2012; Muguruza 2020; van der Stoep et al 2014). We consider the dynamics of the forward, $S_{t}$, given by:

 \left.\begin{aligned} \displaystyle\frac{\mathrm{d}S_{t}}{S_{t}}&\displaystyle% =\lambda(S_{t},t)\psi(V_{t})\mathrm{d}W_{t}\\ \displaystyle\mathrm{d}V_{t}&\displaystyle=\mu(t,V_{t})\mathrm{d}t+\sigma_{V}(% t,V_{t})\mathrm{d}Z_{t}\end{aligned}\right\} (11)

where $S_{0}=s_{0}$, $V_{0}=v_{0}$, $\mu$ and $\sigma_{V}$ are the drift and volatility functions, respectively, of the instantaneous variance, $\lambda\colon\mathbb{R}_{+}\times\mathbb{R}\rightarrow\mathbb{R}_{+}$ is the leverage function and $W$ and $Z$ are one-dimensional correlated Brownian motions. The function $\psi$ represents the stochastic volatility component. For the Heston model we have $\psi(X)=\sqrt{X}$. The SDE (11) can be reframed into a nonlinear equation in the sense of McKean and Vlasov, with the volatility depending on the probability distributions of $S$ and $V$. Showing the existence and uniqueness of the solutions to such equations is an involved mathematical problem, and for certain parameters, eg, for the stochastic volatility component, large values of the volatility of variance, there may not be any solution (see Guyon & Henry-Labordère 2012).

The market standard method for calibrating the model is to use a previously calibrated local volatility, $\sigma_{D}\colon\mathbb{R}^{+}\times\mathbb{R}^{+}\rightarrow\mathbb{R}^{+}$ (Dupire 1994), and a calibrated stochastic volatility model. The leverage function from (11) is derived by Markovian projection, and can be expressed as follows (see Guyon & Henry-Labordère 2012; van der Stoep et al 2014):

 $\lambda^{2}(t,s)=\frac{\sigma_{D}(t,s)^{2}}{\mathbb{E}[\psi(V_{t})^{2}\mid S_{% t}=s]}$ (12)

The conditional expectation appearing in (12) needs to be calculated. This is approximated by GMM-DCKE. For calibration we take $T_{1},T_{2},\dots,T_{N}\in\mathcal{T}$ and consider:

• $T_{1}$: no fit; use the initial value $\psi(V_{t})$.

• $T_{2}$: fit the $\operatorname{GMM}(K)$ model to the $T_{1}$ and $T_{2}$ values and store the $\operatorname{GMM}(K)$ parameters as $\theta_{1}$.

• $T_{n}$: fit the $\operatorname{GMM}(K)$ model to the $T_{n-1}$ and $T_{n}$ values using $\theta_{n-1}$ as the initial values.

Using the most recently fitted parameters, we observed that the convergence can be improved. This relies (as do the other methods) on the input of the local volatility function $\sigma_{D}$. For widely fluctuating $\sigma_{D}$, it may be necessary to fully recalibrate at each step.

Figure 4 shows the accuracy of the method and also considers the version with bagging using only 20 runs with 10% (ie, 1,000) of samples, leading to speed-ups of 2–5 times and increased accuracy. Part (a) of the figure illustrates the application of GMM-DCKE to the rough Bergomi model for several choices of $K$. Part (b) shows the same model but with bagging applied. In our numerical experiments it turned out choosing $K=3$ is reasonable. This is also confirmed by taking the AIC and BIC. We also plotted the 1% and 99% quantiles for Gaussians versus binning, the 0.5%, 1%, 5% and 10% quantiles for the model with bagging and the 99.5%, 99%, 95% and 90% quantiles for the bagging case, respectively, to further illustrate the accuracy, even for the tails.

In terms of runtime we compared our method with the particle method, which is widely applied in financial institutions, and we observed our method performs slightly better when the parameter fitting is done using the previously calculated parameters and if we store the inverses of the matrices used for calculating the control variates (see (1), (2)). The particle method is often applied in conjunction with the Silverman rule, which allows a bandwidth to be picked for the kernel. Distributions that are very different from the Gaussian cannot be handled well with the assumptions made when applying this rule. A more sophisticated choice of the bandwidth (eg, a local bandwidth) may then be necessary. Determining the latter leads to optimisation problems that increase the runtime of the particle method.

For a Heston local volatility model we investigated a stressed parameter for a volatility of variance of 2.5 and observe that the particle and the method of Muguruza (2020) underestimate the smile. The binning method and GMM-DCKE, using $\operatorname{GMM}(3)$, lead to reasonable results (see figure 4(b)).

Finally, we have considered the same example as in Muguruza (2020). Since exactly the same data were not available, we calibrated to the data we observed and plotted a Heston smile that is close to the one shown in Muguruza (2020). We observe that applying GMM-DCKE for calculating the conditional expectations and determining the implied volatility leads to nearly identical results. The largest difference was less than 4.3bp for implied lognormal volatility, with an average error of 2.1bp. Furthermore, since GMM-DCKE does not assume a certain model, it can be applied to the more complex models considered in Muguruza (2020).

## Conclusions and summary

GMM-DCKE is a purely data-driven and model-agnostic method to compute conditional expectations. We applied it to the pricing of (multi-dimensional) Bermudan options and the pricing and calibration of stochastic local volatility models. It also leads to hedging strategies with the realisations at given time points but without the need to know the underlying stochastic model. GMM-DCKE also can also be used for computing control variates for standard Monte Carlo simulation in multiple dimensions, where finding appropriate controls that can be computed fast is often difficult. In contrast to the method in Geng et al (2022), it is independent of the choice of (local) bandwidth for kernel density estimation. GMM-DCKE can be integrated into existing set-ups as an alternative to using other numerical methods. Due to its analytical nature and application of the GMM it only needs a decent amount of training data. Relevant practical examples show that its performance is at least comparable with existing numerical methods and can be enhanced by careful implementation of the method using vectorisation or efficient EM algorithms. Using methods such as bagging improves upon the suggested model. Further ideas – for instance, applying autoencoders with GMM-DCKE in low-dimensional latent space or using different distributions – are the subject of ongoing research.

Jörg Kienitz is an adjunct associate professor at the University of Cape Town and the University of Wuppertal. He owns finciraptor.de and is a partner in the quantitative services unit of Acadia. He is based in Bonn. He thanks Gordon Lee and Thomas McWalter for fruitful discussions, and two anonymous referees who helped to greatly improve the paper.

## References

• Bates D, 1996
Jumps and stochastic volatility, exchange rate processes implicit in Deutsche mark options
Review of Financial Studies 9, pages 69–107
• Bayer C, P Friz and J Gatheral, 2016
Pricing under rough volatility
Quantitative Finance 16(6), pages 887–904
• Bishop CM, 2006
Pattern Recognition and Machine Learning
Springer
• Dempster AP, NM Larid and DB Rubin, 1977
Maximum likelihood from incomplete data via the EM algorithm
Journal of the Royal Statistical Society B 39(1), pages 1–38
• Dupire B, 1994
Pricing with a smile
Risk July, pages 18–20
• Geng Q, J Kienitz, GT Lee and N Nowaczyk, 2022
Dynamically controlled kernel estimation
Risk February, pages 110–115, http://www.risk.net/7921186
• Guyon J and P Henry-Labordère, 2012
Risk January, pages 88–93, http://www.risk.net/2135540
• Halperin I, 2017
QLBS: Q-learner in the Black-Scholes (-Merton) worlds
Preprint, SSRN, December (https://ssrn.com/abstract=3087076)
• Huge B and A Savine, 2020
Differential machine learning: the shape of things to come
Risk October, pages 76–81, http://www.risk.net/7688441
• Kienitz, J, 2021
GMM-DCKE: semi-analytic conditional expectations
Preprint, SSRN, August (https://dx.doi.org/10.2139/ssrn.3902490)
• Longstaff FA and ES Schwartz, 2001
Valuing American options by simulation: a simple least-squares approach
Review of Financial Studies 14(1), pages 113–147
• Muguruza A, 2020
Not so particular about calibration: smile problem resolved
Preprint, SSRN (https://ssrn.com/abstract=3461545)
• van der Stoep AW, LA Grzelak and CW Oosterlee, 2014
The Heston stochastic-local volatility model: efficient Monte Carlo simulation
International Journal of Theoretical and Applied Finance 17(7), article 1450045

Only users who have a paid subscription or are part of a corporate subscription are able to print or copy content.

To access these options, along with all other subscription benefits, please contact [email protected] or view our subscription options here: http://subscriptions.risk.net/subscribe