# Journal of Computational Finance # A review of tree-based approaches to solving forward–backward stochastic differential equations

## Long Teng

#### Need to know

• A regression-tree-based approach is proposed to solve numerically high-dimensional backward stochastic differential equations (BSDEs).
• The approach is reviewed from different perspectives, and error analysis is performed.
• Several numerical experiments including high-dimensional problems are provided to demonstrate accuracy and performance.
• For the applicability of BSDEs in financial problems, the Heston stochastic volatility model and a high-dimensional nonlinear pricing problem are solved with the presented approach via BSDEs.

#### Abstract

In this work, we study ways of solving (decoupled) forward–backward stochastic differential equations numerically using regression trees. Based on general theta-discretization for time integrands, we show how to efficiently use regression-tree-based methods to solve the resulting conditional expectations. Several numerical experiments, including high-dimensional problems, are provided to demonstrate accuracy and performance. To show the applicability of forward–backward stochastic differential equations to financial problems, we apply our tree-based approach to the Heston stochastic volatility model to high-dimensional nonlinear pricing problems.

## 1 Introduction

It is well known that many problems (eg, pricing and hedging) in the field of financial mathematics can be represented in terms of (forward–)backward stochastic differential equations ((F)BSDEs). These make problems easier to solve but usually have no analytical solution (see, for example, Karoui et al 1997a). The general form of (decoupled) FBSDEs reads

 \left.\begin{aligned} \displaystyle\mathrm{d}X_{t}&\displaystyle=a(t,X_{t})% \mathrm{d}t+b(t,X_{t})\mathrm{d}W_{t},\quad X_{0}=x_{0},\\ \displaystyle-\mathrm{d}Y_{t}&\displaystyle=f(t,X_{t},Y_{t},Z_{t})\mathrm{d}t-% Z_{t}\mathrm{d}W_{t},\\ \displaystyle Y_{T}&\displaystyle=\xi=g(X_{T}),\end{aligned}\right\} (1.1)

where $X_{t},a\in\mathbb{R}^{n}$, $b$ is an $n\times d$ matrix, $f(t,X_{t},Y_{t},Z_{t})\colon[0,T]\times\mathbb{R}^{n}\times\mathbb{R}^{m}% \times\mathbb{R}^{m\times d}\to\mathbb{R}^{m}$ is the driver function and $\xi$ is the square-integrable terminal condition.

The existence and uniqueness of solutions to such equations, under the Lipschitz conditions on $f$, $a(t,X_{t})$, $b(t,X_{t})$ and $g$, were proved by Pardoux and Peng (1990). Since then, many works have tried to relax this condition: for example, the uniqueness of the solution was extended under more general assumptions for $f$ in Lepeltier and Martin (1997), but only in the one-dimensional case.

In recent years, many numerical methods have been proposed for coupled and decoupled FBSDEs. For the numerical algorithms with (least-squares) Monte Carlo approaches we refer the reader to Bender and Steiner (2012), Bouchard and Touzi (2004), Gobet et al (2005), Lemor et al (2006) and Zhao et al (2006); the multilevel Monte Carlo method based on Picard approximation can be found in E et al (2019). There exists a connection between BSDEs and partial differential equations (PDEs) (see Karoui et al 1997b; Peng 1991); some numerical schemes with the aid of this connection can be found in, for example, Ma et al (1994). For the deep-learning-based numerical method we refer to E et al (2017). The approach based on the Fourier method for BSDEs is developed in Ruijter and Oosterlee (2015). See also Zhao et al (2010) and Teng et al (2020) for the multi-step scheme.

In Crisan and Manolarakis (2010) an algorithm with the cubature method is proposed to solve FBSDEs. The cubature method is an approximation method for the law of the solution of an SDE that generates a tree on which expectations and conditional expectations are evaluated. By introducing a minimal variance sampling method for the support of the cubature measure, Crisan and Manolarakis address the issue that the tree methods suffer from an exponential increase in the size of the support of the approximating measure. A method relying on an approximation of Brownian motion by a simple random walk is proposed in Ma et al (2002) for BSDEs: that is, the conditional expectations are computed using a tree structure, in which the final condition is allowed to be quite general. In this paper, we show how to efficiently use regression-tree-based approaches to find accurate approximations of FBSDEs (1.1). We apply the general theta-discretization method for the time integrands, and we approximate the resulting conditional expectations using the regression-tree-based approach with sample-splitting techniques. More precisely, we deal with the projection of samples via trees to overcome the curse of dimensionality. Several numerical experiments of different types including 100-dimensional problems are performed to demonstrate our findings.

In the next section we give some notation and definitions, and in Section 3 we introduce the generalized theta method. Section 4 is devoted to how to use the regression-tree-based approaches to approximate conditional expectations. In Section 5 several numerical experiments on different types of FBSDEs, including financial applications, are provided to show the accuracy and applicability for high-dimensional nonlinear problems. Section 6 presents our conclusions.

## 2 Preliminaries

Throughout the paper, we assume that $(\varOmega,\mathcal{F},P;\{\mathcal{F}_{t}\}_{0\leq t\leq T})$ is a complete, filtered probability space. In this space, a standard $d$-dimensional Brownian motion $W_{t}$ with a finite terminal time $T$ is defined, which generates the filtration $\{\mathcal{F}_{t}\}_{0\leq t\leq T}$, that is, $\mathcal{F}_{t}=\sigma\{W_{s},0\leq s\leq t\}$, for FBSDEs. The usual hypotheses should be satisfied. We denote the set of all $\mathcal{F}_{t}$-adapted and square integrable processes in $\mathbb{R}^{d}$ by $L^{2}=L^{2}(0,T;\mathbb{R}^{d})$. A pair of processes $(Y_{t},Z_{t})$ is the solution to the FBSDEs in (1.1) if it is $\mathcal{F}_{t}$-adapted and square integrable and satisfies (1.1) as

 $Y_{t}=\xi+\int_{t}^{T}f(s,(X_{s}),Y_{s},Z_{s})\mathrm{d}s-\int_{t}^{T}Z_{s}% \mathrm{d}W_{s},\quad t\in[0,T],$ (2.1)

where $f(t,\mathbb{X}_{s}):=f(t,(X_{s}),Y_{s},Z_{s})\colon[0,T]~{}(\times\mathbb{R}^{% n})\times\mathbb{R}^{m}\times\mathbb{R}^{m\times d}\to\mathbb{R}^{m}$ is $\mathcal{F}_{t}$-adapted and $\xi=g(X_{T})\colon\mathbb{R}^{n}\to\mathbb{R}^{m}$.

Suppose that the terminal value $Y_{T}$ is of the form $g(X^{t,x}_{T})$, where $X^{t,x}_{T}$ denotes the solution of $\mathrm{d}X_{t}$ in (1.1) starting from $x$ at time $t$. Then the solution $(Y^{t,x}_{t},Z^{t,x}_{t})$ of FBSDEs in (1.1) can be represented (Karoui et al 1997b; Ma and Zhang 2005; Peng 1991) as

 $Y^{t,x}_{t}=u(t,x),\quad Z^{t,x}_{t}=(\nabla u(t,x))b(t,x)\quad\forall t\in[0,% T),$ (2.2)

which is the solution to the semi-linear parabolic PDE of the form

 $\frac{\partial u}{\partial t}+\sum_{i}^{n}a_{i}\partial_{i}u+\frac{1}{2}\sum_{% i,j}^{n}(bb^{\mathrm{T}})_{i,j}\partial^{2}_{i,j}u+f(t,x,u,(\nabla u)b)=0,$ (2.3)

with the terminal condition $u(T,x)=g(x)$. In turn, suppose $(Y,Z)$ is the solution to FBSDEs, $u(t,x)=Y^{t,x}_{t}$ is a viscosity solution to the PDEs.

## 3 Discretization of the forward–backward stochastic differential equation using the theta method

We introduce the time partition for the time interval $[0,T]$:

 $\Delta_{t}=\{t_{i}\mid t_{i}\in[0,T],~{}i=0,1,\dots,N_{T},~{}t_{i} (3.1)

Let $\Delta t_{i}=t_{i+1}-t_{i}$ be the time step, and denote the maximum time step by $\Delta t$. For the FBSDEs, we need to additionally discretize the forward SDE in (1.1):

 $X_{t}=x_{0}+\int_{0}^{t}a(s,X_{s})\mathrm{d}s+\int_{0}^{t}b(s,X_{s})\mathrm{d}% W_{s}.$ (3.2)

Suppose that the forward SDE in (3.2) can already be discretized by a process $X^{\Delta_{t}}_{t_{i}}$ such that

 $E[\max_{t_{i}}|X_{t_{i}}-X^{\Delta_{t}}_{t_{i}}|^{2}]=\mathcal{O}({\Delta_{t}}),$ (3.3)

which means strong mean square convergence of order $\frac{1}{2}$. For the backward process (2.1), the well-known generalized $\theta$-discretization reads

 $\displaystyle Y_{N_{T}}^{\Delta_{t}}=g(X_{N_{T}}^{\Delta_{t}}),\qquad Z_{N_{T}% }^{\Delta_{t}}=g_{x}(X_{N_{T}}^{\Delta_{t}}).$ (3.4)

For $i=N_{T}-1,\dots,0$, we have

 $\displaystyle-E_{i}[Y_{i+1}\Delta W_{i+1}]$ $\displaystyle=\Delta t_{i}(1-\theta_{1})E_{i}[f(t_{i+1},\mathbb{X}_{i+1})% \Delta W_{i+1}]-\Delta t_{i}\theta_{2}Z_{i}$ $\displaystyle\qquad-\Delta t_{i}(1-\theta_{2})E_{i}[Z_{i+1}]+R_{\theta}^{Z_{i}},$ (3.5) $\displaystyle Y_{i}$ $\displaystyle=E_{i}[Y_{i+1}]+\Delta{t_{i}}\theta_{3}f(t_{i},\mathbb{X}_{i})$ $\displaystyle\qquad+\Delta{t_{i}}(1-\theta_{3})E_{i}[f(t_{i+1},\mathbb{X}_{i+1% })]+R_{\theta}^{Y_{i}},$ (3.6)

where $R_{\theta}^{Y_{i}}$ and $R_{\theta}^{Z_{i}}$ denote the discretization errors (see, for example, Li et al 2017; Zhao et al 2012, 2014). Obviously, by choosing the different values for $\theta_{l}\in[0,1]$, $l=1,2,3$, we can obtain different schemes. Note that $\theta_{2}$ cannot be set to zero.

## 4 Computation of conditional expectations with the tree-based approach

In this section we show how to innovatively use the tree-based approach (Martinez and Martinez 2007) to compute the conditional expectations included in (3.5) and (3.6). More precisely, the conditional expectation $E[\mathcal{Y}_{i+1}\mid X_{i}]=\eta(X_{i})$, $X_{i}\in\mathbb{R}^{m}$, is represented by a tree, which is constructed with certain samples of $X_{i}$ and $\mathcal{Y}_{i+1}$. The tree will be reused to very efficiently generate samples of $\eta(\cdot)$ by directly taking a group of arbitrary samples of $X_{i}$ (generated from $X_{i-1}$) as one input for further iterations. That is to say, we deal with the projection of samples via trees using (3.5) and (3.6); the curse of dimensionality can thus be overcome by, for example, considering the group of $X_{i}\in\mathbb{R}^{m}$ generated from $X_{i-1}$ entirely as one input for the tree.

### 4.1 Nonparametric regression

Suppose that the model in nonparametric regression reads

 $Y=\eta(X)+\varepsilon,$ (4.1)

where $\varepsilon$ has a zero expectation and a constant variance, and it is independent of $X$. It holds that

 $E[Y\mid X=x]=\eta(x).$ (4.2)

The goal in regression is to find the estimator of this function, $\hat{\eta}(x)$. By nonparametric regression, we are not assuming a particular form for $\eta$. Instead, $\hat{\eta}$ is represented in a regression tree. Given a set of samples $(\hat{x}_{\mathcal{M}},\hat{y}_{\mathcal{M}})$, $\mathcal{M}=1,\dots,M$, for $(X,Y)$, where $X$ denotes a predictor variable and $Y$ presents the corresponding response variable, we construct a regression tree that can then be used to compute $E[Y\mid X=x]$ for an arbitrary $x$, whose value is not necessarily equal to one of the samples $\hat{x}_{\mathcal{M}}$.

As an example, we specify the procedure for (3.5) $(\theta_{1}=\frac{1}{2},\theta_{2}=1,\theta_{3}=\frac{1}{2})$. Assume that $(X_{i}^{\Delta_{t}})_{t_{i}\in\Delta_{t}}$ is Markovian, so (3.5) can thus be rewritten as

 $\displaystyle Z_{i}^{\Delta_{t}}=E\bigg{[}\frac{1}{\Delta t_{i}}Y_{i+1}^{% \Delta_{t}}\Delta W_{i+1}$ $\displaystyle+\tfrac{1}{2}f(t_{i+1},\mathbb{X}^{\Delta t}_{i+1})\Delta W_{i+1}% \biggm{|}X_{i}^{\Delta_{t}}\bigg{]},\quad{}i=N_{T}-1,\dots,0.$ (4.3)

There also exist deterministic functions $z_{i}^{\Delta_{t}}(x)$ such that

 $Z_{i}^{\Delta_{t}}=z_{i}^{\Delta_{t}}(X_{i}^{\Delta_{t}}).$ (4.4)

Starting from time $T$, we construct the regression tree $\hat{T}_{z}$ for the conditional expectation in (4.3) using the following samples:

 $\bigg{(}\hat{x}_{N_{T}-1,\mathcal{M}},\frac{1}{\Delta t_{N_{T}-1}}\hat{y}_{N_{% T},\mathcal{M}}\Delta\hat{w}_{N_{T},\mathcal{M}}+\tfrac{1}{2}\hat{f}_{N_{T},% \mathcal{M}}\Delta\hat{w}_{N_{T},\mathcal{M}}\bigg{)}.$

Thereby, the function

 $z_{N_{T}-1}^{\Delta_{t}}(x)=E\bigg{[}\frac{1}{\Delta t_{N_{T}-1}}Y_{N_{T}}^{% \Delta_{t}}\Delta W_{N_{T}}+\tfrac{1}{2}f(t_{N_{T}},\mathbb{X}^{\Delta t}_{N_{% T}})\Delta W_{N_{T}}\biggm{|}X_{N_{T}-1}^{\Delta_{t}}=x\bigg{]}$ (4.5)

is estimated and presented by a regression tree. By applying (4.5) to the samples $\hat{x}_{N_{T}-1,\mathcal{M}}$ we can directly obtain the samples $\hat{z}_{N_{T}-1,\mathcal{M}}$ of the random variable $Z_{N_{T}-1}^{\Delta_{t}}$ for $\mathcal{M}=1,\dots,M$. Recursively, backward in time, these samples $\hat{z}_{N_{T}-1,\mathcal{M}}$ will be used to generate samples $\hat{z}_{N_{T}-2,\mathcal{M}}$ of the random variables $Z_{N_{T}-2}^{\Delta_{t}}$ at time $t_{N_{T}-2}$. At the initial time $t=0$, we fix an initial value $x_{0}$ for $X_{t}$, and no samples are needed. Using the regression trees constructed at time $t_{1}$ we obtain the solution $Z_{0}^{\Delta_{t}}=z_{0}^{\Delta_{t}}(x_{0})$.

### 4.2 Binary regression tree

We review the procedure in Breiman et al (1984) and Martinez and Martinez (2007) for constructing the best regression tree based on the given samples. Basically, we need to grow, prune and finally select the tree. We use the following notation.

• $(\hat{x}_{\mathcal{M}},\hat{y}_{\mathcal{M}})$ denote samples, namely observed data.

• $\hat{t}$ is a node in the tree $\hat{T}$, and $\hat{t}_{\mathrm{L}}$ and $\hat{t}_{\mathrm{R}}$ are the left and right child nodes.

• $\mathcal{T}$ is the set of terminal nodes in the tree $\hat{T}$ with the number $|\mathcal{T}|$.

• $n(\hat{t})$ represents the number of samples in node $\hat{t}$.

• $\bar{y}(\hat{t})$ is the average of samples falling into node $\hat{t}$, namely the predicted response.

#### 4.2.1 Growing a tree

We define the predicted response as the average value of the samples that are contained in a node $\hat{t}$, ie,

 $\bar{y}(\hat{t})=\frac{1}{n(\hat{t})}\sum_{\hat{x}_{\mathcal{M}}\in\hat{t}}% \hat{y}_{\mathcal{M}}.$ (4.6)

Obviously, the squared error in the node $\hat{t}$ reads

 $R(\hat{t})=\frac{1}{n(\hat{t})}\sum_{\hat{x}_{\mathcal{M}}\in\hat{t}}(\hat{y}_% {\mathcal{M}}-\bar{y}(\hat{t}))^{2}.$ (4.7)

The mean squared error for the tree $\hat{T}$ is defined as

 $R(\hat{T})=\sum_{\hat{t}\in\mathcal{T}}R(\hat{t})=\frac{1}{n(\hat{t})}\sum_{% \hat{t}\in\mathcal{T}}\sum_{\hat{x}_{\mathcal{M}}\in\hat{t}}(\hat{y}_{\mathcal% {M}}-\bar{y}(\hat{t}))^{2}.$ (4.8)

Basically, the tree is constructed by partitioning the space for the samples $\hat{x}$ using a sequence of binary splits. For a split $s$ and node $\hat{t}$, the change in the mean squared error can be thus calculated as

 $\Delta R(s,\hat{t})=R(\hat{t})-R(\hat{t}_{\mathrm{L}})-R(\hat{t}_{\mathrm{R}}).$ (4.9)

The regression tree is thus obtained by iteratively splitting nodes with $s$, which yields the largest $\Delta R(s,\hat{t})$. Thereby, the decrease in $R(\hat{T})$ is maximized. Obviously, the optimal stopping criterion is that all responses in a terminal node are the same, but that is not really realistic.

#### 4.2.2 Pruning a tree

When using the optimal stopping criterion, each terminal node contains only one response, so the error $R(\hat{t})$, and thus $R(\hat{T})$, will be zero. However, first of all, this is unrealistic. Second, the samples are thereby overfitted and the regression tree will thus not generalize well to new observed samples. Breiman et al (1984) suggest growing an overly large regression tree $\hat{T}_{\text{max}}$ and then finding a nested sequence of subtrees by successively pruning the branches of the tree. This procedure is called pruning a tree. We define an error-complexity measure as

 $R_{\alpha}(\hat{T})=R(\hat{T})+\alpha|\mathcal{T}|,\quad\alpha\geq 0,$ (4.10)

where $\alpha$ represents the complexity cost per terminal node. The error complexity should be minimized by looking for trees. Let $\hat{T}_{\text{max}}$ be the overly large tree, in which each terminal node contains only one response. Thus, we have $R_{\alpha}(\hat{T}_{\text{max}})=\alpha|\mathcal{T}|$, which indicates a high cost of complexity, while the error is small. To minimize the cost we delete the branches with the weakest link $\hat{t}_{k}^{*}$ in tree $\hat{T}_{k}$, which is defined as

 $g_{k}(\hat{t}_{k}^{*})=\text{min}_{\hat{t}}\{g_{k}(\hat{t})\},\quad g_{k}(\hat% {t})=\frac{R(\hat{t})-R(\hat{T}_{k\hat{t}})}{|\mathcal{T}_{k\hat{t}}|-1},$ (4.11)

where $\hat{T}_{k\hat{t}}$ is the branch of $\hat{T}_{\hat{t}}$ corresponding to the internal node $\hat{t}$ of subtree $\hat{T}_{k}$. We prune the branch defined by the node $\hat{t}_{k}^{*}$,

 $\hat{T}_{k+1}=\hat{T}_{k}-\hat{T}_{\hat{t}_{k}^{*}},$ (4.12)

and thus we obtain a finite sequence of subtrees with fewer terminal nodes and decreasing complexity until the root node, which is

 $\hat{T}_{\text{max}}>\hat{T}_{1}>\hat{T}_{2}>\cdots>\hat{T}_{K}=\text{root}.$ (4.13)

On the other hand, we set

 $\alpha_{k+1}=g_{k}(\hat{t}_{k}^{*}),$ (4.14)

and we thus obtain an increasing sequence of values for the complexity parameter $\alpha$:

 $0=\alpha_{1}<\cdots<\alpha_{k}<\alpha_{k+1}<\cdots<\alpha_{K}.$ (4.15)

By observing both sequences (4.13) and (4.15), it is not difficult to see that for $k\geq 1$ the tree $\hat{T}_{k}$ is the one that has the minimal cost complexity for $\alpha_{k}\leq\alpha<\alpha_{k+1}$.

#### 4.2.3 Selecting a tree

We have to make a trade-off between both the error criteria and the complexity, ie, we need to choose the best tree from the sequence of pruned subtrees. To do this, we can use independent test samples and cross-validation. As an example, we illustrate the independent test sample method; for cross-validation we refer the reader to Breiman et al (1984) and Martinez and Martinez (2007). Clearly, we need honest estimates of the true error $R^{*}(\hat{T})$ to select the right size of tree. To obtain these estimates, we should use samples that were not used to construct the tree to estimate the error. Suppose we have a set of samples $L=(\hat{x}_{\mathcal{M}},\hat{y}_{\mathcal{M}})$, which can be randomly divided into two subsets $L_{1}$ and $L_{2}$. We use the set $L_{1}$ to grow a large tree and to obtain the sequence of pruned subtrees. Thus, the samples in $L_{2}$ are used to evaluate the performance of each subtree by calculating the error between the real response and predicted response. We denote by $\bar{y}_{k}(\hat{x})$ the predicted response using samples $\hat{x}$ for the tree $\hat{T}_{k}$. Then, the estimated error is

 $\hat{R}(\hat{T}_{k})=\frac{1}{n_{2}}\sum_{(\hat{x}_{i},\hat{y}_{i})\in L_{2}}(% \hat{y}_{i}-\bar{y}_{k}(\hat{x}_{i}))^{2},$ (4.16)

where $n_{2}$ is the number of samples in $L_{2}$. This estimated error will be calculated for all subtrees. We need to pick the subtree that has the smallest number of nodes but still keeps the accuracy of the tree with the smallest error, eg, $\hat{T}_{0}$ with an error $\hat{R}_{\text{min}}(\hat{T}_{0})$. To do this, we define the standard error for this estimate as (Breiman et al 1984)

 $\mathrm{SE}(\hat{R}_{\text{min}}(\hat{T}_{0})):=\frac{1}{\sqrt{n_{2}}}\sqrt{% \frac{1}{n_{2}}\sum_{i=1}^{n_{2}}(\hat{y}_{i}-\bar{y}(\hat{x_{i}}))^{4}-(\hat{% R}_{\text{min}}(\hat{T}_{0}))^{2}},$ (4.17)

and then we choose the smallest tree $\hat{T}_{k}^{*}$ such that

 $\hat{R}(\hat{T}_{k}^{*})\leq\hat{R}_{\text{min}}(\hat{T}_{0})+\mathrm{SE}(\hat% {R}_{\text{min}}(\hat{T}_{0})).$ (4.18)

$\hat{T}_{k}^{*}$ is a tree with minimal complexity cost but has equivalent accuracy to the tree with the minimum error.

### 4.3 Practical applications

Due to the linearity of conditional expectation, we construct the trees for all possible combinations of the conditional expectations. We denote the tree’s regression error by $R_{\text{tr}}$ and the error of the used iterative method by $R_{\text{impl}}$, and we reformulate the scheme (3.4)–(3.6) by combining conditional expectations as

 $\displaystyle\hat{y}_{N_{T},\mathcal{M}}$ $\displaystyle=g(\hat{x}_{N_{T},\mathcal{M}}),\qquad\hat{z}_{N_{T},\mathcal{M}}% =g_{x}(\hat{x}_{N_{T},\mathcal{M}}).$ (4.19)

For $i=N_{T}-1,\dots,0$, $\mathcal{M}=1,\dots,M$, we have

 $\displaystyle z_{i}^{\Delta t}(\hat{x}_{i,\mathcal{M}})$ $\displaystyle=E^{\hat{x}_{i,\mathcal{M}}}_{i}\bigg{[}\frac{1}{\Delta t_{i}}Y_{% i+1}\Delta W_{i+1}+\frac{1}{2}f(t_{i+1},\mathbb{X}_{i+1})\Delta W_{i+1}\bigg{]% }+R_{\text{tr}}^{Z_{i}},$ (4.20) $\displaystyle y_{i}^{\Delta t}(\hat{x}_{i,\mathcal{M}})$ $\displaystyle=E^{\hat{x}_{i,\mathcal{M}}}_{i}\bigg{[}Y_{i+1}+\frac{\Delta{t_{i% }}}{2}f(t_{i+1},\mathbb{X}_{i+1})\bigg{]}+\frac{\Delta{t_{i}}}{2}\hat{f}_{i,% \mathcal{M}}+R^{Y_{i}}_{\text{impl}}+R_{\text{tr}}^{Y_{i}},$ (4.21)

where $E^{\hat{x}_{i,\mathcal{M}}}_{i}[\mathcal{Y}]$ denotes the calculated conditional expectation $E[\mathcal{Y}\mid X=\hat{x}_{i,\mathcal{M}}]$ using the constructed regression tree with the samples of $\mathcal{Y}$. For example, using samples of the predictor variable $X_{i}$ (ie, $\hat{x}_{i,\mathcal{M}}$) and samples of the response variable

 $\frac{1}{\Delta t_{i}}Y_{i+1}\Delta W_{i+1}+\tfrac{1}{2}f(t_{i+1},\mathbb{X}_{% i+1})\Delta W_{i+1}$

(ie, $(1/\Delta t_{i})\hat{y}_{i+1,\mathcal{M}}\Delta\hat{w}_{i+1,\mathcal{M}}+(1/2)% \hat{f}_{i+1,\mathcal{M}}\Delta\hat{w}_{i+1,\mathcal{M}}$), we construct a regression tree. Then,

 $E^{\hat{x}_{i,\mathcal{M}}}_{i}\bigg{[}\frac{1}{\Delta t_{i}}Y_{i+1}\Delta W_{% i+1}+\tfrac{1}{2}f(t_{i+1},\mathbb{X}_{i+1})\Delta W_{i+1}\bigg{]}$

means the computed value of

 $E\bigg{[}\frac{1}{\Delta t_{i}}Y_{i+1}\Delta W_{i+1}+\tfrac{1}{2}f(t_{i+1},% \mathbb{X}_{i+1})\Delta W_{i+1}\biggm{|}X=\hat{x}_{i,\mathcal{M}}\bigg{]}$

using that constructed tree. This is to say that $\hat{y}_{i,\mathcal{M}}$ and $\hat{z}_{i,\mathcal{M}}$ are computed based on the tree at the $i$th step, which depends on $\hat{y}_{i+1,\mathcal{M}}$ and $\hat{z}_{i+1,\mathcal{M}}$. Let

 $\displaystyle\hat{\mathcal{Z}}_{i+1,\mathcal{M}}$ $\displaystyle:=\frac{1}{\Delta t_{i}}\hat{y}_{i+1,\mathcal{M}}\Delta\hat{w}_{i% +1,\mathcal{M}}+\tfrac{1}{2}\hat{f}_{i+1,\mathcal{M}}\Delta\hat{w}_{i+1,% \mathcal{M}},$ $\displaystyle\hat{\mathcal{Y}}_{i+1,\mathcal{M}}$ $\displaystyle:=\hat{y}_{i+1,\mathcal{M}}+\tfrac{1}{2}\hat{f}_{i+1,\mathcal{M}}.$

Thus, we define

 $R_{\text{tr}}^{Z_{i}}=\frac{1}{M}\sum_{\mathcal{M}=1}^{M}(\hat{z}_{i,\mathcal{% M}}-z_{i}^{\Delta_{t}}(\hat{x}_{i,\mathcal{M}}))^{2}$

and

 $R_{\text{tr}}^{Y_{i}}=\frac{1}{M}\sum_{\mathcal{M}=1}^{M}(\hat{y}_{i,\mathcal{% M}}-y_{i}^{\Delta_{t}}(\hat{x}_{i,\mathcal{M}}))^{2}$

with

 $z_{i}^{\Delta_{t}}(\hat{x}_{i,\mathcal{M}})=\hat{\mathcal{Z}}_{i+1,\mathcal{M}% }-\varepsilon_{i,\mathcal{M}}^{z}\quad\text{and}\quad y_{i}^{\Delta_{t}}(\hat{% x}_{i,\mathcal{M}})=\hat{\mathcal{Y}}_{i+1,\mathcal{M}}-\varepsilon_{i,% \mathcal{M}}^{y},$

while $\smash{\varepsilon_{i}^{z}}$ and $\smash{\varepsilon_{i}^{y}}$ have zero mean and constant variance $\smash{\operatorname{Var}_{i}^{z}}$ and $\smash{\operatorname{Var}_{i}^{y}}$, respectively. $\smash{R_{\text{tr}}^{Z_{i}}}$ and $\smash{R_{\text{tr}}^{Y_{i}}}$ can thus be reformulated as

 $\displaystyle R_{\text{tr}}^{Z_{i}}$ $\displaystyle=\frac{1}{M}\sum_{\mathcal{M}=1}^{M}(\varepsilon_{i,\mathcal{M}}^% {z}-(\hat{\mathcal{Z}}_{i+1,\mathcal{M}}-\hat{z}_{i,\mathcal{M}}))^{2}$ $\displaystyle\leq 2(\operatorname{Var}_{i}^{z}+\hat{R}_{\text{min}}(\hat{T}^{Z% _{i}})),$ (4.22)

and

 $\displaystyle R_{\text{tr}}^{Y_{i}}$ $\displaystyle=\frac{1}{M}\sum_{\mathcal{M}=1}^{M}(\varepsilon_{i,\mathcal{M}}^% {y}-(\hat{\mathcal{Y}}_{i+1,\mathcal{M}}-\hat{y}_{i,\mathcal{M}}))^{2}$ $\displaystyle\leq 2(\operatorname{Var}_{i}^{y}+\hat{R}_{\text{min}}(\hat{T}^{Y% _{i}})),$ (4.23)

where $\hat{R}_{\text{min}}(\hat{T}^{Z_{i}})$ and $\hat{R}_{\text{min}}(\hat{T}^{Y_{i}})$ are defined via (4.16). $\hat{T}^{Z_{i}}$ and $\hat{T}^{Y_{i}}$ denote the trees with the smallest error at the $i$th step. Note that (4.17) and (4.18) are used just to find the tree with the smallest error in (4.16), which has the smallest number of nodes.

Theoretically, the regression error can be neglected by choosing sufficiently large $M$. However, the tree-based approach is computationally not that efficient for a reasonably high value of $M$. Due to the fact that we are only interested in the solutions of $Y_{0}$ and $Z_{0}$ conditional on a known present value of $X_{0}$, we propose to split a large sample into a few groups of smaller samples at $t_{N_{T}}$ and project samples for $t_{N}\to t_{1}$ in different groups. The main reason why we do this is that multiple tree-based computations for a small sample are still more efficient than one computation for a large number of samples. The computation can be done rapidly in the case of a constant predictor, and the quality of approximations for $Y^{\Delta_{t}}_{0}$ and $Z^{\Delta_{t}}_{0}$ relies directly on the samples of $Y^{\Delta_{t}}_{1}$ and $Z^{\Delta_{t}}_{1}$. Therefore, for the step $t_{1}\to t_{0}$, we combine the samples of $Y^{\Delta_{t}}_{1}$ and $Z^{\Delta_{t}}_{1}$ from all groups, which are used as our samples of response variables. We also use $W_{0}=0$ or $X_{0}=x_{0}$ as the predictor variable. Our numerical results show that the splitting error of the samples’ projection from $t_{N_{T}}\to t_{1}$ could be neglected. As an example, we propose the following fully discrete scheme by choosing $\theta_{1}=\frac{1}{2}$, $\theta_{2}=1$ and $\theta_{3}=\frac{1}{2}$.

1. (1)

Generate $M$ samples and split them into $G$ different groups, where the sample number in each group is $M_{g}=M/G$.

2. (2)

For each group, namely $\mathcal{M}=1,\dots,M_{g}$, compute

 $\displaystyle\hat{y}_{N_{T},\mathcal{M}}=g(\hat{x}_{N_{T},\mathcal{M}}),\qquad% \hat{z}_{N_{T},\mathcal{M}}=g_{x}(\hat{x}_{N_{T},\mathcal{M}}).$

For $i=N_{T}-1,\dots,1$, $\mathcal{M}=1,\dots,M_{g}$,

 $\displaystyle\hat{z}_{i,\mathcal{M}}$ $\displaystyle=E^{\hat{x}_{i,\mathcal{M}}}_{i}\bigg{[}\frac{1}{\Delta t_{i}}Y_{% i+1}^{\Delta_{t}}\Delta W_{i+1}+\tfrac{1}{2}f(t_{i+1},\mathbb{X}^{\Delta_{t}}_% {i+1})\Delta W_{i+1}\bigg{]},$ $\displaystyle\hat{y}_{i,\mathcal{M}}$ $\displaystyle=E^{\hat{x}_{i,\mathcal{M}}}_{i}\bigg{[}Y_{i+1}^{\Delta_{t}}+% \frac{\Delta{t_{i}}}{2}f(t_{i+1},\mathbb{X}^{\Delta_{t}}_{i+1})\bigg{]}+\frac{% \Delta{t_{i}}}{2}\hat{f}_{i,\mathcal{M}}.$
3. (3)

Collect all the samples of $(\hat{z}_{1,\mathcal{M}},\hat{y}_{1,\mathcal{M}})$ for $\mathcal{M}=1,\dots,M$ and use all these samples to compute

 $\displaystyle Z_{0}^{\Delta_{t}}$ $\displaystyle=E^{x_{0}}_{0}\bigg{[}\frac{1}{\Delta t_{0}}Y_{1}^{\Delta_{t}}% \Delta W_{1}+\tfrac{1}{2}f(t_{1},\mathbb{X}^{\Delta_{t}}_{1})\Delta W_{1}\bigg% {]},$ $\displaystyle Y_{0}^{\Delta_{t}}$ $\displaystyle=E^{x_{0}}_{0}\bigg{[}Y_{1}^{\Delta_{t}}+\frac{\Delta{t_{0}}}{2}f% (t_{1},\mathbb{X}^{\Delta_{t}}_{1})\bigg{]}+\frac{\Delta{t_{0}}}{2}\hat{f}_{0,% \mathcal{M}}.$

### 4.4 Error estimates

We perform the error analysis for the scheme $\theta_{1}=\frac{1}{2}$, $\theta_{2}=1$ and $\theta_{3}=\frac{1}{2}$; a similar analysis can be done for other choices of $\theta_{l}$. Supposing that the errors of the iterative method can be neglected by choosing a sufficiently high number of Picard iterations, we consider the initial discretization and regression errors. The errors due to the time-discretization are given by

 $\displaystyle\varepsilon^{Y_{i},\theta}$ $\displaystyle:=Y_{i}-Y_{i}^{\Delta_{t}},$ $\displaystyle\varepsilon^{Z_{i},\theta}$ $\displaystyle:=Z_{i}-Z_{i}^{\Delta_{t}},$ $\displaystyle\varepsilon^{f_{i},\theta}$ $\displaystyle:=f(t_{i},\mathbb{X}_{i})-f(t_{i},\mathbb{X}_{i}^{\Delta_{t}}).$

Analogously to the deterministic functions $z_{i}^{\Delta_{t}}$ given in (4.4), we define the deterministic functions $y_{i}^{\Delta_{t}}$ as

 $Y_{i}^{\Delta_{t}}=y_{i}^{\Delta_{t}}(X_{i}^{\Delta_{t}}).$

These functions are approximated by the regression trees, resulting in the approximations $\hat{y}_{i}^{\Delta_{t}},\hat{z}_{i}^{\Delta_{t}}$ with

 $\hat{Y}_{i}^{\Delta_{t}}=\hat{y}_{i}^{\Delta_{t}}(X_{i}^{\Delta_{t}})\quad% \text{and}\quad\hat{Z}_{i}^{\Delta_{t}}=\hat{z}_{i}^{\Delta_{t}}(X_{i}^{\Delta% _{t}}).$

Thus, we denote the global errors by

 $\displaystyle\varepsilon^{Y_{i}}$ $\displaystyle:=Y_{i}-\hat{Y}_{i}^{\Delta_{t}},$ (4.24) $\displaystyle\varepsilon^{Z_{i}}$ $\displaystyle:=Z_{i}-\hat{Z}_{i}^{\Delta_{t}},$ (4.25) $\displaystyle\varepsilon^{f_{i}}$ $\displaystyle:=f(t_{i},\mathbb{X}_{i})-f(t_{i},\hat{\mathbb{X}}_{i}^{\Delta_{t% }}).$ (4.26)
###### Assumption 4.1.

Suppose that $X_{0}$ is $\mathcal{F}_{0}$-measurable with $E[|X_{0}|^{2}]<\infty$, and that $a$ and $b$ are $L^{2}$-measurable in $(t,x)\in[0,T]\times\mathbb{R}^{d}$, as well as linear-growth-bounded and uniformly Lipschitz continuous, ie, there exist positive constants $K$ and $L$ such that

 $\displaystyle|a(t,x)|^{2}$ $\displaystyle\leq K(1+|x|^{2}),$ $\displaystyle|b(t,x)|^{2}$ $\displaystyle\leq K(1+|x|^{2}),$ $\displaystyle|a(t,x)-a(t,y)|$ $\displaystyle\leq L|x-y|,$ $\displaystyle|b(t,x)-b(t,y)|$ $\displaystyle\leq L|x-y|,$

with $\smash{x,y\in\mathbb{R}^{d}}$.

Let $\smash{C_{b}^{l,k,k,k}}$ be the set of continuously differentiable functions $f\colon[0,T]\times\smash{\mathbb{R}^{n}}\times\smash{\mathbb{R}^{m}}\times% \smash{\mathbb{R}^{m}\times d}\to\smash{\mathbb{R}^{m}}$ with uniformly bounded partial derivatives $\smash{\partial_{t}^{l_{1}}f}$ for $\frac{1}{2}\leq l_{1}\leq l$ and $\partial_{x}^{k_{1}}\partial_{y}^{k_{2}}\partial_{z}^{k_{3}}f$ for $1\leq k_{1}+k_{2}+k_{3}\leq k$; $\smash{C_{b}^{l,k,k}}$ and $\smash{C_{b}^{l,k}}$ can be defined analogously, and $\smash{C_{b}^{k}}$ is the set of functions $\smash{g\colon\mathbb{R}^{d}\to\mathbb{R}^{m}}$ with uniformly bounded partial derivatives $\partial_{x}^{k_{1}}g$ for $1\leq k_{1}\leq k$.

We give some remarks concerning related results on the one-step scheme.

• Under Assumption 4.1, if $\smash{f\in C_{b}^{2,4,4,4}}$, $g\in C_{b}^{4+\alpha}$ for some $\alpha\in(0,1)$, $a$ and $b$ are bounded and $a,b\in C_{b}^{2,4}$, then the absolute values of the local errors $R_{\theta}^{Y_{i}}$ and $R_{\theta}^{Z_{i}}$ in (3.6) and (3.5) can be bounded by $C(\Delta t_{i})^{3}$ when $\theta_{l}=\frac{1}{2}$ and $l=1,2,3$ and by $C(\Delta t_{i})^{2}$ when $\theta_{1}=\frac{1}{2}$, $\theta_{2}=1$ and $\theta_{3}=\frac{1}{2}$, where $C$ is a constant that can depend on $T$, $x_{0}$ and the bounds of $a$, $b$, $f$ and $g$ in (1.1) (see, for example, Li et al 2017; Zhao et al 2009, 2012, 2013, 2014).

• For convenience of notation we can omit the dependency of local and global errors on the state of the FBSDEs and the discretization errors of $\mathrm{d}X_{t}$, ie, we assume $X_{i}=X_{i}^{\Delta_{t}}$.

• For the implicit schemes we will apply Picard iterations that converge due to the Lipschitz assumptions on the driver and for any initial guess when $\Delta t_{i}$ is small enough. In the following analysis, we consider the equidistant time discretization $\Delta t$.

For the $Z$-component $(0\leq i\leq N_{T}-1)$ we have (see (4.20))

 $\varepsilon^{Z_{i}}=E_{i}^{x_{i}}\bigg{[}\frac{1}{\Delta t}\varepsilon^{Y_{i+1% }}\Delta W_{i+1}+\tfrac{1}{2}\varepsilon^{f_{i+1}}\Delta W_{i+1}\bigg{]}+\frac% {R_{\theta}^{Z_{i}}}{\Delta t}+R_{\text{tr}}^{Z_{i}},$ (4.27)

where $\varepsilon^{f_{i+1}}$ can be bounded using Lipschitz continuity of $f$ by

 $\displaystyle E_{i}^{x_{i}}[|\varepsilon^{f_{i+1}}|^{2}]$ $\displaystyle\leq E_{i}^{x_{i}}[|L(|\varepsilon^{Y_{i+1}}|+|\varepsilon^{Z_{i+% 1}}|)|^{2}]$ $\displaystyle\leq 2L^{2}E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^{2}+|\varepsilon% ^{Z_{i+1}}|^{2}],$ (4.28)

with Lipschitz constant $L$. And it holds that

 $\displaystyle|E_{i}^{x_{i}}[\varepsilon^{Y_{i+1}}\Delta W_{i+1}]|^{2}$ $\displaystyle=|E_{i}^{x_{i}}[(\varepsilon^{Y_{i+1}}-E_{i}^{x_{i}}[\varepsilon^% {Y_{i+1}}])\Delta W_{i+1}]|^{2}$ $\displaystyle\leq\Delta t(E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^{2}]-|E_{i}^{x% _{i}}[\varepsilon^{Y_{i+1}}]|^{2}).$ (4.29)

Consequently, we calculate

 $\displaystyle(\Delta t)^{2}|\varepsilon^{Z_{i}}|^{2}$ $\displaystyle\leq 2\Delta t(E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^{2}]-|E_{i}^% {x_{i}}[\varepsilon^{Y_{i+1}}]|^{2})$ $\displaystyle\qquad+3L^{2}(\Delta t)^{3}E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^% {2}+|\varepsilon^{Z_{i+1}}|^{2}]$ $\displaystyle\qquad+6|R_{\theta}^{Z_{i}}|^{2}+6(\Delta t)^{2}|R_{\text{tr}}^{Z% _{i}}|^{2},$ (4.30)

where Hölder’s inequality is used.

For the $Y$-component in the implicit scheme we have

 $\varepsilon^{Y_{i}}=E^{x_{i}}_{i}\bigg{[}\varepsilon^{Y_{i+1}}+\frac{\Delta{t}% }{2}\varepsilon^{f_{i+1}}\bigg{]}+\frac{\Delta{t}}{2}\varepsilon^{f_{i}}+R^{Y_% {i}}_{\theta}+R_{\mathrm{tr}}^{Y_{i}}.$ (4.31)

This error can be bounded by

 $\displaystyle|\varepsilon^{Y_{i}}|$ $\displaystyle\leq|E^{x_{i}}_{i}[\varepsilon^{Y_{i+1}}]|+\frac{\Delta tL}{2}(|% \varepsilon^{Y_{i}}|+|\varepsilon^{Z_{i}}|)$ $\displaystyle\qquad+\frac{\Delta tL}{2}E^{x_{i}}_{i}[|\varepsilon^{Y_{i+1}}|+|% \varepsilon^{Z_{i+1}}|]+|R_{\theta}^{Y_{i}}|+|R_{\mathrm{tr}}^{Y_{i}}|.$ (4.32)

By using the inequality $(a+b)^{2}\leq a^{2}+b^{2}+\gamma\Delta ta^{2}+(b^{2}/\gamma\Delta t)$ we calculate

 $\displaystyle|\varepsilon^{Y_{i}}|^{2}$ $\displaystyle\leq(1+\gamma\Delta t)|E^{x_{i}}_{i}[\varepsilon^{Y_{i+1}}]|^{2}+% \frac{3(\Delta tL)^{2}}{2}(|\varepsilon^{Y_{i}}|^{2}+|\varepsilon^{Z_{i}}|^{2})$ $\displaystyle\quad+\frac{3(\Delta tL)^{2}}{2}(E^{x_{i}}_{i}[|\varepsilon^{Y_{i% +1}}|^{2}]+E^{x_{i}}_{i}[|\varepsilon^{Z_{i+1}}|^{2}])+6|R_{\theta}^{Y_{i}}|^{% 2}+6|R_{\text{tr}}^{Y_{i}}|^{2}$ $\displaystyle\quad+\frac{1}{\gamma}\bigg{(}\frac{3\Delta tL^{2}}{2}(|% \varepsilon^{Y_{i}}|^{2}+|\varepsilon^{Z_{i}}|^{2})+\frac{3\Delta tL^{2}}{2}(E% ^{x_{i}}_{i}[|\varepsilon^{Y_{i+1}}|^{2}]+E^{x_{i}}_{i}[|\varepsilon^{Z_{i+1}}% |^{2}])+\frac{6|R_{\theta}^{Y_{i}}|^{2}}{\Delta t}+\frac{6|R_{\text{tr}}^{Y_{i% }}|^{2}}{\Delta t}\bigg{)}.$ (4.33)
###### Theorem 4.2.

Under Assumption 4.1, if $f\in C_{b}^{2,4,4,4}$, $g\in C_{b}^{4+\alpha}$ for some $\alpha\in(0,1)$, $a$ and $b$ are bounded, $a,b\in C_{b}^{2,4}$, then given

 $\displaystyle E_{N_{T}-1}^{x_{N_{T}-1}}[|\varepsilon^{Z_{N_{T}}}|^{2}]$ $\displaystyle\thicksim\mathcal{O}((\Delta t)^{2}),$ $\displaystyle E_{N_{T}-1}^{x_{N_{T}-1}}[|\varepsilon^{Y_{N_{T}}}|^{2}]$ $\displaystyle\thicksim\mathcal{O}((\Delta t)^{2}),$

it holds that

 $\displaystyle E_{0}^{x_{0}}\bigg{[}|\varepsilon^{Y_{i}}|^{2}+\frac{\Delta t|% \varepsilon^{Z_{i}}|^{2}}{4}\bigg{]}$ $\displaystyle\qquad\leq Q(\Delta t)^{2}+\tilde{Q}\sum_{i+1}^{N_{T}}\bigg{(}G% \bigg{(}(\operatorname{Var}^{Y}_{i})^{2}+\frac{N_{T}(\operatorname{Var}_{i}^{Y% })^{2}}{T}+\frac{T(\operatorname{Var}_{i}^{Z})^{2}}{N_{T}}\bigg{)}\bigg{)},$ $\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad% \qquad\quad 0\leq i\leq N_{T}-1,$ (4.34)

where $Q$ is a constant that depends only on $T$, $x_{0}$ and the bounds of $f$, $g$ and $a$, $b$ in (1.1); $\tilde{Q}$ is a constant depending on $T$, $x_{0}$ and $L$; $\operatorname{Var}^{Y}_{i}$ and $\operatorname{Var}_{i}^{Z}$ are the bounded constants; and $M$ and $G$ are the number of samples and splitting groups, respectively.

###### Proof.

By combining both (4.30) and (4.33), it is straightforward to obtain

 $\displaystyle E_{i}^{x_{i}}[|\varepsilon^{Y_{i}}|^{2}]+\frac{\Delta t}{2}E_{i}% ^{x_{i}}[|\varepsilon^{Z_{i}}|^{2}]$ $\displaystyle\quad\leq(1+\gamma\Delta t)|E^{x_{i}}_{i}[\varepsilon^{Y_{i+1}}]|% ^{2}+\frac{3(\Delta tL)^{2}}{2}(E_{i}^{x_{i}}[|\varepsilon^{Y_{i}}|^{2}]+E_{i}% ^{x_{i}}[|\varepsilon^{Z_{i}}|^{2}])$ $\displaystyle\quad\qquad+\frac{3(\Delta tL)^{2}}{2}(E_{i}^{x_{i}}[|\varepsilon% ^{Y_{i+1}}|^{2}]+E_{i}^{x_{i}}[|\varepsilon^{Z_{i+1}}|^{2}])$ $\displaystyle\quad\qquad+6E_{i}^{x_{i}}[|R_{\theta}^{Y_{i}}|^{2}]+6E_{i}^{x_{i% }}[|R_{\text{tr}}^{Y_{i}}|^{2}]+(E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^{2}]-|E% _{i}^{x_{i}}[\varepsilon^{Y_{i+1}}]|^{2})$ $\displaystyle\quad\qquad+\frac{3(\Delta tL)^{2}}{2}(E_{i}^{x_{i}}[|\varepsilon% ^{Y_{i+1}}|^{2}]+E_{i}^{x_{i}}[|\varepsilon^{Z_{i+1}}|^{2}])+3\frac{E_{i}^{x_{% i}}[|R_{\theta}^{Z_{i}}|^{2}]}{\Delta t}$ $\displaystyle\quad\qquad+3\Delta tE_{i}^{x_{i}}[|R_{\text{tr}}^{Z_{i}}|^{2}]$ $\displaystyle\quad\qquad+\frac{1}{\gamma}\bigg{(}\frac{3\Delta tL^{2}}{2}(E_{i% }^{x_{i}}[|\varepsilon^{Y_{i}}|^{2}]+E_{i}^{x_{i}}[|\varepsilon^{Z_{i}}|^{2}])% +\frac{3\Delta tL^{2}}{2}(E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^{2}]+E_{i}^{x_% {i}}[|\varepsilon^{Z_{i+1}}|^{2}])+\frac{6E_{i}^{x_{i}}[|R_{\theta}^{Y_{i}}|^{% 2}]}{\Delta t}+\frac{6E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{i}}|^{2}]}{\Delta t}% \bigg{)},$

which implies

 $\displaystyle\bigg{(}1-\frac{3(\Delta tL)^{2}}{2}-\frac{3\Delta tL^{2}}{2% \gamma}\bigg{)}E_{i}^{x_{i}}[|\varepsilon^{Y_{i}}|^{2}]+\bigg{(}\frac{\Delta t% }{2}-\frac{3(\Delta tL)^{2}}{2}-\frac{3\Delta tL^{2}}{2\gamma}\bigg{)}E_{i}^{x% _{i}}[|\varepsilon^{Z_{i}}|^{2}]$ $\displaystyle\qquad\leq\bigg{(}1+\gamma\Delta t+3(\Delta tL)^{2}+\frac{3\Delta tL% ^{2}}{2\gamma}\bigg{)}E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^{2}]$ $\displaystyle\qquad\qquad+\bigg{(}3(\Delta tL)^{2}+\frac{3\Delta tL^{2}}{2% \gamma}\bigg{)}E_{i}^{x_{i}}[|\varepsilon^{Z_{i+1}}|^{2}]+6E_{i}^{x_{i}}[|R_{% \theta}^{Y_{i}}|^{2}]+6E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{i}}|^{2}]$ $\displaystyle\qquad\qquad+\frac{6E_{i}^{x_{i}}[|R_{\theta}^{Y_{i}}|^{2}]}{% \gamma\Delta t}+\frac{6E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{i}}|^{2}]}{\gamma% \Delta t}+\frac{3E_{i}^{x_{i}}[|R_{\theta}^{Z_{i}}|^{2}]}{\Delta t}+3\Delta tE% _{i}^{x_{i}}[|R_{\text{tr}}^{Z_{i}}|^{2}].$

We choose $\gamma$ such that

 $\Delta t/2-3\Delta tL^{2}/2\gamma\geq 3\Delta tL^{2}/2\gamma$

(ie, $\gamma\geq 6L^{2}$). Thus, the above inequality can be rewritten as

 $\displaystyle\bigg{(}1-\frac{3(\Delta tL)^{2}}{2}-\frac{3\Delta tL^{2}}{2% \gamma}\bigg{)}E_{i}^{x_{i}}[|\varepsilon^{Y_{i}}|^{2}]+\bigg{(}\frac{3\Delta tL% ^{2}}{2\gamma}-\frac{3(\Delta tL)^{2}}{2}\bigg{)}E_{i}^{x_{i}}[|\varepsilon^{Z% _{i}}|^{2}]$ $\displaystyle\qquad\leq\bigg{(}1+\gamma\Delta t+3(\Delta tL)^{2}+\frac{3\Delta tL% ^{2}}{2\gamma}\bigg{)}E_{i}^{x_{i}}[|\varepsilon^{Y_{i+1}}|^{2}]$ $\displaystyle\qquad\qquad+\bigg{(}3(\Delta tL)^{2}+\frac{3\Delta tL^{2}}{2% \gamma}\bigg{)}E_{i}^{x_{i}}[|\varepsilon^{Z_{i+1}}|^{2}]+6E_{i}^{x_{i}}[|R_{% \theta}^{Y_{i}}|^{2}]+6E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{i}}|^{2}]$ $\displaystyle\qquad\qquad+\frac{E_{i}^{x_{i}}[|R_{\theta}^{Y_{i}}|^{2}]}{L^{2}% \Delta t}+\frac{E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{i}}|^{2}]}{L^{2}\Delta t}+% \frac{3E_{i}^{x_{i}}[|R_{\theta}^{Z_{i}}|^{2}]}{\Delta t}+3\Delta tE_{i}^{x_{i% }}[|R_{\text{tr}}^{Z_{i}}|^{2}],$

which implies

 $\displaystyle E_{i}^{x_{i}}[|\varepsilon^{Y_{i}}|^{2}]+\frac{\Delta t}{4}E_{i}% ^{x_{i}}[|\varepsilon^{Z_{i}}|^{2}]$ $\displaystyle\qquad\leq\frac{1+C\Delta t}{1-C\Delta t}\bigg{(}\bigg{(}E_{i}^{x% _{i}}[|\varepsilon^{Y_{i+1}}|^{2}]+\frac{\Delta t}{4}E_{i}^{x_{i}}[|% \varepsilon^{Z_{i+1}}|^{2}]\bigg{)}$ $\displaystyle\qquad\qquad\qquad\qquad+6E_{i}^{x_{i}}[|R_{\theta}^{Y_{i}}|^{2}]% +6E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{i}}|^{2}]$ $\displaystyle\qquad\qquad\qquad\qquad+\frac{E_{i}^{x_{i}}[|R_{\theta}^{Y_{i}}|% ^{2}]}{L^{2}\Delta t}+\frac{E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{i}}|^{2}]}{L^{2}% \Delta t}$ $\displaystyle\qquad\qquad\qquad\qquad+\frac{3E_{i}^{x_{i}}[|R_{\theta}^{Z_{i}}% |^{2}]}{\Delta t}+3\Delta tE_{i}^{x_{i}}[|R_{\text{tr}}^{Z_{i}}|^{2}]\bigg{)}.$

By induction, we then obtain

 $\displaystyle E_{i}^{x_{i}}[|\varepsilon^{Y_{i}}|^{2}]+\frac{\Delta t}{4}E_{i}% ^{x_{i}}[|\varepsilon^{Z_{i}}|^{2}]$ $\displaystyle\qquad\leq\bigg{(}\frac{1+C\Delta t}{1-C\Delta t}\bigg{)}^{\!N_{T% }-i}\bigg{(}E_{N_{T}-1}^{x_{N_{T}-1}}[|\varepsilon^{Y_{N_{T}}}|^{2}]+\frac{% \Delta t}{4}E_{N_{T}-1}^{x_{N_{T}-1}}[|\varepsilon^{Z_{N_{T}}}|^{2}]\bigg{)}$ $\displaystyle\qquad\qquad+\sum_{j=i+1}^{N_{T}}\bigg{(}\frac{1+C\Delta t}{1-C% \Delta t}\bigg{)}^{\!j-i}$ $\displaystyle\qquad\qquad\times\bigg{(}6E_{i}^{x_{i}}[|R_{\theta}^{Y_{j}}|^{2}% ]+\frac{E_{i}^{x_{i}}[|R_{\theta}^{Y_{j}}|^{2}]}{L^{2}\Delta t}+\frac{3E_{i}^{% x_{i}}[|R_{\theta}^{Z_{j}}|^{2}]}{\Delta t}$ $\displaystyle\qquad\qquad\qquad+6E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{j}}|^{2}]+% \frac{E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{j}}|^{2}]}{L^{2}\Delta t}+3\Delta tE_{i% }^{x_{i}}[|R_{\text{tr}}^{Z_{j}}|^{2}]\bigg{)}$ $\displaystyle\qquad\leq\exp(2CT)\bigg{(}E_{N_{T}-1}^{x_{N_{T}-1}}[|\varepsilon% ^{Y_{N_{T}}}|^{2}]+\frac{\Delta t}{4}E_{N_{T}-1}^{x_{N_{T}-1}}[|\varepsilon^{Z% _{N_{T}}}|^{2}]\bigg{)}$ $\displaystyle\qquad\qquad+\exp(2CT)\sum_{j=i+1}^{N_{T}}\bigg{(}6E_{i}^{x_{i}}[% |R_{\theta}^{Y_{j}}|^{2}]+\frac{E_{i}^{x_{i}}[|R_{\theta}^{Y_{j}}|^{2}]}{L^{2}% \Delta t}+\frac{3E_{i}^{x_{i}}[|R_{\theta}^{Z_{j}}|^{2}]}{\Delta t}+6E_{i}^{x_% {i}}[|R_{\text{tr}}^{Y_{j}}|^{2}]+\frac{E_{i}^{x_{i}}[|R_{\text{tr}}^{Y_{j}}|^% {2}]}{L^{2}\Delta t}+3\Delta tE_{i}^{x_{i}}[|R_{\text{tr}}^{Z_{j}}|^{2}]\bigg{% )}.$

The regression errors $R^{Z}_{\text{tr}}$ and $R^{Y}_{\text{tr}}$ with $M$ samples are given by (4.22) and (4.23), from which we can deduce, for example,

 $|R_{\text{tr}}^{Y_{j}}|^{2}\leq G|2(\operatorname{Var}_{j}^{y}+\hat{R}_{\text{% min}}(\hat{T}^{Y_{j}}))|^{2}:=G(\operatorname{Var}^{Y}_{j})^{2}.$

Similarly, for $|R_{\text{tr}}^{Y_{j}}|^{2}/\Delta t$ and $\Delta t|R_{\text{tr}}^{Z_{j}}|^{2}$ we obtain

 $\frac{GN_{T}(\operatorname{Var}_{j}^{Y})^{2}}{T}\quad\text{and}\quad\frac{GT(% \operatorname{Var}_{j}^{Z})^{2}}{N_{T}},$

respectively. Finally, with the known conditions and bounds of the local errors mentioned above, the proof is complete. ∎

## 5 Numerical experiments

In this section we use some numerical examples to show the accuracy of our methods for solving the FBSDEs. $N_{T}$ and $M$ are the total number of discrete time steps and the sample size, respectively. For all the examples, we consider an equidistant time partition and perform 20 Picard iterations. We run the algorithms 10 times independently and take the average value of the absolute error, where two different seeds are used every five simulations. Note that in the following numerical experiments we present the errors of $Y_{0}^{\Delta t}$ and $Z_{0}^{\Delta t}$ corresponding to Theorem 4.2. The errors at other time steps and for different fixed sample points are similar to the results in our test. The numerical experiments are performed using an Intel Core i5-8500 3.00 GHz processor with 15 GB RAM.

### 5.1 Example of linear BSDE

The first BSDE we consider is

 $-\mathrm{d}Y_{t}=\bigg{(}\frac{Y_{t}}{2}-\frac{Z_{t}}{2}\bigg{)}\mathrm{d}t-Z_% {t}\mathrm{d}W_{t},\qquad Y_{T}=\sin\bigg{(}W_{T}+\frac{T}{2}\bigg{)},$ (5.1)

with the analytical solution

 $Y_{t}=\sin\bigg{(}W_{t}+\frac{t}{2}\bigg{)},\qquad Z_{t}=\cos\bigg{(}W_{t}+% \frac{t}{2}\bigg{)}.$ (5.2)

The generator $f$ contains the components $Y_{t}$ and $Z_{t}$. We set $T=\frac{1}{2}$, where the analytical solution of $(Y_{0},Z_{0})$ is $(0,1)$.

First, in order to see the computational acceleration by using the sample-splitting introduced above, we compare the scheme of using the sample-splitting (samples merging at different steps) with not using the sample-splitting in Figure 1. Letting $\smash{Y^{\Delta_{t}}_{0,k}}$ and $\smash{Z^{\Delta_{t}}_{0,k}}$ denote the result on the $k$th run of the algorithm, $k=1,\dots,10$, the approximations read as

 $Y^{\Delta_{t}}_{0}=\frac{1}{10}\sum_{k=1}^{10}Y^{\Delta_{t}}_{0,k}\quad\text{% and}\quad Z^{\Delta_{t}}_{0}=\frac{1}{10}\sum_{k=1}^{10}Z^{\Delta_{t}}_{0,k}.$

In our tests we consider the average of the absolute errors: that is,

 $\frac{1}{10}\sum_{k=1}^{10}|Y^{\Delta_{t}}_{0,k}-Y_{0}|\quad\text{and}\quad% \frac{1}{10}\sum_{k=1}^{10}|Z^{\Delta_{t}}_{0,k}-Z_{0}|.$

We see that there are no considerable differences for approximating $Y_{0}$. Also, the approximations of $Z_{0}$ with the sample-splitting against $M$ converge in a very stable fashion. The sample-splitting allows a very efficient computation, especially merging at the penultimate step. In the remainder of this paper we perform all the schemes, always using the splitting with $M_{g}=1000$ unless otherwise specified. Figure 1: Comparison of using and not using sample-splitting for NT=10. The average run times are given in seconds. (a) 110⁢∑k=110|Y0,kΔt-Y0|. (b) 110⁢∑k=110|Z0,kΔt-Z0|.

Next we study the influence of $M$ on the error. We fix the number of steps to $2$ and test all possible values of $\theta_{l}$. We find that the explicit schemes for $\theta_{3}=0$, $\theta_{2}=1$ and $\theta_{1}=\frac{1}{2},1$ and the implicit schemes for $\theta_{3}=\frac{1}{2},1$, $\theta_{2}=1$ and $\theta_{1}=\frac{1}{2},1$ show good convergence in terms of $M$. As an example we report the absolute errors and the empirical standard deviations

 $\sqrt{\frac{1}{9}\sum_{k=1}^{10}|Y^{\Delta_{t}}_{0,k}-Y^{\Delta_{t}}_{0}|^{2}}% ,\qquad\sqrt{\frac{1}{9}\sum_{k=1}^{10}|Z^{\Delta_{t}}_{0,k}-Z^{\Delta_{t}}_{0% }|^{2}}$

for some chosen schemes in Table 1. We observe that the second-order scheme $(\theta_{1}=\frac{1}{2},\theta_{2}=\frac{1}{2},\theta_{3}=\frac{1}{2})$ gives a very small error for quite a large $M$, although $N_{T}=2$. This is to say that the linear example is oversimplified in the scheme $(\theta_{1}=\frac{1}{2},\theta_{2}=\frac{1}{2},\theta_{3}=\frac{1}{2})$ with the tree-based regression and can thus mislead in terms of efficiency. Therefore, in order to obtain reliable numerical convergence rates for our proposed tree-based approach, we will consider $(\theta_{1}=\frac{1}{2},\theta_{2}=1,\theta_{3}=\frac{1}{2})$ for the following analysis and all the numerical examples, which provides the smallest errors for both $Y$ and $Z$ in all the schemes except for the scheme $(\theta_{1}=\frac{1}{2},\theta_{2}=\frac{1}{2},\theta_{3}=\frac{1}{2})$. To take this a step further, we now fix $M=\text{200\,000}$ and plot in Figure 2 the absolute error against the number of steps. We see that the scheme converges meaningfully. Figure 2: Absolute error versus NT for M=200 000. (a) 110⁢∑k=110|Y0,kΔt-Y0|. (b) 110⁢∑k=110|Z0,kΔt-Z0|. Figure 3: Convergence rate for (a) the Y-component and (b) the Z-component. Line with filled circle shows (θ1=1, θ2=1, θ3=1). Line with unfilled circle shows (θ1=12, θ2=1, θ3=12).

For the convergence with respect to the time step we refer to Figure 3, where we roughly adjust sample sizes $M$ according to the time partitions. The results shown in Table 2 and Figure 1 are consistent with the statements in Theorem 4.2.

### 5.2 Example of two-dimensional FBSDE

For the two-dimensional FBSDE we consider the Heston stochastic volatility model (Heston 1993), which reads

 \left.\begin{aligned} \displaystyle\mathrm{d}S_{t}&\displaystyle=\mu S_{t}% \mathrm{d}t+\sqrt{\nu_{t}}S_{t}\mathrm{d}W^{S}_{t},\\ \displaystyle\mathrm{d}\nu_{t}&\displaystyle=\kappa_{\nu}(\mu_{\nu}-\nu_{t})% \mathrm{d}t+\sigma_{\nu}\sqrt{\nu_{t}}\mathrm{d}W^{\nu}_{t},\\ \displaystyle\mathrm{d}W^{S}_{t}\mathrm{d}W^{\nu}_{t}&\displaystyle=\rho% \mathrm{d}t,\end{aligned}\right\} (5.3)

where $S_{t}$ is the spot price of the underlying asset and $\nu_{t}$ is the volatility. It is well known that the Heston model (5.3) can be reformulated as

 $\mathrm{d}\bm{X}_{t}=\begin{pmatrix}\mathrm{d}\nu_{t}\\ \mathrm{d}S_{t}\end{pmatrix}=\begin{pmatrix}\kappa_{\nu}(\mu_{\nu}-\nu_{t})\\ \mu S_{t}\end{pmatrix}\mathrm{d}t+\begin{pmatrix}\sigma_{\nu}\sqrt{\nu_{t}}&0% \\ S_{t}\rho\sqrt{\nu_{t}}&S_{t}\sqrt{1-\rho^{2}}\sqrt{\nu_{t}}\end{pmatrix}% \begin{pmatrix}\mathrm{d}\tilde{W}_{t}^{\nu}\\ \mathrm{d}\tilde{W}_{t}^{S}\end{pmatrix},$ (5.4)

where $\tilde{W}_{t}^{\nu}$ and $\tilde{W}_{t}^{S}$ are independent Brownian motions. To find the FBSDE form for the Heston model we consider the following self-financing strategy:

 $\displaystyle\mathrm{d}Y_{t}$ $\displaystyle=a_{t}\mathrm{d}U(t,\nu_{t},S_{t})+b_{t}\mathrm{d}S_{t}+c_{t}% \mathrm{d}P_{t}$ (5.5) $\displaystyle=a_{t}\mathrm{d}U(t,\nu_{t},S_{t})+b_{t}\mathrm{d}S_{t}+\frac{(Y_% {t}-a_{t}U(t,\nu_{t},S_{t})-b_{t}S_{t})}{P_{t}}\mathrm{d}P_{t},$ (5.6)

where $U(t,\nu_{t},S_{t})$ is the value of another option for hedging volatility; $\mathrm{d}P_{t}=rP_{t}\mathrm{d}t$ is used for the risk-free asset; and $a_{t}$, $b_{t}$ and $c_{t}$ are numbers of the option, the underlying asset and the risk-free asset, respectively. We assume that

 $\mathrm{d}U(t,\nu_{t},S_{t})=\eta(t,\nu_{t},S_{t})\mathrm{d}t,$ (5.7)

which can be substituted into (5.6) to obtain

 $\displaystyle-\mathrm{d}Y_{t}$ $\displaystyle=\bigg{(}a_{t}rU(t,\nu_{t},S_{t})-a_{t}\eta(t,\nu_{t},S_{t})-% \frac{(\mu-r)}{\sqrt{1-\rho^{2}}\sqrt{\nu_{t}}}Z_{t,2}-rY_{t}\bigg{)}\mathrm{d% }t-\bm{Z}_{t}\begin{pmatrix}\mathrm{d}\tilde{W}_{t}^{\nu}\\ \mathrm{d}\tilde{W}_{t}^{S}\end{pmatrix},$ (5.8)

with

 $\bm{Z}_{t}=(Z^{1}_{t},Z^{2}_{t})=(a_{t}\sigma_{\nu}\sqrt{\nu_{t}}+b_{t}S_{t}% \rho\sqrt{\nu_{t}},b_{t}S_{t}\sqrt{1-\rho^{2}}\sqrt{\nu_{t}}).$ (5.9)

In the Heston (1993) model, the market price of the volatility risk is assumed to be $\lambda\nu_{t}$. With the notation used in (5.4), the Heston pricing PDE including $\lambda$ reads

 $\displaystyle\frac{\partial V}{\partial t}+rS\frac{\partial V}{\partial S}+(% \kappa_{\nu}(\mu_{\nu}-\nu)-\lambda\nu)\frac{\partial V}{\partial\nu}+\tfrac{1% }{2}\nu S^{2}\frac{\partial^{2}V}{\partial S^{2}}$ $\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+\rho\sigma_{\nu}\nu S\frac{% \partial^{2}V}{\partial S\partial\nu}+\tfrac{1}{2}\sigma_{\nu}^{2}\nu\frac{% \partial^{2}V}{\partial\nu^{2}}-rV=0.$ (5.10)

The solution of the FBSDE (5.8) is exactly the solution of the Heston PDE (5.10) by choosing $rU(t,\nu_{t},S_{t})-\eta(t,\nu_{t},S_{t})\equiv-\lambda\nu_{t}$. Equations (5.8) and (5.9) can thus be reformulated as

 $\displaystyle\,-\mathrm{d}Y_{t}$ $\displaystyle=\bigg{(}-a_{t}\lambda\nu_{t}-\frac{(\mu-r)}{\sqrt{1-\rho^{2}}% \sqrt{\nu_{t}}}Z_{t}^{2}-rY_{t}\bigg{)}\mathrm{d}t-\bm{Z}_{t}\begin{pmatrix}% \mathrm{d}\tilde{W}_{t}^{\nu}\\ \mathrm{d}\tilde{W}_{t}^{S}\end{pmatrix}$ $\displaystyle=\bigg{(}-\frac{\lambda\sqrt{\nu_{t}}}{\sigma_{\nu}}Z_{t}^{1}+% \bigg{(}\frac{\rho\lambda\sqrt{\nu_{t}}}{\sqrt{1-\rho^{2}}\sigma_{\nu}}-\frac{% (\mu-r)}{\sqrt{1-\rho^{2}}\sqrt{\nu_{t}}}\bigg{)}Z_{t}^{2}-rY_{t}\bigg{)}% \mathrm{d}t-\bm{Z}_{t}\begin{pmatrix}\mathrm{d}\tilde{W}_{t}^{\nu}\\ \mathrm{d}\tilde{W}_{t}^{S}\end{pmatrix},$ (5.11)

with $\bm{Z}_{t}$ defined in (5.9). To the best of our knowledge, this is the first work in the literature to derive the Heston BSDE. Note that the generator in this example could be non-Lipschitz continuous. The European-style option can be replicated by hedging this portfolio. We consider a call option, for example, whose value at time $t$ is identical to the portfolio value $Y_{t}$, and $Y_{T}=\xi=\max(S_{T}-K,0)$. Hence, $Y_{t}$ is the Heston option value $V(t,\nu_{t},S_{t})$ and $\bm{Z}_{t}$ presents the hedging strategies, where

 $Z_{t}^{1}=\frac{\partial V}{\partial\nu}\sigma_{\nu}\sqrt{\nu_{t}}+\frac{% \partial V}{\partial S}S_{t}\rho\sqrt{\nu_{t}}\quad\text{and}\quad Z_{t}^{2}=% \frac{\partial V}{\partial S}S_{t}\sqrt{1-\rho^{2}}\sqrt{\nu_{t}}.$

The parameter values used for this numerical test are

 $\displaystyle K=S_{0}=50,\qquad r=0.03,\qquad\mu=0.05,\qquad\lambda=0,\qquad T% =0.5,$ $\displaystyle\nu_{0}=\mu_{\nu}=0.04,\qquad\kappa_{\nu}=1.9,\qquad\sigma_{\nu}=% 0.1,\qquad\rho=-0.7,$

which give the exact solution $Y_{0}=3.1825$. The forward processes $\mathrm{d}S_{t}$ and $\mathrm{d}\nu_{t}$ are simulated using the Euler method, and for the final values at the maturity $T$ we take

 \left.\begin{aligned} \displaystyle Y^{\Delta_{t}}_{N_{T},\mathcal{M}}&% \displaystyle=\max(S^{\Delta_{t}}_{N_{T},\mathcal{M}}-K,0),\\ \\ \displaystyle Z^{1,\Delta_{t}}_{N_{T},\mathcal{M}}&\displaystyle=\begin{cases}% S^{\Delta_{t}}_{N_{T},\mathcal{M}}\rho\sqrt{\nu^{\Delta_{t}}_{N_{T},\mathcal{M% }}}&\text{when~{}}S^{\Delta_{t}}_{N_{T},\mathcal{M}}>K,\\ \\ 0&\text{otherwise},\end{cases}\\ \\ \displaystyle Z^{2,\Delta_{t}}_{N_{T},\mathcal{M}}&\displaystyle=\begin{cases}% S^{\Delta_{t}}_{N_{T},\mathcal{M}}\sqrt{1-\rho^{2}}\sqrt{\nu^{\Delta_{t}}_{N_{% T},\mathcal{M}}}&\text{when~{}}S^{\Delta_{t}}_{N_{T},\mathcal{M}}>K,\\ \\ 0&\text{otherwise},\end{cases}\end{aligned}\right\} (5.12)

where $\mathcal{M}=1,\dots,M$. The corresponding relative errors are reported in Table 3. We obtain a fairly accurate approximation for the Heston option price by solving the two-dimensional FBSDE, although the generator is not Lipschitz continuous. It is well known that a splitting scheme of the alternating direction implicit (ADI) type has been widely analyzed and applied to efficiently find the numerical solution of a two-dimensional parabolic PDE. We thus compare our tree-based approach to the Craig–Sneyd Crank–Nicolson ADI finite difference scheme (Craig and Sneyd 1988) for solving the Heston model in Table 3. We denote by $N_{S}$ and $N_{\nu}$ the number of points for the stock price and the volatility grid, respectively. The ADI scheme is performed in the domain $[0,2K]$ for $S$ and $[0,0.5]$ for $\nu$, with a uniform grid $N_{S}=N_{\nu}=40$; the time steps $N_{T}$ are given in Table 3. We can observe that the tree-based approach gives a better (at least, compatible) result.

### 5.3 Examples of nonlinear FBSDE

In this section we test our scheme for nonlinear high-dimensional problems. We find that nonlinear training data may lead to overfitting when directly using the procedure introduced above. Therefore, to avoid overfitting for the nonlinear problems, we propose to control the error while already growing a tree. To do this, we estimate the cross-validation mean squared errors of the trees, which are constructed with a different number of observations in each branch node. Clearly, the best tree (ie, that with the optimal number of observations in each branch node) can be determined by comparing the errors. Theoretically, for the best results, the error control needs to be performed for each time step. However, that would be too computationally expensive. Fortunately, in our test we find that the optimal numbers of observations for each time step are very close to each other. For substantially less computation time, we only need to determine one of these, eg, for the first iteration, and then we need to fix it for all other iterations. We note that the pruning procedure cannot bring considerable improvement when the tree has been grown using the optimal number of observations, and it is thus unnecessary in this case. Figure 4: Errors versus NT for (5.13) in the three-dimensional case. (a) 110⁢∑k=110|Y0,kΔt-Y0|. (b) 110⁢∑k=1101D⁢∑d=1D|Z0,kd,Δt-Z0d|. Figure 5: Errors versus NT for (5.13) in the 100-dimensional case. (a) 110⁢∑k=110|Y0,kΔt-Y0|. (b) 110⁢∑k=1101D⁢∑d=1D|Z0,kd,Δt-Z0d|.

We first consider a modified version of BSDE with quadratically growing derivatives (derived in Gobet and Turkedjiev (2015)), whose explicit solution is known. The BSDE in the 100-dimensional case is analyzed numerically in E et al (2017) and given by

 $-\mathrm{d}Y_{t}=\|Z\|^{2}_{\mathbb{R}^{1\times D}}-\|\nabla\psi(t,W_{t})\|^{2% }_{\mathbb{R}^{D}}-(\partial_{t}+\tfrac{1}{2}\Delta)\psi(t,W_{t})\mathrm{d}t-Z% _{t}\mathrm{d}W_{t},$ (5.13)

with the analytical solution

 \left.\begin{aligned} \displaystyle Y_{t}&\displaystyle=\psi(t,W_{t})=\sin% \bigg{(}\bigg{(}T-t+\frac{1}{D}\|W_{t}\|^{2}_{\mathbb{R}^{D}}\bigg{)}^{\!% \alpha}\bigg{)},\\ \displaystyle Z_{t}&\displaystyle=2\alpha W_{t}^{\mathrm{T}}\cos\bigg{(}\bigg{% (}T-t+\frac{1}{D}\|W_{t}\|^{2}_{\mathbb{R}^{D}}\bigg{)}^{\!\alpha}\bigg{)}% \bigg{(}T-t+\frac{1}{D}\|W_{t}\|^{2}_{\mathbb{R}^{D}}\bigg{)}^{\!\alpha-1},% \end{aligned}\right\} (5.14)

where $\alpha\in(0,\frac{1}{2}]$. Letting $\alpha=0.4$, $T=1$, we test our scheme for $D=3$ and $D=100$ in order to compare our approach with methods in Gobet and Turkedjiev (2015) and E et al (2017). We plot the absolute errors of $Y$ and $Z$ against the number of steps in Figures 4 and 5 for the 3-dimensional and 100-dimensional cases, respectively, where the sample sizes $M$ are roughly adjusted according to the time partitions. We obtain very good numerical results and reach an error of order $10^{-3}$. To show the errors against the sample size $M$, we fix $N_{T}=10$ and report the errors in Table 4, where the convergence in a stable fashion is shown. Further, a better approximation (smaller standard deviations) can always be achieved with larger $M$. Note that the sample-splittings $M_{g}=20\,000$ and $M_{g}=100\,000$ are used for 3-dimensional and 100-dimensional problems, respectively, when $M>50\,000$. Figure 6: Errors versus NT for pricing with different interest rates for the Y-component in the one-dimensional case. Figure 7: Errors versus NT for pricing with different interest rates for the Y-component in the 100-dimensional case.

As another example, we consider a pricing problem of a European option in a financial market with different interest rates for borrowing and lending to hedge the European option. We suppose that $D$ stocks, which are for simplicity assumed to be independent and identically distributed, are driven by

 $\mathrm{d}S_{t,d}=\mu S_{t,d}\mathrm{d}t+\sigma S_{t,d}\mathrm{d}W_{t,d},\quad d% =1,\dots,D,$ (5.15)

where $\sigma>0$ and $\mu\in\mathbb{R}$. The terminal condition and generator for the option pricing with different interest rate are given by

 $Y_{T}=\xi=\max\Big{(}\max_{d=1,\dots,D}(S_{T,d})-K_{1},0\Big{)}-2\max\Big{(}% \max_{d=1,\dots,D}(S_{T,d})-K_{2},0\Big{)}$ (5.16)

and

 $f(t,x,y,z)=-R^{l}y-\frac{\mu-R^{l}}{\sigma}\sum_{d=1}^{D}z_{d}+(R^{b}-R^{l})% \max\bigg{(}0,\frac{1}{\sigma}\sum_{d=1}^{D}z_{d}-y\bigg{)},$ (5.17)

respectively, where $R^{b},R^{l}$ are different interest rates and $K_{1}$, $K_{2}$ are strikes. Obviously, (5.16) and (5.17) are both nonlinear.

We first consider a one-dimensional case, in which we use $Y_{T}=\xi=\max(S_{T}-100,0)$ instead of (5.16) to agree with the setting in Gobet et al (2005) and E et al (2017). The parameter values are set as $T=0.5$, $\mu=0.06$, $\sigma=0.02$, $\smash{R^{l}=0.04}$ and $\smash{R^{b}=0.06}$. We use $Y_{0}=7.156$, computed using the finite-difference method, as the reference price. Note that the reference price is confirmed in Gobet et al (2005) as well. As in the previous example, we plot in Figure 6 the relative error against the number of steps, which gives very good numerical results. In Table 5 we compare our results with the results in E et al (2019, Table 5), and we show that the tree-based approach with $N_{T}=10$ can reach the accuracy level of the multilevel Monte Carlo with seven Picard iterations in significantly less computational time. Finally, we test our scheme for the 100-dimensional nonlinear pricing problem. Figure 7 compares the errors against the number of steps $N_{T}$. In Table 6 we compare our results with those in E et al (2019, Table 6). The reference price $Y_{0}=21.2988$ is computed using the multilevel Monte Carlo with seven Picard iterations, whereas $K_{1}=120$, $K_{2}=150$ and values of other parameters are the same as those for the one-dimensional case. For a comparison of computational cost in terms of the number of splitting groups, we set $M_{g}=50\,000$ for $M>50\,000$, ie, there are twice as many splitting groups as in the latter 100-dimensional example. We see that our result with $N_{T}=10,M=10\,000$ is already better than the approximation of multilevel Monte Carlo with six iterations for less computational time. Further, compared with the results in Table 4, the computational expenses are substantially reduced since the samples are split into more groups. Note that the same reference price is used to compare the deep-learning-based numerical methods for high-dimensional BSDEs in E et al (2017, Table 3), which achieved a relative error of 0.0039 in a run time of 566 seconds.

## 6 Conclusion

In this paper we have studied ways of solving forward–backward stochastic differential equations numerically using regression-tree-based methods. We showed how to use the regression tree to approximate the conditional expectations arising by discretizing the time integrands using the general theta-discretization method. We performed several numerical experiments for different types of (F)BSDEs, including their application to 100-dimensional nonlinear pricing problems. Our numerical results are quite promising and indicate that the tree-based approach is very attractive for solving high-dimensional nonlinear (F)BSDEs.

## Declaration of interest

The author reports no conflicts of interest. The author alone is responsible for the content and writing of the paper.

## Acknowledgements

The author thanks the referees for their valuable suggestions, which greatly improved the paper.

## References

• Bender, C., and Steiner, J. (2012). Least-squares Monte Carlo for backward SDEs. Numerical Methods in Finance 12, 257–289 (https://doi.org/10.1007/978-3-642-25746-9_8).
• Bouchard, B., and Touzi, N. (2004). Discrete-time approximation and Monte Carlo simulation of backward stochastic differential equations. Stochastic Processes and Their Applications 111, 175–206 (https://doi.org/10.1016/j.spa.2004.01.001).
• Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Taylor & Francis.
• Craig, I. J. D., and Sneyd, A. D. (1988). An alternating-direction implicit scheme for parabolic equations with mixed derivatives. Computers and Mathematics with Applications 16, 341–350 (https://doi.org/10.1016/0898-1221(88)90150-2).
• Crisan, D., and Manolarakis, K. (2010). Solving backward stochastic differential equations using the cubature method: application to nonlinear pricing. SIAM Journal on Financial Mathematics 3(1), 534–571 (https://doi.org/10.1137/090765766).
• E, W., Han, J., and Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics 5(4), 349–380 (https://doi.org/10.1007/s40304-017-0117-6).
• E, W., Hutzenthaler, M., Jentzen, A., and Kruse, T. (2019). On multilevel picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations. Journal of Scientific Computing 79, 1534–1571 (https://doi.org/10.1007/s10915-018-00903-0).
• Gobet, E., and Turkedjiev, P. (2015). Linear regression MDP scheme for discrete backward stochastic differential equations under general conditions. Mathematics of Computation 85(299), 1359–1391 (https://doi.org/10.1090/mcom/3013).
• Gobet, E., Lemor, J. P., and Warin, X. (2005). A regression-based Monte Carlo method to solve backward stochastic differential equations. Annals of Applied Probability 15(3), 2172–2202 (https://doi.org/10.1214/105051605000000412).
• Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies 6(2), 327–343 (https://doi.org/10.1093/rfs/6.2.327).
• Karoui, N. E., Kapoudjan, C., Pardoux, E., Peng, S., and Quenez, M. C. (1997a). Reflected solutions of backward stochastic differential equations and related obstacle problems for PDEs. Annals of Probability 25, 702–737.
• Karoui, N. E., Peng, S., and Quenez, M. C. (1997b). Backward stochastic differential equations in finance. Mathematical Finance 7(1), 1–71 (https://doi.org/10.1111/1467-9965.00022).
• Lemor, J., Gobet, E., and Warin, X. (2006). Rate of convergence of an empirical regression method for solving generalized backward stochastic differential equations. Bernoulli 12, 889–916 (https://doi.org/10.3150/bj/1161614951).
• Lepeltier, J. P., and Martin, J. S. (1997). Backward stochastic differential equations with continuous generator. Statistics and Probability Letters 32, 425–430 (https://doi.org/10.1016/S0167-7152(96)00103-4).
• Li, Y., Yang, J., and Zhao, W. (2017). Convergence error estimates of the Crank–Nicolson scheme for solving decoupled FBSDEs. Science China Mathematics 60, 923–948 (https://doi.org/10.1007/s11425-016-0178-8).
• Ma, J., and Zhang, J. (2005). Representations and regularities for solutions to BSDEs with reflections. Stochastic Processes and Their Applications 115, 539–569 (https://doi.org/10.1016/j.spa.2004.05.010).
• Ma, J., Protter, P., and Yong, J. (1994). Solving forward–backward stochastic differential equations explicity: a four step scheme. Probability Theory and Related Fields 98(3), 339–359 (https://doi.org/10.1007/BF01192258).
• Ma, J., Protter, P., Martín, J. S., and Torres, S. (2002). Numerical method for backward stochastic differential equations. Annals of Applied Probability 12, 302–316 (https://doi.org/10.1214/aoap/1015961165).
• Martinez, W. L., and Martinez, A. R. (2007). Computational Statistics Handbook with MATLAB, 2nd edn. CRC Press, Boca Raton, FL (https://doi.org/10.1201/b13622).
• Pardoux, E., and Peng, S. (1990). Adapted solution of a backward stochastic differential equations. System and Control Letters 14, 55–61 (https://doi.org/10.1016/0167-6911(90)90082-6).
• Peng, S. (1991). Probabilistic interpretation for systems of quasilinear parabolic partial differential equations. Stochastics and Stochastic Reports 37(1–2), 61–74 (https://doi.org/10.1080/17442509108833727).
• Ruijter, M. J., and Oosterlee, C. W. (2015). A Fourier cosine method for an efficient computation of solutions to BSDEs. SIAM Journal on Scientific Computing 37(2), A859–A889 (https://doi.org/10.1137/130913183).
• Teng, L., Lapitckii, A., and Günther, M. (2020). A multi-step scheme based on cubic spline for solving backward stochastic differential equations. Applied Numerical Mathematics 150, 117–138 (https://doi.org/10.1016/j.apnum.2019.09.016).
• Zhao, W., Chen, L., and Peng, S. (2006). A new kind of accurate numerical method for backward stochastic differential equations. SIAM Journal on Scientific Computing 28(4), 1563–1581 (https://doi.org/10.1137/05063341X).
• Zhao, W., Wang, J., and Peng, S. (2009). Error estimates of the $\theta$-scheme for backward stochastic differential equations. Discrete and Continuous Dynamical Systems B 12, 905–924 (https://doi.org/10.3934/dcdsb.2009.12.905).
• Zhao, W., Zhang, G., and Ju, L. (2010). A stable multistep scheme for solving backward stochastic differential equations. SIAM Journal on Numerical Analysis 48, 1369–1394 (https://doi.org/10.1137/09076979X).
• Zhao, W., Li, Y., and Zhang, G. (2012). A generalized $\theta$-scheme for solving backward stochastic differential equations. Discrete and Continuous Dynamical Systems B 17(5), 1585–1603 (https://doi.org/10.3934/dcdsb.2012.17.1585).
• Zhao, W., Li, Y., and Ju, L. (2013). Error estimates of the Crank–Nicolson scheme for solving backward stochastic differential equations. International Journal of Numerical Analysis and Modeling 10(4), 876–898.
• Zhao, W., Zhang, W., and Ju, L. (2014). A numerical method and its error estimates for the decoupled forward–backward stochastic differential equations. Communications in Computational Physics 15, 618–646 (https://doi.org/10.4208/cicp.280113.190813a).

Only users who have a paid subscription or are part of a corporate subscription are able to print or copy content.

To access these options, along with all other subscription benefits, please contact [email protected] or view our subscription options here: http://subscriptions.risk.net/subscribe

You are currently unable to copy this content. Please contact [email protected] to find out more.

#### More papers in this issue

You need to sign in to use this feature. If you don’t have a Risk.net account, please register for a trial.

##### You are currently on corporate access.

To use this feature you will need an individual account. If you have one already please sign in.

.

Alternatively you can request an individual account here: