
Learning the optimal policy

Miquel Noguer i Alonso, Daniel Bloch and David Pacheco Aznar

3.1 OVERVIEW

In Section 2 we defined a Markov decision process (MDP) as a 4-tuple (S, R, P, γ) comprising: a function R(s, a) that returns the reward received for taking action a in state s; a transition probability function P(s′ | s, a) specifying the probability that the environment transitions to state s′ if the agent takes action a in state s; and a policy π(s, a) giving the probability of taking action a in state s. The MDP is therefore a probabilistic model.
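To make these objects concrete, here is a minimal sketch of a tabular MDP in Python; the names (n_states, n_actions, R, P, pi, gamma) and the random values are purely illustrative and are not taken from the chapter.

```python
# Minimal tabular MDP sketch; all names and values here are illustrative.
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.9                                    # discount factor

# R[s, a]: reward received for taking action a in state s
R = np.random.rand(n_states, n_actions)

# P[s, a, s2]: probability of transitioning to state s2 after taking a in s
P = np.random.rand(n_states, n_actions, n_states)
P /= P.sum(axis=2, keepdims=True)              # each (s, a) distribution sums to 1

# pi[s, a]: probability of taking action a in state s (here, a uniform random policy)
pi = np.full((n_states, n_actions), 1.0 / n_actions)
```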

The goal of reinforcement learning (RL) is to find a policy π that maximises the expected future (discounted) reward of an agent. To do so we can learn

  • the optimal policy π of that agent, or
  • the optimal value function V(s).

This is because we can always derive π from V(s) and vice versa: the optimal policy is the mapping s → a(s), where the optimal action a(s) is defined in Equation 2.18. In other words, learning the optimal action for every state is the same as learning a policy, which is our objective.
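As a rough illustration of this equivalence, the sketch below recovers a deterministic policy from a value function for the tabular MDP sketched earlier. It assumes Equation 2.18 takes the usual greedy form a(s) = argmax_a [R(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′)], which is an assumption about that equation, not a quote of it.

```python
# Sketch: extract a greedy (deterministic) policy from a value function V.
# Assumes the illustrative tabular arrays R, P and discount gamma defined above.
import numpy as np

def greedy_policy(V, R, P, gamma):
    """Return, for each state s, argmax_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]."""
    Q = R + gamma * np.einsum("sat,t->sa", P, V)   # Q[s, a]: one-step look-ahead value
    return Q.argmax(axis=1)                        # deterministic action index per state

# Example usage with the arrays defined earlier:
# a_of_s = greedy_policy(np.zeros(n_states), R, P, gamma)   # a_of_s[s] is a(s)
```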

3.1.1 The RL problem

Knowing all the elements of an MDP, we can compute the solution before we ever execute an action in the environment. In machine learning (ML), we typically call
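For a fully specified tabular MDP, one standard way to compute the solution offline is value iteration; the sketch below illustrates that general idea only, is not necessarily the method developed in this chapter, and reuses the illustrative R, P and gamma arrays from the earlier sketches.

```python
# Sketch: value iteration on a known tabular MDP (illustrative arrays defined above).
import numpy as np

def value_iteration(R, P, gamma, tol=1e-8, max_iters=10_000):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
        V_new = (R + gamma * np.einsum("sat,t->sa", P, V)).max(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return V

# Example usage: compute V offline, then read off the greedy policy defined earlier.
# V_star = value_iteration(R, P, gamma)
# a_of_s = greedy_policy(V_star, R, P, gamma)
```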
