Learning the optimal policy
Miquel Noguer i Alonso, Daniel Bloch and David Pacheco Aznar
3.1 OVERVIEW
In Section 2 we defined a Markov decision process (MDP) as a 4-tuple (S, R, P, γ) comprising: a set of states S; a reward function R(s, a) that returns the reward received for taking action a in state s; a transition probability function P(s′ | s, a) specifying the probability that the environment transitions to state s′ when the agent takes action a in state s; and a discount factor γ. A policy π(s, a) gives the probability of taking action a in state s. The MDP is therefore a probabilistic model.
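To make the tuple concrete, the following minimal Python sketch (the class name FiniteMDP and all field names are illustrative, not from the text) stores a finite MDP in tabular form:

import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    # Tabular representation of (S, R, P, gamma) for a finite MDP.
    R: np.ndarray        # R[s, a]: reward for taking action a in state s
    P: np.ndarray        # P[s, a, s']: probability of moving to s' from (s, a)
    gamma: float         # discount factor in [0, 1)

Here the state and action spaces are implicit in the array shapes: R has shape (|S|, |A|) and P has shape (|S|, |A|, |S|).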
The goal of reinforcement learning (RL) is to find a policy π that maximises the expected future (discounted) reward of an agent. To do so we can learn either:
- the optimal policy π∗ of that agent, or
- the optimal value function V∗(s).
This is because we can always derive π∗ from V∗(s) and vice versa, since the optimal policy is the mapping s → a∗(s), where the optimal action a∗(s) is defined in Equation 2.18. That is, learning the optimal action for every state is the same as learning a policy, which is our objective.
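As a sketch of this derivation, and assuming Equation 2.18 takes the usual greedy form a∗(s) = argmax_a [R(s, a) + γ Σ_{s′} P(s′ | s, a) V∗(s′)], the optimal policy can be read off a known V∗ in a single sweep over the state space (reusing the illustrative tabular arrays above):

import numpy as np

def greedy_policy(R, P, gamma, V):
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
    Q = R + gamma * P @ V
    # a*(s) = argmax over actions; ties broken by argmax's first-index rule
    return Q.argmax(axis=1)

Conversely, given π∗, V∗ is simply the value of following π∗, so each object determines the other.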
3.1.1 The RL problem
If we know all elements of the MDP, we can compute the solution before ever executing an action in the environment. In machine learning (ML), we typically call this planning.
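For instance, with the model fully specified, a planner can run value iteration offline and only then act. A minimal sketch, assuming the tabular arrays introduced above (the function name and tolerance are illustrative):

import numpy as np

def value_iteration(R, P, gamma, tol=1e-8):
    # Repeatedly apply the Bellman optimality backup until V stops changing;
    # no interaction with the environment is needed at any point.
    V = np.zeros(R.shape[0])
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)   # max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new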