
Learning the optimal policy

Miquel Noguer i Alonso, Daniel Bloch and David Pacheco Aznar

3.1 OVERVIEW

In Section 2 we defined a Markov decision process (MDP) as a 4-tuple (S, R, P, γ) comprising: a function R(s, a) that returns the reward received for taking action a in state s; a transition probability function P(s′ | s, a) specifying the probability that the environment transitions to state s′ if the agent takes action a in state s; and a policy π(s, a) giving the probability of taking action a in state s. The MDP is therefore a probabilistic model.
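To make these objects concrete, here is a minimal sketch of a tabular MDP in Python; the names (n_states, n_actions, R, P, pi, gamma) and the random values are purely illustrative and are not taken from the chapter.

```python
# Minimal tabular MDP sketch; all names and values here are illustrative.
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.9                                    # discount factor

# R[s, a]: reward received for taking action a in state s
R = np.random.rand(n_states, n_actions)

# P[s, a, s2]: probability of transitioning to state s2 after taking a in s
P = np.random.rand(n_states, n_actions, n_states)
P /= P.sum(axis=2, keepdims=True)              # each (s, a) distribution sums to 1

# pi[s, a]: probability of taking action a in state s (here, a uniform random policy)
pi = np.full((n_states, n_actions), 1.0 / n_actions)
```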

The goal of reinforcement learning (RL) is to find a policy π that maximises the expected future (discounted) reward of an agent. To do so we can learn

  • the optimal policy π of that agent, or
  • the optimal value function V(s).

This is because we can always derive π from V(s) and vice versa: the optimal policy is the mapping s → a(s), where the optimal action a(s) is defined in Equation 2.18. In other words, learning the optimal action for every state is the same as learning a policy, which is our objective.
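As a rough illustration of this equivalence, the sketch below recovers a deterministic policy from a value function for the tabular MDP sketched earlier. It assumes Equation 2.18 takes the usual greedy form a(s) = argmax_a [R(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′)], which is an assumption about that equation, not a quote of it.

```python
# Sketch: extract a greedy (deterministic) policy from a value function V.
# Assumes the illustrative tabular arrays R, P and discount gamma defined above.
import numpy as np

def greedy_policy(V, R, P, gamma):
    """Return, for each state s, argmax_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]."""
    Q = R + gamma * np.einsum("sat,t->sa", P, V)   # Q[s, a]: one-step look-ahead value
    return Q.argmax(axis=1)                        # deterministic action index per state

# Example usage with the arrays defined earlier:
# a_of_s = greedy_policy(np.zeros(n_states), R, P, gamma)   # a_of_s[s] is a(s)
```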

3.1.1 The RL problem

Knowing all the elements of an MDP, we can compute the solution before we ever execute an action in the environment. In machine learning (ML), we typically call
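For a fully specified tabular MDP, one standard way to compute the solution offline is value iteration; the sketch below illustrates that general idea only, is not necessarily the method developed in this chapter, and reuses the illustrative R, P and gamma arrays from the earlier sketches.

```python
# Sketch: value iteration on a known tabular MDP (illustrative arrays defined above).
import numpy as np

def value_iteration(R, P, gamma, tol=1e-8, max_iters=10_000):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
        V_new = (R + gamma * np.einsum("sat,t->sa", P, V)).max(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return V

# Example usage: compute V offline, then read off the greedy policy defined earlier.
# V_star = value_iteration(R, P, gamma)
# a_of_s = greedy_policy(V_star, R, P, gamma)
```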
