
Temporal difference learning revisited

Miquel Noguer i Alonso, Daniel Bloch and David Pacheco Aznar

We have presented temporal difference (TD) procedures as a way of solving the multi-step prediction problem with linear function approximation. In Chapter 4, we revisited the TD procedures in light of the Bellman equations and their operators. However, the initial TD procedures introduced by Sutton (1988) were not derived by directly optimising some objective function. The literature on TD methods has largely ignored the problem of convergence to the true solution, apart from articles by Barnard (1993), who showed that TD(λ) methods are not true gradient descent methods, which limits their convergence guarantees and can lead to instability, and by Baird (1995), who showed that these methods cannot guarantee convergence when used with off-policy training. Several non-gradient-descent approaches to this problem have been developed, but none have been completely satisfactory. For example, Bradtke and Barto (1996) introduced least-squares temporal difference (LSTD) learning as a second-order method that guarantees stability, but at high computational cost. The theoretical understanding of the optimisation objective in both the linear and non-linear function approximation settings came later. For instance
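To make the setting concrete, the sketch below shows TD(λ) prediction with linear function approximation on a small random-walk chain. The environment, feature map, and hyperparameters are illustrative stand-ins, not taken from the paper; with one-hot features the linear approximator is exact, and substituting a coarser feature map phi(s) gives genuine function approximation.

```python
import numpy as np

def td_lambda_linear(num_states=5, num_episodes=2000,
                     alpha=0.05, gamma=1.0, lam=0.9, seed=0):
    """TD(lambda) prediction with linear function approximation on a
    simple random-walk chain (a hypothetical stand-in environment).

    States 0..num_states-1; the agent moves left or right uniformly at
    random, terminating off either end with reward 0 (left) or 1 (right).
    """
    rng = np.random.default_rng(seed)

    def phi(s):
        # One-hot feature vector; replace with any feature map of interest.
        x = np.zeros(num_states)
        x[s] = 1.0
        return x

    w = np.zeros(num_states)                  # V(s) ~ w . phi(s)
    for _ in range(num_episodes):
        s = num_states // 2                    # start in the middle state
        z = np.zeros(num_states)               # eligibility trace
        while True:
            step = 1 if rng.random() < 0.5 else -1
            s_next = s + step
            done = s_next < 0 or s_next >= num_states
            r = 1.0 if (done and s_next >= num_states) else 0.0

            # TD error and accumulating-trace semi-gradient update
            delta = r + (0.0 if done else gamma * w @ phi(s_next)) - w @ phi(s)
            z = gamma * lam * z + phi(s)
            w += alpha * delta * z
            if done:
                break
            s = s_next
    return w

if __name__ == "__main__":
    # True values for this chain are approximately [1/6, 2/6, 3/6, 4/6, 5/6]
    print(np.round(td_lambda_linear(), 3))
```

Note that the update uses the semi-gradient form: the bootstrapped target is treated as a constant, which is precisely why, as Barnard (1993) observed, TD(λ) is not the gradient of any fixed objective function.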
