Large language models: reasoning and reinforcement learning

Miquel Noguer i Alonso, Daniel Bloch and David Pacheco Aznar

7.1 REINFORCEMENT LEARNING AND LARGE LANGUAGE MODELS

This chapter covers the basics of reinforcement learning (RL) as applied to large language models (LLMs). Integrating RL into an LLM's training pipeline typically involves four key steps.

  1. Pretraining. The LLM is first pretrained on a vast corpus of text data to learn the basics of language understanding and generation.

  2. Supervised fine-tuning. The model is then fine-tuned with supervised learning on a data set designed to reflect specific tasks or preferences, often built from manually curated input–output pairs.

  3. Human feedback collection. Humans review the model’s outputs in various contexts and provide feedback, which can involve rating the outputs or selecting preferred ones from pairs of options.

  4. Reinforcement learning. The collected feedback is used to fit a reward function that measures how well the model’s outputs align with human values. The model is then further fine-tuned with RL to maximise this reward. Two widely used techniques are reinforcement learning from human feedback (RLHF; Christiano et al 2017) and reinforcement learning from artificial intelligence (AI) feedback (RLAIF; Lee et al 2023), in which the preference labels are produced by another model rather than by humans. A minimal sketch of steps 3 and 4 appears after this list.

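To make steps 3 and 4 concrete, the sketch below shows the two pieces they describe: a reward model trained on pairwise human preferences with a Bradley-Terry style loss, and a simplified policy-gradient update that maximises the learned reward while a Kullback-Leibler (KL) penalty keeps the policy close to a frozen reference model. The tiny GRU-based models, random placeholder data, single REINFORCE-style step and the beta coefficient are illustrative assumptions only; production RLHF pipelines use full transformer LLMs and PPO-style optimisation (Christiano et al 2017).

# A minimal, self-contained sketch of steps 3-4 above. All models, sizes and
# data are placeholders, not the architecture or recipe of any particular LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, EMBED, HIDDEN = 100, 32, 64

class TinyLM(nn.Module):
    """Toy autoregressive language model standing in for a pretrained policy."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):                       # tokens: (batch, seq)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                          # logits: (batch, seq, VOCAB)

class RewardModel(nn.Module):
    """Scores a whole sequence; trained on human preference pairs (step 3)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.score = nn.Linear(HIDDEN, 1)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.score(h[:, -1]).squeeze(-1)      # one scalar per sequence

# Step 3: fit the reward model on pairwise preferences. The Bradley-Terry loss
# pushes the reward of the preferred output above that of the rejected one.
reward_model = RewardModel()
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
preferred = torch.randint(0, VOCAB, (8, 12))         # placeholder "chosen" outputs
rejected = torch.randint(0, VOCAB, (8, 12))          # placeholder "rejected" outputs
rm_loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
rm_opt.zero_grad()
rm_loss.backward()
rm_opt.step()

# Step 4: fine-tune the policy against the learned reward with a single
# REINFORCE-style step; the KL term keeps it close to the frozen reference.
policy = TinyLM()
reference = TinyLM()
reference.load_state_dict(policy.state_dict())       # frozen copy of the SFT model
for p in reference.parameters():
    p.requires_grad_(False)

pg_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
beta = 0.1                                            # illustrative KL penalty weight

prompts = torch.randint(0, VOCAB, (4, 6))             # placeholder prompts
with torch.no_grad():                                 # crude stand-in for sampling responses
    responses = policy(prompts).argmax(-1)
seqs = torch.cat([prompts, responses], dim=1)

# Per-token log-probabilities of the generated sequences under both models.
logp = F.log_softmax(policy(seqs)[:, :-1], -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)
ref_logp = F.log_softmax(reference(seqs)[:, :-1], -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)

with torch.no_grad():
    rewards = reward_model(seqs)                      # scalar reward per sequence

kl = (logp - ref_logp).sum(-1)                        # sample-based KL estimate per sequence
pg_loss = -(rewards * logp.sum(-1)).mean() + beta * kl.mean()
pg_opt.zero_grad()
pg_loss.backward()
pg_opt.step()

print(f"reward-model loss {rm_loss.item():.3f}, policy loss {pg_loss.item():.3f}")

The KL term plays the role of the penalty used in RLHF objectives: it discourages the fine-tuned policy from drifting too far from the supervised reference model while the learned reward is being maximised.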