Large language models: reasoning and reinforcement learning
Miquel Noguer i Alonso, Daniel Bloch and David Pacheco Aznar
Preface
Introduction
Markov decision problems
Learning the optimal policy
Reinforcement learning revisited
Temporal difference learning revisited
Stochastic approximation in Markov decision processes
Large language models: reasoning and reinforcement learning
Deep reinforcement learning
Applications of artificial intelligence in finance
Pricing options with temporal difference backpropagation
Pricing American options
Daily price limits
Portfolio optimisation
Appendix
7.1 REINFORCEMENT LEARNING AND LARGE LANGUAGE MODELS
This chapter covers the basics of reinforcement learning (RL) in large language models (LLMs). The integration of RL into LLMs typically involves several key steps.
Pretraining. The LLM is first pretrained on a vast corpus of text data to learn the basics of language understanding and generation.
Supervised fine-tuning. The model then undergoes supervised learning, where it is fine-tuned on a data set designed to reflect specific tasks or preferences, often based on manually curated input–output pairs.
Human feedback collection. Humans review the model’s outputs in various contexts and provide feedback, which can involve rating the outputs or selecting preferred ones from pairs of options.
Reinforcement learning. The feedback collected is used to define a reward function, which measures the alignment of the model’s outputs with human values. The model is further fine-tuned through RL, with the aim of maximising this reward function. Two widely used techniques are reinforcement learning from human feedback (RLHF; Christiano et al 2017) and reinforcement learning from artificial intelligence (AI) feedback (RLAIF; Lee et al 2023).
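The reward-modelling stage linking steps 3 and 4 can be sketched numerically. A common choice, used in the RLHF literature, is a Bradley–Terry model: given pairs in which a human preferred one response over another, fit a reward function so that preferred responses score higher. The sketch below is illustrative only — the toy data, the linear reward r(x) = w·x and all variable names are assumptions, not the chapter’s method — but it shows the mechanics of turning pairwise preferences into a trainable reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def bradley_terry_loss(w, chosen, rejected):
    """Negative log-likelihood that each 'chosen' response beats its
    'rejected' counterpart under the linear reward r(x) = w @ x."""
    margin = chosen @ w - rejected @ w
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))

# Toy preference data: responses are feature vectors, and preferred
# ones tend to score higher along a hidden direction true_w.
true_w = np.array([1.0, -0.5, 2.0])
chosen = rng.normal(size=(256, 3)) + 0.5 * true_w
rejected = rng.normal(size=(256, 3)) - 0.5 * true_w

# Fit the reward weights by plain gradient descent on the
# (convex) Bradley-Terry negative log-likelihood.
w = np.zeros(3)
lr = 0.5
for _ in range(200):
    margin = chosen @ w - rejected @ w
    sig = 1.0 / (1.0 + np.exp(-margin))          # P(chosen beats rejected)
    grad = -((1.0 - sig)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# Fraction of pairs the learned reward ranks correctly.
acc = np.mean(chosen @ w > rejected @ w)
```

In step 4 of the pipeline, the fitted reward (here `w`) would score candidate model outputs, and the policy would be fine-tuned to maximise that score, typically with a policy-gradient method and a penalty keeping the model close to its supervised starting point.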
7.1.1
Copyright Infopro Digital Limited. All rights reserved.
As outlined in our terms and conditions, https://www.infopro-digital.com/terms-and-conditions/subscriptions/ (point 2.4), printing is limited to a single copy.
If you would like to purchase additional rights please email info@risk.net