Large language models: reasoning and reinforcement learning

Miquel Noguer i Alonso, Daniel Bloch and David Pacheco Aznar

7.1 REINFORCEMENT LEARNING AND LARGE LANGUAGE MODELS

This chapter covers the basics of reinforcement learning (RL) as applied to large language models (LLMs). Integrating RL into an LLM's training pipeline typically involves four key steps.

  1. Pretraining. The LLM is first pretrained on a vast corpus of text data to learn the basics of language understanding and generation.

  2. Supervised fine-tuning. The model is then fine-tuned with supervised learning on a data set designed to reflect specific tasks or preferences, often built from manually curated input–output pairs.

  3. Human feedback collection. Humans review the model’s outputs in various contexts and provide feedback, which can involve rating the outputs or selecting preferred ones from pairs of options.

  4. Reinforcement learning. The collected feedback is used to fit a reward function that measures how well the model’s outputs align with human values. The model is then further fine-tuned with RL to maximise this reward. Two widely used techniques are reinforcement learning from human feedback (RLHF; Christiano et al 2017) and reinforcement learning from artificial intelligence (AI) feedback (RLAIF; Lee et al 2023), in which the preference labels are produced by another model rather than by humans. A minimal sketch of steps 3 and 4 appears after this list.

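To make steps 3 and 4 concrete, the sketch below shows the two pieces they describe: a reward model trained on pairwise human preferences with a Bradley-Terry style loss, and a simplified policy-gradient update that maximises the learned reward while a Kullback-Leibler (KL) penalty keeps the policy close to a frozen reference model. The tiny GRU-based models, random placeholder data, single REINFORCE-style step and the beta coefficient are illustrative assumptions only; production RLHF pipelines use full transformer LLMs and PPO-style optimisation (Christiano et al 2017).

# A minimal, self-contained sketch of steps 3-4 above. All models, sizes and
# data are placeholders, not the architecture or recipe of any particular LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, EMBED, HIDDEN = 100, 32, 64

class TinyLM(nn.Module):
    """Toy autoregressive language model standing in for a pretrained policy."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):                       # tokens: (batch, seq)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                          # logits: (batch, seq, VOCAB)

class RewardModel(nn.Module):
    """Scores a whole sequence; trained on human preference pairs (step 3)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.score = nn.Linear(HIDDEN, 1)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.score(h[:, -1]).squeeze(-1)      # one scalar per sequence

# Step 3: fit the reward model on pairwise preferences. The Bradley-Terry loss
# pushes the reward of the preferred output above that of the rejected one.
reward_model = RewardModel()
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
preferred = torch.randint(0, VOCAB, (8, 12))         # placeholder "chosen" outputs
rejected = torch.randint(0, VOCAB, (8, 12))          # placeholder "rejected" outputs
rm_loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
rm_opt.zero_grad()
rm_loss.backward()
rm_opt.step()

# Step 4: fine-tune the policy against the learned reward with a single
# REINFORCE-style step; the KL term keeps it close to the frozen reference.
policy = TinyLM()
reference = TinyLM()
reference.load_state_dict(policy.state_dict())       # frozen copy of the SFT model
for p in reference.parameters():
    p.requires_grad_(False)

pg_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
beta = 0.1                                            # illustrative KL penalty weight

prompts = torch.randint(0, VOCAB, (4, 6))             # placeholder prompts
with torch.no_grad():                                 # crude stand-in for sampling responses
    responses = policy(prompts).argmax(-1)
seqs = torch.cat([prompts, responses], dim=1)

# Per-token log-probabilities of the generated sequences under both models.
logp = F.log_softmax(policy(seqs)[:, :-1], -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)
ref_logp = F.log_softmax(reference(seqs)[:, :-1], -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)

with torch.no_grad():
    rewards = reward_model(seqs)                      # scalar reward per sequence

kl = (logp - ref_logp).sum(-1)                        # sample-based KL estimate per sequence
pg_loss = -(rewards * logp.sum(-1)).mean() + beta * kl.mean()
pg_opt.zero_grad()
pg_loss.backward()
pg_opt.step()

print(f"reward-model loss {rm_loss.item():.3f}, policy loss {pg_loss.item():.3f}")

The KL term plays the role of the penalty used in RLHF objectives: it discourages the fine-tuned policy from drifting too far from the supervised reference model while the learned reward is being maximised.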