RLHF and DPO Fine-Tuning Theory: From Basic Concepts to Advanced Implementation with Llama Models
- RAHUL KUMAR
- Sep 12
Introduction
Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) represent the cutting edge of language model alignment, transforming how AI systems learn to behave according to human preferences. These techniques go beyond traditional supervised fine-tuning by incorporating human judgment and preference data to create models that are not just accurate, but genuinely helpful, safe, and aligned with human values.
This comprehensive guide explores the theoretical foundations, mathematical principles, and practical implementations of both RLHF and DPO, with specific focus on fine-tuning Llama models. Whether you're preparing for technical interviews or building production AI systems, mastering these alignment techniques will position you at the forefront of responsible AI development.
Understanding the Alignment Challenge
Why Traditional Fine-Tuning Isn't Enough
Traditional supervised fine-tuning (SFT) teaches models to imitate patterns in training data, but it cannot capture the subjective, contextual nature of human preferences. Consider these examples:
Task: Write a customer service response to a complaint.
SFT Approach: Train on thousands of customer service transcripts.
Problem: The model learns to mimic responses but doesn't understand what makes one response better than another in terms of empathy, helpfulness, or appropriateness.
RLHF/DPO Approach: Show the model multiple response options and teach it which ones humans prefer and why.
Advantage: The model learns the subtle qualities that make responses genuinely helpful rather than just formally correct.
The Subjectivity Problem
Many AI tasks involve subjective judgments that cannot be captured by traditional loss functions:
Content Tone: Is this response too formal or too casual?
Safety Considerations: Does this response encourage harmful behavior?
Helpfulness: Does this response actually solve the user's problem?
Cultural Sensitivity: Is this response appropriate across different cultural contexts?
These nuanced qualities require human judgment, making preference-based learning essential for creating truly aligned AI systems.
Reinforcement Learning from Human Feedback (RLHF): The Foundation
RLHF Conceptual Framework
RLHF transforms human preferences into a reward signal that guides model training through reinforcement learning. The process mirrors how humans learn from feedback: we try different approaches, receive feedback on our performance, and adjust our behavior to maximize positive outcomes.
Think of RLHF like teaching a student to write essays. Instead of just showing them good essays (supervised learning), you give them feedback on their drafts: "This paragraph is too vague," "This argument is compelling," "This conclusion doesn't follow from your evidence." The student learns to internalize these preferences and applies them to future writing.
The Three-Stage RLHF Pipeline
Stage 1: Supervised Fine-Tuning (SFT) Foundation
Purpose: Establish basic task competence and instruction-following capabilities.
The SFT stage adapts a pre-trained language model to follow instructions and respond appropriately to prompts. This creates a stable foundation for subsequent preference learning.
Key Characteristics:
Uses high-quality human-written prompt-response pairs
Establishes consistent formatting and response style
Reduces harmful or inappropriate outputs from the base model
Creates a reference point for measuring preference-based improvements
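As a concrete illustration (not from the original post), an SFT training example is simply a prompt paired with a single reference response; the field names below are arbitrary:
# Hypothetical SFT example: one prompt, one reference response
sft_example = {
    "prompt": "Respond to a customer asking why their order is delayed.",
    "response": "I'm sorry for the delay with your order. It left our warehouse yesterday and should arrive within two business days. I've added a discount to your account for the inconvenience."
}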
Stage 2: Reward Model Training
Purpose: Create a computational proxy for human preferences.
The reward model learns to predict human preferences by training on comparison data where humans rank different responses to the same prompt.
Training Process:
Data Collection: Present human annotators with prompt-response pairs and ask them to rank responses by quality
Comparison Dataset: Create datasets of (prompt, response_A, response_B, preference) tuples
Model Training: Train a model to predict which response humans will prefer
Validation: Test the reward model's predictions against held-out human preferences
Mathematical Foundation: The reward model uses the Bradley-Terry model to convert pairwise comparisons into scalar rewards:
P(y_1 \succ y_2 \mid x) = \frac{\exp(r(x, y_1))}{\exp(r(x, y_1)) + \exp(r(x, y_2))}
Where:
x is the prompt
y_1, y_2 are two different responses
r(x, y) is the reward function
y_1 ≻ y_2 means y_1 is preferred over y_2
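As an illustration (reward values made up), the Bradley-Terry probability reduces to a sigmoid of the reward difference, which is exactly the quantity a reward model is trained to push up for the preferred response:
import torch

# Bradley-Terry preference probability from two scalar rewards (made-up values)
reward_chosen = torch.tensor(1.8)    # r(x, y1)
reward_rejected = torch.tensor(0.4)  # r(x, y2)

# P(y1 ≻ y2 | x) = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2)
prob_chosen = torch.sigmoid(reward_chosen - reward_rejected)

# Reward-model training minimizes the negative log of this probability
loss = -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected)
print(round(prob_chosen.item(), 3), round(loss.item(), 3))  # 0.802 0.22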
Stage 3: Policy Optimization with PPO
Purpose: Update the language model to maximize rewards while maintaining stability.
This stage uses Proximal Policy Optimization (PPO) to adjust the model's parameters based on reward model feedback, creating the final aligned model.
The Bradley-Terry Model Deep Dive
The Bradley-Terry model provides the mathematical foundation for converting human preference comparisons into trainable reward signals.
Mathematical Intuition
Consider two chess players with Bradley-Terry strength scores of 2000 and 1600 (the same family of paired-comparison models underlies chess rating systems). The model predicts that the stronger player wins with probability:
P(\text{Player A wins}) = \frac{2000}{2000 + 1600} \approx 0.556
In RLHF, we replace player ratings with response quality scores, allowing us to predict which response humans will prefer.
Application to Language Models
For language model responses, the Bradley-Terry model becomes:
P(\text{Response A preferred}) = \frac{\exp(r(\text{prompt}, \text{Response A}))}{\exp(r(\text{prompt}, \text{Response A})) + \exp(r(\text{prompt}, \text{Response B}))}
This formulation allows us to:
Convert subjective preferences into objective training signals
Handle the inherent uncertainty in human preferences
Scale preference learning to large datasets
Proximal Policy Optimization (PPO): The Engine of RLHF
Understanding PPO's Role
PPO serves as the optimization engine that updates the language model based on reward model feedback. Unlike supervised learning, which has clear target outputs, PPO must balance exploration (trying new responses) with exploitation (improving known good responses) while maintaining training stability.
The Policy Update Challenge
Traditional policy gradient methods can make dramatic parameter updates that destabilize training. PPO solves this through clipped objectives that limit how much the policy can change in a single update.
The PPO Objective Function
PPO's core innovation is the clipped surrogate objective:
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]
Where:
r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the policy ratio
Â_t is the advantage estimate (how much better this action is than average)
ε is the clipping parameter (typically 0.2)
Intuitive Understanding of Clipping
The clipping mechanism works like a safety governor on an engine:
When r_t(θ) > 1 + ε: The new policy assigns much higher probability to this action than the old policy. Clipping prevents excessive updates.
When r_t(θ) < 1 - ε: The new policy assigns much lower probability to this action. Clipping prevents the action from being completely eliminated.
When 1 - ε ≤ r_t(θ) ≤ 1 + ε: Normal gradient updates proceed without modification.
PPO in the RLHF Context
import torch

def ppo_loss(old_logprobs, new_logprobs, advantages, epsilon=0.2):
    """
    Compute the PPO clipped objective loss.
    Args:
        old_logprobs: Log probabilities from the previous policy
        new_logprobs: Log probabilities from the current policy
        advantages: Advantage estimates (rewards relative to a baseline)
        epsilon: Clipping parameter
    """
    # Compute probability ratios r_t(theta) = pi_theta / pi_theta_old
    ratios = torch.exp(new_logprobs - old_logprobs)
    # Compute surrogate losses
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    # Take the minimum for conservative updates and negate to obtain a loss
    return -torch.min(surr1, surr2).mean()
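As a quick sanity check of the function above (toy values, not from the article), an action whose probability ratio has grown past 1 + ε contributes through the clipped ratio of 1.2 rather than its raw ratio:
import torch

# Toy tensors: the first action became much more likely under the new policy
old_logprobs = torch.tensor([-1.0, -2.0, -0.5])
new_logprobs = torch.tensor([-0.4, -2.1, -0.5])
advantages = torch.tensor([1.0, -0.5, 0.2])

loss = ppo_loss(old_logprobs, new_logprobs, advantages, epsilon=0.2)
# Ratio for the first action is exp(0.6) ≈ 1.82, clipped to 1.2 before weighting
print(loss)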
RLHF Training Loop
The complete RLHF training process integrates all components:
def rlhf_training_step(policy_model, reward_model, reference_model, prompts):
    """
    Single (simplified) RLHF training iteration using PPO.
    Assumes optimizer, ppo_epochs, and compute_advantages are defined elsewhere.
    """
    # 1. Generate responses with the current policy
    with torch.no_grad():
        responses = policy_model.generate(prompts, do_sample=True)
        old_logprobs = policy_model(prompts, responses).logprobs
        # 2. Compute rewards using the reward model
        rewards = reward_model(prompts, responses)
        # 3. Per-token KL estimate against the reference model to prevent drift
        ref_logprobs = reference_model(prompts, responses).logprobs
        kl_penalty = old_logprobs - ref_logprobs
    # 4. Combine rewards with the KL penalty (0.1 is the KL coefficient)
    total_rewards = rewards - 0.1 * kl_penalty
    # 5. Compute advantages (how much better than expected)
    advantages = compute_advantages(total_rewards)
    # 6. PPO optimization steps
    for _ in range(ppo_epochs):
        new_logprobs = policy_model(prompts, responses).logprobs
        loss = ppo_loss(old_logprobs, new_logprobs, advantages)
        # Backpropagate and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
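The loop above assumes a compute_advantages helper. Production implementations typically use Generalized Advantage Estimation with a learned value baseline; a minimal stand-in, shown here only as a sketch, is to whiten the combined rewards so they have zero mean and unit variance:
import torch

def compute_advantages(rewards, eps=1e-8):
    """Minimal stand-in: whitened rewards as advantages (no value baseline)."""
    rewards = rewards.float()
    return (rewards - rewards.mean()) / (rewards.std() + eps)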
Direct Preference Optimization (DPO): The Elegant Alternative
The DPO Breakthrough
DPO revolutionizes preference learning by eliminating the need for explicit reward models and reinforcement learning. Instead of the complex three-stage RLHF pipeline, DPO directly optimizes the language model on preference data using a simple classification objective.
The key insight of DPO is that the optimal policy for RLHF can be derived in closed form, allowing direct optimization without intermediate reward models.
Mathematical Foundation of DPO
Starting from the RLHF Objective
RLHF seeks to find the optimal policy π* that maximizes expected rewards while staying close to a reference policy π_ref:
\pi^* = \arg\max_{\pi}\; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot \mid x)}[r(x, y)] - \beta\, \mathbb{E}_{x \sim D}\big[\mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)\big]
The DPO Reparameterization
DPO shows that this optimal policy can be expressed as:
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\frac{r^*(x, y)}{\beta}\Big)
Where Z(x) is a normalization constant and r^*(x, y) is the optimal reward function.
Crucially, this can be rearranged to express the reward in terms of the policy:
r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
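For two responses to the same prompt, the β log Z(x) term is identical for both and drops out of the reward difference that the Bradley-Terry model depends on:
r^*(x, y_w) - r^*(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}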
The DPO Loss Function
Substituting this back into the Bradley-Terry model and noting that the β log Z(x) terms cancel, we get the DPO loss:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
Where:
y_w is the preferred (winning) response
y_l is the less preferred (losing) response
σ is the sigmoid function
β controls the strength of the KL penalty
DPO Implementation
import torch
import torch.nn.functional as F

def dpo_loss(model, reference_model, prompts, chosen_responses, rejected_responses, beta=0.1):
    """
    Compute the Direct Preference Optimization loss.
    Here `model(prompts, responses).log_probs` stands for the summed log-probability
    of the response tokens given the prompt (a helper you would implement).
    Args:
        model: The policy model being trained
        reference_model: The reference model (typically the SFT model)
        prompts: Input prompts
        chosen_responses: Human-preferred responses
        rejected_responses: Human-rejected responses
        beta: Temperature parameter controlling KL penalty strength
    """
    # Log probabilities for chosen responses under policy and reference
    chosen_logprobs = model(prompts, chosen_responses).log_probs
    chosen_ref_logprobs = reference_model(prompts, chosen_responses).log_probs
    # Log probabilities for rejected responses under policy and reference
    rejected_logprobs = model(prompts, rejected_responses).log_probs
    rejected_ref_logprobs = reference_model(prompts, rejected_responses).log_probs
    # Log ratios scaled by beta act as implicit rewards
    chosen_rewards = beta * (chosen_logprobs - chosen_ref_logprobs)
    rejected_rewards = beta * (rejected_logprobs - rejected_ref_logprobs)
    # DPO loss via the Bradley-Terry model (logsigmoid is numerically stable)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
DPO vs RLHF Comparison
| Aspect | RLHF | DPO |
| --- | --- | --- |
| Training Stages | 3 (SFT → Reward Model → PPO) | 2 (SFT → DPO) |
| Complexity | High (RL training, reward modeling) | Low (supervised learning) |
| Stability | Challenging (RL instability) | High (stable supervised training) |
| Preference Data Usage | Used to train a reward model, which then drives RL | Optimized on directly |
| Computational Cost | High (multiple models, sampling) | Lower (single model, no sampling) |
| Hyperparameter Sensitivity | High (PPO hyperparameters) | Lower (fewer hyperparameters) |
Implementing RLHF and DPO with Llama Models
Dataset Preparation for Preference Learning
Both RLHF and DPO require high-quality preference datasets where human annotators compare different responses to the same prompt.
Preference Data Format
# Example preference data structure
preference_example = {
"prompt": "Explain the concept of quantum computing to a high school student.",
"chosen": "Quantum computing uses the strange properties of quantum mechanics to process information in fundamentally new ways. Unlike regular computers that use bits (0 or 1), quantum computers use quantum bits or 'qubits' that can be 0, 1, or both simultaneously. This allows them to explore many possible solutions to a problem at once, making them potentially much faster for certain types of calculations.",
"rejected": "Quantum computing is a computational paradigm that leverages quantum mechanical phenomena such as superposition and entanglement to perform computations on quantum bits or qubits, which can exist in superposed states enabling parallel computation across multiple dimensions of the solution space."
}
The "chosen" response is clearer, more accessible, and better suited for the target audience, while the "rejected" response is overly technical and difficult to understand.
RLHF Implementation with Llama
Stage 1: Supervised Fine-Tuning Setup
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# Load the tokenizer (Llama has no pad token by default)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Load base Llama model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,
    device_map="auto",
    trust_remote_code=True
)
# Configure LoRA for efficient fine-tuning
lora_config = LoraConfig(
r=16, # Low rank dimension
lora_alpha=32, # LoRA scaling parameter
target_modules=["q_proj", "v_proj"], # Target attention layers
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# SFT training
sft_trainer = SFTTrainer(
model=model,
train_dataset=sft_dataset,
tokenizer=tokenizer,
max_seq_length=512,
args=TrainingArguments(
output_dir="./llama-sft",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-4,
fp16=True
)
)
sft_trainer.train()
Stage 2: Reward Model Training
import torch
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)

# Load reward model architecture (scalar reward head on top of Llama)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=1,  # Single scalar reward output
    load_in_4bit=True
)

class PairwiseRewardTrainer(Trainer):
    """Trainer that computes the Bradley-Terry loss on preference pairs."""
    def compute_loss(self, model, inputs, return_outputs=False):
        chosen_rewards = model(input_ids=inputs["chosen_input_ids"]).logits
        rejected_rewards = model(input_ids=inputs["rejected_input_ids"]).logits
        # Bradley-Terry loss: maximize P(chosen preferred over rejected)
        loss = -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
        return (loss, (chosen_rewards, rejected_rewards)) if return_outputs else loss

# Train the reward model on preference data
reward_trainer = PairwiseRewardTrainer(
    model=reward_model,
    train_dataset=preference_dataset,
    args=TrainingArguments(
        output_dir="./llama-reward-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=5e-5
    )
)
reward_trainer.train()
Stage 3: PPO Training
import numpy as np
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Configure PPO training
ppo_config = PPOConfig(
    model_name="./llama-sft",
    learning_rate=1e-5,
    batch_size=32,
    mini_batch_size=4,
    ppo_epochs=4,
    early_stopping=True,
    target_kl=0.1,   # KL divergence limit
    cliprange=0.2    # PPO clipping parameter
)

# Initialize the PPO trainer (trl's PPO policy needs a value head;
# rewards are computed outside the trainer and passed to step())
sft_model = AutoModelForCausalLMWithValueHead.from_pretrained("./llama-sft")
reference_model = AutoModelForCausalLMWithValueHead.from_pretrained("./llama-sft")
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=sft_model,
    ref_model=reference_model,
    tokenizer=tokenizer
)

# PPO training loop (schematic: prompts are assumed to be tokenized query tensors)
for epoch in range(num_epochs):
    for batch in dataloader:
        # Generate responses with the current policy
        responses = ppo_trainer.generate(batch["prompts"])
        # Score each (prompt, response) pair with the reward model
        rewards = [reward_model(prompt, response)
                   for prompt, response in zip(batch["prompts"], responses)]
        # PPO update
        stats = ppo_trainer.step(batch["prompts"], responses, rewards)
    # Logging
    print(f"Epoch {epoch}, Reward: {np.mean([float(r) for r in rewards]):.3f}")
DPO Implementation with Llama
DPO offers a much simpler implementation path compared to full RLHF:
from trl import DPOTrainer
# Load SFT model as starting point
model = AutoModelForCausalLM.from_pretrained("./llama-sft")
reference_model = AutoModelForCausalLM.from_pretrained("./llama-sft")
# Configure DPO training
dpo_trainer = DPOTrainer(
model=model,
ref_model=reference_model,
train_dataset=preference_dataset,
tokenizer=tokenizer,
args=TrainingArguments(
output_dir="./llama-dpo",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=5e-7, # Lower learning rate for stability
fp16=True,
logging_steps=10,
save_steps=500,
),
beta=0.1, # DPO temperature parameter
max_length=512,
max_prompt_length=256,
)
# Train with DPO
dpo_trainer.train()
Advanced Techniques and Optimizations
Memory-Efficient Training Strategies
Training large language models like Llama requires careful memory management:
Gradient Checkpointing
# Enable gradient checkpointing to trade compute for memory
model.gradient_checkpointing_enable()
# Disable cache during training to save memory
model.config.use_cache = False
DeepSpeed Integration
# DeepSpeed ZeRO configuration for distributed training
deepspeed_config = {
"zero_optimization": {
"stage": 2, # ZeRO stage 2: optimizer state sharding
"offload_optimizer": {
"device": "cpu" # Offload optimizer states to CPU
},
"contiguous_gradients": True,
"overlap_comm": True
},
"fp16": {"enabled": True},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
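With the Hugging Face Trainer integration, this configuration (or a path to an equivalent JSON file) can be passed directly via the deepspeed argument of TrainingArguments; a minimal sketch:
from transformers import TrainingArguments

# Attach the DeepSpeed config defined above to a standard training run
training_args = TrainingArguments(
    output_dir="./llama-dpo-deepspeed",
    per_device_train_batch_size=4,
    fp16=True,
    deepspeed=deepspeed_config  # dict or path to a DeepSpeed JSON config
)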
Hyperparameter Optimization
RLHF Hyperparameters
PPO Learning Rate: Start with 1e-5 and adjust based on reward progression.
KL Penalty: Balance between staying close to the reference model (high β) and maximizing rewards (low β).
Clipping Parameter: 0.2 is standard, but may need adjustment for different model sizes.
DPO Hyperparameters
Beta Parameter: Controls the strength of the KL penalty (0.1-0.5 is a typical range).
Learning Rate: Much lower than SFT (1e-6 to 1e-5).
Batch Size: Larger batches improve stability.
Evaluation Strategies
Reward Model Evaluation
def evaluate_reward_model(reward_model, test_dataset):
"""
Evaluate reward model accuracy on held-out preference data
"""
correct_predictions = 0
total_comparisons = 0
for example in test_dataset:
chosen_reward = reward_model(example['prompt'], example['chosen'])
rejected_reward = reward_model(example['prompt'], example['rejected'])
if chosen_reward > rejected_reward:
correct_predictions += 1
total_comparisons += 1
accuracy = correct_predictions / total_comparisons
return accuracy
Policy Model Evaluation
def evaluate_policy_model(model, evaluation_prompts):
"""
Generate responses and compute various quality metrics
"""
responses = []
for prompt in evaluation_prompts:
response = model.generate(
prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9
)
responses.append(response)
# Compute metrics: helpfulness, safety, coherence
metrics = compute_response_quality(evaluation_prompts, responses)
return metrics
Interview Preparation Guide
Essential Concepts to Master
For RLHF Questions:
Three-Stage Process: Understand SFT → Reward Model → PPO progression
Bradley-Terry Model: Explain how pairwise comparisons convert to rewards
PPO Mechanics: Describe clipped objectives and policy ratio constraints
KL Divergence Role: Explain how KL penalty prevents model collapse
For DPO Questions:
Mathematical Innovation: Understand the closed-form optimal policy derivation
Simplification Benefits: Compare DPO's single-stage vs RLHF's multi-stage training
Implicit Rewards: Explain how DPO computes rewards without explicit reward models
Loss Function: Understand the DPO objective and its connection to Bradley-Terry
Common Interview Questions and Answers
Q: "Explain the fundamental difference between RLHF and DPO."
A: RLHF uses a three-stage process: first train an SFT model, then train a separate reward model on preference data, and finally use PPO to optimize the policy against the reward model. DPO eliminates the reward-model and PPO stages by deriving the optimal policy in closed form, allowing direct optimization on preference data with a simple classification loss. Under the Bradley-Terry preference model, DPO optimizes the same underlying objective as RLHF, but it is much simpler to implement and train.
Q: "Why is the Bradley-Terry model important in preference learning?"
A: The Bradley-Terry model provides the mathematical framework for converting pairwise preference comparisons into scalar rewards. It models the probability that humans prefer response A over response B as a function of their relative quality scores. This enables us to train both reward models (in RLHF) and directly optimize policies (in DPO) using preference data rather than absolute quality judgments.
Q: "What role does the KL divergence penalty serve in both RLHF and DPO?"
A: KL divergence prevents the model from drifting too far from the reference policy during optimization. Without this constraint, the model might learn to game the reward function by generating responses that score highly but are actually low quality or nonsensical. The KL penalty ensures the model maintains its basic language capabilities while adapting to human preferences.
Q: "How would you choose between RLHF and DPO for a production system?"
A: Choose DPO for most applications because it's simpler, more stable, and requires less computational overhead while achieving comparable results. Use RLHF when you need online learning capabilities, have very specific reward modeling requirements, or want to leverage sophisticated RL techniques like curiosity-driven exploration. DPO is generally preferred for its simplicity and effectiveness.
Best Practices for Production Deployment
Model Evaluation and Safety
Comprehensive Evaluation: Test aligned models across multiple dimensions including helpfulness, harmlessness, and honesty. Use both automated metrics and human evaluation.
Safety Guardrails: Implement additional safety checks beyond alignment training, including content filtering and response validation.
Continuous Monitoring: Track model behavior in production to detect alignment drift or unexpected behaviors.
Scaling Considerations
Data Quality: Focus on high-quality preference data rather than large quantities. 10,000 high-quality comparisons often outperform 100,000 low-quality ones.
Computational Efficiency: DPO's single-stage training makes it much more practical for regular retraining and updates compared to full RLHF.
Infrastructure Requirements: Plan for significant GPU memory requirements, especially for larger models. Consider using parameter-efficient methods like LoRA for cost-effective training.
Future Directions and Research Trends
Constitutional AI and Self-Supervision
Constitutional AI extends preference learning by training models to critique and improve their own outputs based on a set of principles or "constitution."
Multi-Modal Alignment
Extending RLHF and DPO to multi-modal models that process both text and images, requiring preference data that spans multiple modalities.
Scalable Oversight
Research into training AI systems to assist in the preference annotation process, reducing the human labor required for alignment data collection.
Conclusion
RLHF and DPO represent paradigm shifts in how we train AI systems to align with human values and preferences. While RLHF established the theoretical foundations and demonstrated the power of preference-based learning, DPO's elegant simplification makes high-quality alignment accessible to a broader range of practitioners and applications.