RL Lecture Notes: Sequential Decision Making

Reinforcement Learning - Lecture 1 - Emma Brunskill

Learning to Make Good Sequential Decisions

Key aspects of reinforcement learning:

  • Optimization: Finding policies that maximize expected reward
  • Exploration: Balancing exploration vs exploitation
  • Generalization: Transferring knowledge to new situations
  • Delayed Consequences: Decisions have long-term ramifications

Challenges

When planning: Decisions involve reasoning about not just immediate benefit but also longer-term ramifications.

When learning: Temporal credit assignment is hard, because it is unclear which earlier decisions caused later high or low rewards.

Policy

A policy is a mapping from past experience to action.
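As a minimal sketch of this definition, a policy can be written as a function from the history so far to an action. The history format (a list of observation, action, reward triples), the two-action setting, and the name `repeat_if_rewarded` are assumptions made for illustration, not part of the lecture.

```python
from typing import Callable, List, Tuple

# Assumed history format: one (observation, action, reward) triple per past step.
Step = Tuple[int, int, float]
History = List[Step]
Policy = Callable[[History], int]


def repeat_if_rewarded(history: History) -> int:
    """Hypothetical two-action policy (actions 0 and 1).

    Repeats the last action if it earned positive reward, otherwise switches.
    """
    if not history:
        return 0  # arbitrary default when there is no experience yet
    _, last_action, last_reward = history[-1]
    return last_action if last_reward > 0 else 1 - last_action


policy: Policy = repeat_if_rewarded
```

This particular policy only looks at the most recent step, but the signature allows a policy to depend on the entire history.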

Comparison with Other Paradigms

Supervised Learning: Typically makes one decision instead of a sequence of decisions.

Imitation Learning: Learns from the experience of others and assumes demonstrations of good policies are available as input. Combining imitation learning with RL seems promising.

Sequential Decision Making Under Uncertainty

Goal: Maximize the total expected future reward (the world is stochastic so the agent maximizes rewards in expectation).
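To illustrate what "in expectation" means here, the sketch below estimates the expected total reward of a toy stochastic process by averaging many simulated episodes. The environment (reward 1 at each step with probability `p_success`), the function names, and the parameter values are all assumptions chosen for illustration.

```python
import random


def rollout(p_success: float, horizon: int, rng: random.Random) -> float:
    """Total reward of one episode: reward 1 per step with probability p_success."""
    return sum(1.0 for _ in range(horizon) if rng.random() < p_success)


def estimate_expected_return(
    p_success: float, horizon: int, episodes: int = 10_000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the expected total reward over `episodes` rollouts."""
    rng = random.Random(seed)
    total = sum(rollout(p_success, horizon, rng) for _ in range(episodes))
    return total / episodes


# Any single episode is random, but the average approaches p_success * horizon.
estimate = estimate_expected_return(p_success=0.7, horizon=10)
```

In this toy case the exact expectation is `0.7 * 10 = 7`, so the estimate can be checked against it. An agent comparing two choices would prefer the one with the higher expected total, even though a single episode of the worse choice may happen to score higher.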

The Markov Assumption

State \(s_t\) is Markov if and only if:

$$p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_{t}, a_{t})$$

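The current state is a sufficient statistic of the history: given \(s_t\) and \(a_t\), the full history \(h_t\) adds no further information about \(s_{t+1}\).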
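The Markov property can be checked empirically on a toy example. The sketch below simulates a two-state chain whose next state is sampled from the current state only, then compares the empirical next-state frequencies when conditioning on one extra step of history. The transition table `P`, the single implicit action, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical transition probabilities P[s] = [p(next=0 | s), p(next=1 | s)].
# A single action is assumed, so the action is omitted from the conditioning.
P = {0: [0.9, 0.1], 1: [0.4, 0.6]}


def simulate(steps: int, seed: int = 0) -> list:
    """Sample a state trajectory; each next state depends only on the current one."""
    rng = random.Random(seed)
    states = [0]
    for _ in range(steps):
        current = states[-1]
        states.append(rng.choices([0, 1], weights=P[current])[0])
    return states


def prob_next_is_one(states: list, prev: int, cur: int) -> float:
    """Empirical p(s_{t+1} = 1 | s_{t-1} = prev, s_t = cur)."""
    counts = Counter()
    for a, b, c in zip(states, states[1:], states[2:]):
        if a == prev and b == cur:
            counts[c] += 1
    return counts[1] / (counts[0] + counts[1])


trajectory = simulate(100_000)
# Both estimates should be close to P[1][1] = 0.6, whatever the earlier state was.
given_prev_0 = prob_next_is_one(trajectory, prev=0, cur=1)
given_prev_1 = prob_next_is_one(trajectory, prev=1, cur=1)
```

If the two estimates differed by more than sampling noise, the earlier state would carry extra predictive information and the state would not be Markov.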

Problem Variants

  • Finite horizon vs. infinite horizon
  • Stationary vs. non-stationary