A3C: Asynchronous Advantage Actor-Critic

Reading Group: Asynchronous Methods for Deep Reinforcement Learning (Mnih et al., 2016)

Motivation

The sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched or randomly sampled from different time steps, which reduces non-stationarity and decorrelates the updates.
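
As a reminder of the mechanism being replaced, here is a minimal sketch of a uniform replay buffer; the class, method, and parameter names are illustrative, not taken from the paper or any DQN codebase:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling mixes transitions from many different time steps,
        # so a training batch is far less correlated than consecutive experience.
        return random.sample(self.buffer, batch_size)
```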

Drawbacks of Experience Replay

  • Uses more memory and computation per interaction
  • Requires off-policy learning algorithms that can update from data generated by an older policy

Asynchronous RL Framework

The paper presents multi-threaded asynchronous variants of:

  • One-step SARSA
  • One-step Q-learning
  • N-step Q-learning
  • Advantage actor-critic (A3C); see the sketch after this list
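
As a rough sketch of the advantage actor-critic update at the core of A3C, the snippet below computes n-step returns and advantages for one rollout segment collected by a single worker. The function name and exact conventions are assumptions for illustration, not the paper's code:

```python
import numpy as np

def a3c_targets(rewards, values, bootstrap_value, gamma=0.99):
    """Compute n-step returns and advantages for one rollout segment.

    rewards:          r_t collected over the rollout (length n)
    values:           critic estimates V(s_t) for the visited states (length n)
    bootstrap_value:  V(s_n) for the last state, or 0.0 if the episode ended
    """
    R = bootstrap_value
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R              # n-step discounted return
        returns.append(R)
        advantages.append(R - v)       # advantage = return minus value baseline
    return np.array(returns[::-1]), np.array(advantages[::-1])

# The policy loss is -log pi(a_t|s_t) * advantage_t (plus an entropy bonus),
# and the value loss is (return_t - V(s_t))^2, both summed over the segment.
```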

Key insight: Actor-critic is an on-policy policy search method, while Q-learning is an off-policy value-based method. Running multiple agents in parallel on different threads provides diverse, decorrelated experience without a replay memory.
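
A structural sketch of one actor-learner thread follows: each thread keeps its own environment and exploration seed, accumulates gradients over a short rollout, and applies them to shared parameters without locking. The names and the stand-in random gradient are illustrative assumptions, not the paper's implementation:

```python
import threading
import numpy as np

GLOBAL_PARAMS = np.zeros(16)       # shared policy/value parameters
T = [0]                            # shared global step counter
T_MAX, ROLLOUT_LEN, LR = 20_000, 5, 1e-3

def actor_learner(seed):
    """One thread: its own environment instance and exploration, shared parameters."""
    rng = np.random.default_rng(seed)           # different seed => different experience
    while T[0] < T_MAX:
        local_params = GLOBAL_PARAMS.copy()     # synchronize a thread-local copy
        grad = np.zeros_like(local_params)
        for _ in range(ROLLOUT_LEN):
            # ... act in the thread-local environment, store (s, a, r) ...
            grad += rng.normal(size=grad.shape)  # stand-in for policy/value gradients
            T[0] += 1
        # Apply the accumulated gradient to the shared parameters without a lock.
        GLOBAL_PARAMS[:] = GLOBAL_PARAMS - LR * grad

threads = [threading.Thread(target=actor_learner, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```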

Key Benefits

  • Decorrelated updates: Different threads explore different parts of the environment
  • No replay memory needed: Enables on-policy methods like actor-critic
  • CPU-friendly: Runs on multi-core CPUs rather than requiring GPUs

Related Work

  • Gorila Framework: Distributed RL with parameter servers
  • Hogwild!: Lock-free parallel SGD
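
The asynchronous updates are applied in the style of Hogwild!: threads write to shared parameters without locking. A minimal sketch on a toy least-squares problem (the problem setup and all names are illustrative, not from the Hogwild! paper):

```python
import threading
import numpy as np

# Toy problem: recover w_true from noisy linear measurements.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(4000, 5))
y = X @ w_true + 0.01 * rng.normal(size=4000)

w = np.zeros(5)                    # shared parameters, updated without any lock

def hogwild_worker(rows, lr=0.01):
    for i in rows:
        # Each thread reads the current (possibly stale) w, computes a gradient
        # on one example, and writes the update back without synchronization.
        grad = (X[i] @ w - y[i]) * X[i]
        w[:] = w - lr * grad

chunks = np.array_split(np.arange(len(X)), 4)
threads = [threading.Thread(target=hogwild_worker, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()
print("parameter error:", np.linalg.norm(w - w_true))
```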