A3C: Asynchronous Advantage Actor-Critic
25 Jan 2019
Reading Group: Asynchronous Methods for Deep Reinforcement Learning (Mnih et al.)
Motivation
The sequence of data an online RL agent observes is non-stationary, and consecutive online updates are strongly correlated. Storing the agent's experience in a replay memory lets updates be batched or randomly sampled across different time steps, which decorrelates the updates and smooths over the non-stationarity.
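For reference, here is a minimal sketch of what such a replay memory looks like; the `ReplayBuffer` class name, its `capacity` argument, and the transition tuple layout are illustrative choices, not taken from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # Oldest transitions are evicted once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling across time steps is what breaks the
        # correlation between consecutive online updates.
        return random.sample(self.buffer, batch_size)
```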
Drawbacks of Experience Replay
- Uses more memory and computation per interaction
- Requires off-policy learning algorithms that can update from data generated by an older policy
Asynchronous RL Framework
The paper presents multi-threaded asynchronous variants of:
- One-step SARSA
- One-step Q-learning
- N-step Q-learning
- Advantage actor-critic (A3C)
Key insight: actor-critic is an on-policy policy-search method, while Q-learning is an off-policy value-based method. Running multiple agents in parallel on different threads provides diverse, decorrelated experience, so on-policy methods can be trained stably without a replay memory.
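To make the advantage actor-critic update concrete, here is a minimal sketch of the n-step return and advantage computation each actor-learner thread performs on its short rollout before forming gradients; the function names and example numbers are illustrative assumptions, not code from the paper:

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns R_t = r_t + gamma * R_{t+1} over a rollout,
    bootstrapping from the critic's estimate of the state reached at the
    end of the rollout (use 0.0 if the episode terminated)."""
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return returns[::-1]

def advantages(returns, values):
    """Advantage estimates A(s_t, a_t) = R_t - V(s_t): the actor's
    policy-gradient term is scaled by these, while the critic is
    regressed toward the returns R_t."""
    return [R - v for R, v in zip(returns, values)]

# Example: a 3-step rollout with rewards [1, 0, 1], the critic's value
# estimates for the three visited states, and a bootstrap value for the
# state reached at the end of the rollout.
rets = n_step_returns([1.0, 0.0, 1.0], bootstrap_value=0.5)
advs = advantages(rets, values=[0.4, 0.6, 0.3])
```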
Key Benefits
- Decorrelated updates: Different threads explore different parts of the environment
- No replay memory needed: Enables on-policy methods like actor-critic
- CPU-friendly: Runs on multi-core CPUs rather than requiring GPUs
Related Work
- Gorila Framework: Distributed RL with parameter servers
- Hogwild!: Lock-free parallel SGD
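As a toy illustration of the Hogwild!-style idea that A3C borrows (threads updating shared parameters without locks), here is a sketch on a linear-regression objective; the problem setup, the `worker` function, and the hyperparameters are invented for the example, and CPython's GIL means this only approximates true lock-free parallelism:

```python
import threading
import numpy as np

def worker(theta, data, lr=0.01, steps=2000, seed=0):
    """One thread: repeatedly sample a data point, compute the gradient of
    0.5 * (theta . x - y)^2, and apply it to the shared parameter vector
    in place, with no locking around the update."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x, y = data[rng.integers(len(data))]
        grad = (theta @ x - y) * x
        theta -= lr * grad  # in-place update on the shared array, no lock

# Toy regression problem shared by all threads.
rng = np.random.default_rng(42)
true_w = rng.normal(size=10)
data = [(x, float(x @ true_w)) for x in rng.normal(size=(200, 10))]

theta = np.zeros(10)  # shared parameters, analogous to A3C's shared network
threads = [threading.Thread(target=worker, args=(theta, data), kwargs={"seed": i})
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```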