Gut Instinct: Citizen Science and Online Learning

Gut Instinct: Creating Scientific Theories with Online Learners
Our intuition is that scientific crowdsourcing will most usefully contribute to domains where the science is nascent and/or highly contextual; the human microbiome is both. This paper explores the potential of coupling online citizen science with learning materials to create scientific questions. Citizen science already has a track record here: Foldit players, for example, discovered protein structures that helped scientists understand how the AIDS virus reproduces. The main contribution of this paper is demonstrating that a crowd of online non-expert learners can collaboratively perform useful scientific work. Gut Instinct brings together learners to perform useful collaborative brainstorming on a citizen science project while developing expertise. Aggregating many people's responses can produce faster, better, and more reliable results, at much larger scale, than lone individuals can, at least when errors and biases are independent events. The novel contribution is an explicit integration of learning, built around three hypotheses:

  • Learning improves the quality of work on relevant problems.
  • Working on relevant real-world problems improves learning.
  • Working while learning improves learners' engagement with the learning material.

A3C: Asynchronous Advantage Actor-Critic

Reading Group: Asynchronous Methods for Deep Reinforcement Learning (Mnih et al.)

Motivation

The sequence of observations encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's experience in a replay memory, the data can be batched or randomly sampled from different time steps, which reduces non-stationarity and decorrelates the updates.
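
For context, a replay memory is essentially a bounded buffer of transitions sampled uniformly at random. A minimal sketch (the class and method names are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay memory (illustrative sketch)."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are evicted once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling mixes transitions from many different
        # time steps, breaking the temporal correlation of online data.
        indices = random.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in indices]
```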

Drawbacks of Experience Replay

  • Uses more memory and computation per interaction
  • Requires off-policy learning algorithms

Asynchronous RL Framework

The paper presents multi-threaded asynchronous variants of:

  • One-step SARSA
  • One-step Q-learning
  • N-step Q-learning
  • Advantage actor-critic (A3C)

Key insight: Actor-critic is an on-policy policy-search method, while Q-learning is an off-policy value-based method. Running multiple agents in parallel on different threads provides diverse, decorrelated experience without a replay memory, which makes it possible to train on-policy methods like SARSA and actor-critic as well as off-policy methods like Q-learning.
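
The overall structure can be sketched roughly as below; `make_env` and `compute_gradient` are placeholders for an environment factory and for whichever per-algorithm update rule is plugged in (one-step SARSA, one-step or n-step Q-learning, or A3C), not names from the paper:

```python
import threading
import numpy as np

def actor_learner(worker_id, shared_params, make_env, compute_gradient,
                  lr=1e-4, total_steps=10_000):
    """One actor-learner thread (illustrative sketch).

    Each thread interacts with its own environment instance and applies
    lock-free, Hogwild!-style updates to the shared parameter vector.
    """
    env = make_env(seed=worker_id)   # different seed per thread -> diverse experience
    state = env.reset()
    for _ in range(total_steps):
        grad, state = compute_gradient(shared_params, env, state)
        shared_params -= lr * grad   # no lock: occasional overlapping writes are tolerated

def train(make_env, compute_gradient, param_dim, num_workers=16):
    shared_params = np.zeros(param_dim, dtype=np.float32)
    threads = [
        threading.Thread(target=actor_learner,
                         args=(i, shared_params, make_env, compute_gradient))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_params
```

Because every thread is exploring a different part of the environment at any given time, the updates arriving at the shared parameters are far less correlated than a single agent's online stream, which is what lets the method drop the replay memory.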

Key Benefits

  • Decorrelated updates: Different threads explore different parts of the environment
  • No replay memory needed: Enables on-policy methods like actor-critic
  • CPU-friendly: Runs on multi-core CPUs rather than requiring GPUs

Related Work

  • Gorila Framework: Distributed RL with parameter servers
  • Hogwild!: Lock-free parallel SGD

Actor-Critic Algorithms: Implementation Notes

Implementing and understanding actor-critic algorithms

Overview

Actor-critic methods combine the benefits of policy gradient methods (the actor) with value function approximation (the critic). The actor learns a policy, while the critic evaluates states to reduce variance in policy gradient estimates.
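
The variance claim can be made concrete: subtracting any state-dependent baseline \(b(s)\) from the return leaves the policy gradient unbiased, and using the critic \(V_\phi(s) \approx V^\pi(s)\) as that baseline turns the weighting term into (an estimate of) the advantage, which typically has far smaller magnitude than the raw return:

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,\big(Q^{\pi}(s,a) - b(s)\big) \right],
\qquad b(s) = V_\phi(s).
\]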

The Algorithm

Following Sergey Levine's lecture, the batch actor-critic algorithm works as follows:

Batch Actor-Critic Algorithm

  • Sample trajectories \(\{s_i, a_i\}\) by running the current policy \(\pi_\theta(a|s)\)
  • Fit the value function \(V_\phi(s)\) to the sampled rewards-to-go
  • Evaluate the advantages \(\hat{A}(s_i, a_i) = r(s_i, a_i) + \gamma V_\phi(s_i') - V_\phi(s_i)\)
  • Estimate the policy gradient \(\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{A}(s_i, a_i)\)
  • Update the parameters: \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)

Key Components

Actor: The policy network \(\pi_\theta(a|s)\) that maps states to action distributions.

Critic: The value network \(V_\phi(s)\) that estimates expected returns from a state.

Advantage: The advantage function \(A(s,a) = Q(s,a) - V(s)\) tells us how much better an action is compared to the average.

Why Actor-Critic?

  • Lower variance: Using a learned baseline (critic) reduces variance compared to REINFORCE
  • Online learning: Can update after each step, not just at episode end
  • Continuous actions: Works well with continuous action spaces

Implementation: actorCritic.py
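
The contents of actorCritic.py are not reproduced here; the following is a minimal sketch of what a batch actor-critic training step can look like, assuming PyTorch and a Gymnasium-style discrete-action environment (all class names, function names, and hyperparameters are illustrative, not taken from actorCritic.py):

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Tiny shared-trunk actor-critic network (illustrative sketch)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def run_episode(env, model):
    """Roll out one episode with the current policy; assumes a
    Gymnasium-style API (reset() -> (obs, info), step() -> 5-tuple)."""
    obs, _ = env.reset()
    log_probs, values, rewards, done = [], [], [], False
    while not done:
        logits, value = model(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        values.append(value)
        rewards.append(float(reward))
    return log_probs, values, rewards

def update(model, optimizer, log_probs, values, rewards, gamma=0.99):
    """One batch actor-critic update from a single episode."""
    returns, R = [], 0.0
    for r in reversed(rewards):                       # discounted reward-to-go
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    values, log_probs = torch.stack(values), torch.stack(log_probs)
    advantages = returns - values.detach()            # A_t ~= R_t - V(s_t)
    actor_loss = -(log_probs * advantages).mean()     # policy gradient step
    critic_loss = (returns - values).pow(2).mean()    # fit V(s) to reward-to-go
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A training loop would simply alternate run_episode and update, for example with torch.optim.Adam(model.parameters(), lr=3e-4) as the optimizer.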