Program2Tutor: Automatic Curriculum Generation with Bandits

Reading: Program2Tutor by Tong Mu, Karan Goel, and Emma Brunskill
Problem

Determining topic relationships and statistically modeling student learning is a complex process.

Prior Work

  • A prerequisite (knowledge) graph of the related material can be constructed automatically, given an algorithmic representation of the underlying material.
  • A student can be progressed automatically through a knowledge graph with minimal assumptions about a model of the underlying student learning process.

Contribution

  • Preliminary work on uniting the ideas of automatically generating a knowledge graph and progressing a student through it while learning.
  • A novel approach for identifying the initial background knowledge of the learner.

In this work, the authors unify the ideas of automatic curriculum generation from execution traces and automatic problem selection using reinforcement learning techniques. Specifically, they use a multi-armed bandit algorithm for problem selection, ZPDES (Clement et al., 2015), as it is less reliant than other methods on the underlying student learning model, which can be advantageous. Additionally, the authors build on prior work in probabilistically detecting the knowledge boundary of the student and present a novel approach for determining the student's initial knowledge state within the curriculum.

Automatic Curriculum Generation from Execution Traces

Let \( n \) be any positive integer. We say that a trace \( T_{1} \) is at least as complex as trace \( T_{2} \) if every n-gram of trace \( T_{2} \) is also present in trace \( T_{1} \).

Progressing Students Using Multi-Armed Bandits

At each timestep, this algorithm selects a problem from the set of problem types on the boundary of the student's knowledge, the zone of proximal development (ZPD), that it predicts will give the most reward, measured in terms of student learning progress. On initialization, all problems start with an initial unnormalized weight \( w_{a} = w_{i} \). The weights \( w_{a} \) of the problems in the ZPD are normalized to form a proper probability distribution, denoted \( w_{a}^{n} \), and problem \( a \in ZPD \) is selected with probability

$$ p_{a} = w_{a}^{n}(1 - \gamma) + \gamma \frac{1}{|ZPD|} $$

Once a problem is selected, it is presented to the student, and the correctness of the student's answer the \( i^{th} \) time the problem is presented is recorded as \( C_{i} \).

References

B. Clement, D. Roy, P.-Y. Oudeyer, and M. Lopes. Multi-armed bandits for intelligent tutoring systems. JEDM - Journal of Educational Data Mining, 7, 2015.
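For concreteness, here is a minimal Python sketch of the n-gram trace-complexity check and the ZPDES-style selection rule described above. The function names, example problem types, and values of \( w_{i} \) and \( \gamma \) are illustrative assumptions, not taken from the paper.

```python
import random


def ngrams(trace, n):
    """All length-n contiguous subsequences of a trace (a list of tokens)."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def at_least_as_complex(t1, t2, n):
    """True if every n-gram of trace t2 also appears in trace t1."""
    return ngrams(t2, n) <= ngrams(t1, n)


def select_problem(weights, zpd, gamma=0.1):
    """Sample a problem type from the ZPD.

    weights: dict mapping problem type -> unnormalized weight w_a
    zpd:     problem types currently on the student's knowledge boundary
    gamma:   exploration rate; a gamma fraction of the probability mass is uniform
    """
    total = sum(weights[a] for a in zpd)
    probs = [(weights[a] / total) * (1 - gamma) + gamma / len(zpd) for a in zpd]
    return random.choices(zpd, weights=probs, k=1)[0]


# Example: three hypothetical problem types in the ZPD, all starting from the same w_i.
w_i = 1.0
weights = {"add": w_i, "subtract": w_i, "multiply": w_i}
print(select_problem(weights, zpd=["add", "subtract", "multiply"], gamma=0.1))
```

The update of the weights \( w_{a} \) from the learning-progress reward is omitted here, since the excerpt above does not spell it out.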

RL Lecture Notes: Sequential Decision Making

Reinforcement Learning - Lecture 1 - Emma Brunskill

Learning to Make Good Sequential Decisions

Key aspects of reinforcement learning:

  • Optimization: Finding policies that maximize expected reward
  • Exploration: Balancing exploration vs exploitation
  • Generalization: Transferring knowledge to new situations
  • Delayed Consequences: Decisions have long-term ramifications

Challenges

When planning: Decisions involve reasoning about not just immediate benefit but also longer-term ramifications.

When learning: Temporal credit assignment is hard - what caused later high or low rewards?

Policy

A policy is a mapping from past experience to action.
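In symbols (notation mine, matching the Markov-assumption section below): a policy maps the history \( h_t \) of past states, actions, and rewards to the next action,

$$ \pi(h_t) = a_t $$

and under the Markov assumption a state-based policy \( \pi(s_t) = a_t \) suffices.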

Comparison with Other Paradigms

Supervised Learning: Typically making one decision instead of a sequence of decisions.

Imitation Learning: Learns from the experience of others; assumes access to demonstrations of good policies. Imitation + RL seems promising.

Sequential Decision Making Under Uncertainty

Goal: Maximize the total expected future reward (the world is stochastic so the agent maximizes rewards in expectation).
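One standard way to write this objective (the horizon \( H \) and the notation are mine; the lecture statement is just the sentence above):

$$ \max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{H} r_{t} \right] $$

where \( r_t \) is the reward at time \( t \) and the expectation is taken over the stochastic dynamics induced by following policy \( \pi \).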

The Markov Assumption

State \(s_t\) is Markov if and only if:

$$p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_{t}, a_{t})$$

The current state is a sufficient statistic of history.

Problem Variants

  • Finite horizon vs. infinite horizon
  • Stationary vs. non-stationary

Connectionist Temporal Classification (CTC)

Labelling Unsegmented Sequence Data with Recurrent Neural Networks (Graves et al.)

The Problem

When training models for sequence-to-sequence tasks like speech recognition or handwriting recognition, we often face a challenge: the input and output sequences have different lengths, and we don't know the alignment between them.

For example, in speech recognition:

  • Input: Audio frames (e.g., 1000 frames)
  • Output: Text ("hello" - 5 characters)

Traditional approaches require frame-level alignments, which are expensive to obtain.

CTC Solution

CTC introduces a blank token (often denoted \(\epsilon\)) and defines a many-to-one mapping from frame-level predictions to output sequences. Multiple frame-level paths can map to the same output.

For example, these all map to "cat":

  • \(\epsilon c a a t \epsilon\)
  • \(c c \epsilon a t t\)
  • \(\epsilon \epsilon c a a t\)
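The mapping (written \(\mathcal{B}\) below) merges consecutive repeated labels and then removes blanks. A minimal Python sketch, with the function name and the use of "ε" as a string token chosen for illustration:

```python
def ctc_collapse(path, blank="ε"):
    """Apply the CTC many-to-one mapping: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for token in path:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)


# All of these frame-level paths collapse to "cat".
for path in [["ε", "c", "a", "a", "t", "ε"],
             ["c", "c", "ε", "a", "t", "t"],
             ["ε", "ε", "c", "a", "a", "t"]]:
    assert ctc_collapse(path) == "cat"
```

Note that a blank between two identical labels keeps them separate, so ["a", "ε", "a"] collapses to "aa" rather than "a".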

The CTC Loss

The CTC loss is the negative log probability of the correct output sequence, marginalized over all possible alignments:

\[ L_{CTC} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi | x) \]

where \(\mathcal{B}^{-1}(y)\) is the set of all paths that map to output \(y\).
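The set \(\mathcal{B}^{-1}(y)\) has exponentially many paths, but the sum can be computed with a dynamic-programming recursion (the forward half of the forward-backward algorithm mentioned below). Here is a minimal NumPy sketch of that recursion, assuming label index 0 is the blank and `log_probs` has shape (T, C); the names and conventions are mine, not from the paper.

```python
import numpy as np


def ctc_loss(log_probs, target, blank=0):
    """Negative log P(target | x) via the CTC forward algorithm.

    log_probs: array of shape (T, C), per-frame log-probabilities over C labels.
    target:    list of label indices with no blanks, e.g. [3, 1, 20] for "cat".
    """
    T = log_probs.shape[0]
    # Extended target: a blank around every label -> length S = 2U + 1.
    ext = [blank]
    for label in target:
        ext.extend([label, blank])
    S = len(ext)

    # alpha[s] = log-probability of all path prefixes over frames 0..t
    # that end in extended position s.
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]

    for t in range(1, T):
        new_alpha = np.full(S, -np.inf)
        for s in range(S):
            candidates = [alpha[s]]
            if s - 1 >= 0:
                candidates.append(alpha[s - 1])
            # Skipping over the previous blank is allowed unless it would
            # merge two identical labels.
            if s - 2 >= 0 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]
        alpha = new_alpha

    # Valid paths end in the final label or the final blank.
    return -np.logaddexp(alpha[S - 1], alpha[S - 2] if S > 1 else -np.inf)
```

In practice one would use a library implementation with gradients, such as PyTorch's `torch.nn.CTCLoss`, rather than this sketch.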

Key Properties

  • Alignment-free: No need for pre-segmented training data
  • Differentiable: Uses dynamic programming (forward-backward algorithm) for efficient computation
  • Monotonic: Assumes input-output alignment is monotonic (left-to-right)
