Automatic Curriculum Generation with Multi-Armed Bandits

This is a reading of Program2Tutor by Tong Mu, Karan Goel, and Emma Brunskill.

Problem: Determining topic relationships and statistically modeling student learning is a complex process.

Prior Work:
  1. A prerequisite (or knowledge) graph of the related material can be automatically constructed given an algorithmic representation of the underlying material.
  2. A student can be automatically progressed through a knowledge graph with minimal assumptions about a model of the underlying student learning process.

Contribution:
  1. Preliminary work on uniting the ideas of automatically generating a knowledge graph and progressing a student through it while they learn.
  2. A novel approach for identifying the initial background knowledge of the learner.

In this work, the authors unify the ideas of automatic curriculum generation from execution traces and automatic problem selection using reinforcement learning techniques. Specifically, they use a multi-armed bandit algorithm for problem selection (ZPDES) because it relies less than other methods on an underlying student learning model, which can be advantageous. Additionally, the authors build on prior work in probabilistically detecting the knowledge boundary of the student and present a novel approach for determining the initial knowledge state of the student within the curriculum.

Automatic Curriculum Generation from Execution Traces

Let \( n \) be any positive integer. We say that a trace \( T_{1} \) is at least as complex as a trace \( T_{2} \) if every n-gram of trace \( T_{2} \) is also present in trace \( T_{1} \). (A sketch of this check appears after the references below.)

Progressing Students Using Multi-Armed Bandits

At each timestep, this algorithm selects the problem, from within the set of problem types on the boundary of the student's knowledge (the zone of proximal development, or ZPD), that it predicts will give the most reward, measured in terms of student learning progress. On initialization, all problems start with the same unnormalized weight \( w_{a} = w_{i} \). The weights \( w_{a} \) of the problems in the ZPD are normalized to form a proper probability distribution, denoted \( w_{a}^{n} \), and problem \( a \in ZPD \) is selected with probability \( p_{a} = w_{a}^{n}(1 - \gamma) + \gamma \frac{1}{|ZPD|} \). Once a problem is selected, it is presented to the student, and the correctness of the student's answer the \( i^{th} \) time the problem is presented is recorded as \( C_{i} \). (A sketch of this selection rule also follows the references below.)

References

B. Clement, D. Roy, P.-Y. Oudeyer, and M. Lopes. Multi-armed bandits for intelligent tutoring systems. JEDM - Journal of Educational Data Mining, 7(2), 2015.
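Below is a minimal sketch of the n-gram containment check that defines the complexity ordering on execution traces. The trace representation (a list of hashable tokens such as executed-statement identifiers), the example traces, and the choice of \( n \) are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: "T1 is at least as complex as T2" if every n-gram of T2
# also appears in T1.  Traces are assumed to be sequences of hashable
# tokens (e.g., executed statement ids); the example data below is made up.

def ngrams(trace, n):
    """Return the set of contiguous n-grams of a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def at_least_as_complex(t1, t2, n):
    """True if every n-gram of t2 is also present in t1."""
    return ngrams(t2, n) <= ngrams(t1, n)

# t1 contains every bigram of t2, so t1 is at least as complex as t2.
t1 = ["load", "loop", "add", "loop", "add", "store"]
t2 = ["load", "loop", "add", "store"]
print(at_least_as_complex(t1, t2, n=2))  # True
```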

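The following is a hedged sketch of the problem-selection step described above, implementing only the mixing rule \( p_{a} = w_{a}^{n}(1 - \gamma) + \gamma \frac{1}{|ZPD|} \). The problem names, the initial weight value, and \( \gamma \) are illustrative assumptions, and the weight-update step of the full ZPDES algorithm is omitted.

```python
import random

def selection_probabilities(weights, gamma):
    """Mix the normalized ZPD weights with a uniform distribution over the ZPD."""
    total = sum(weights.values())
    return {a: (w / total) * (1 - gamma) + gamma / len(weights)
            for a, w in weights.items()}

def select_problem(weights, gamma, rng=random):
    """Sample a problem a in the ZPD with probability p_a."""
    probs = selection_probabilities(weights, gamma)
    problems, p = zip(*probs.items())
    return rng.choices(problems, weights=p, k=1)[0]

# Problems currently in the student's ZPD, all starting at the same
# initial weight w_i; gamma forces a minimum amount of exploration.
zpd_weights = {"loops-1": 1.0, "loops-2": 1.0, "conditionals-1": 1.0}
print(select_problem(zpd_weights, gamma=0.2))
```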
RL Lecture 1

Reinforcement Learning - Lecture 1 - Emma Brunskill

Reinforcement learning is about learning to make good sequential decisions:
  • Learning to make good decisions -- there is some notion of an optimality/utility measure of the decisions being made.
  • Making good decisions under uncertainty.
  • Delayed consequences: when planning, decisions involve reasoning about not just the immediate benefit of a decision but also its longer-term ramifications; when learning, temporal credit assignment is hard (what caused the later high or low rewards?).

A policy is a mapping from past experience to an action.

RL involves optimization, exploration, generalization, and delayed consequences. Supervised learning typically makes one decision instead of a sequence of decisions. Imitation learning involves optimization, generalization, and delayed consequences: it learns from the experience of others and assumes demonstrations of good policies as input. Imitation + reinforcement learning seems promising.

How do we proceed? Explore the world / use experience to guide future decisions.

Questions:
  1. Where do these rewards come from? What happens if we get the wrong kind of rewards?
  2. Robustness / risk sensitivity.

Sequential Decision Making Under Uncertainty

Maximize the total expected future reward -- the world is stochastic, so the agent maximizes rewards in expectation (a minimal sketch of this return appears at the end of these notes). Example: a teaching agent chooses a teaching activity, and the reward is the student's performance on that particular activity. In machine teaching, the environment is aware that the agent is trying to teach it something and therefore acts in a cooperative way; it could also act in an adversarial way.

The famous Markov assumption: the state is a sufficient statistic of the history. State $$s_t$$ is Markov if and only if: $$p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_{t}, a_{t})$$

Finite horizon vs. infinite horizon. Stationary vs. non-stationary.
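As a small illustration of the objective above, here is a sketch of the return an agent tries to maximize in expectation over trajectories. The reward sequence and discount factor are made-up values, and discounting is an assumption on my part; the lecture notes above only mention the total expected future reward.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t along one sampled trajectory of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# One sampled trajectory of rewards, e.g. a student's performance on a
# sequence of activities chosen by a teaching agent.
print(discounted_return([0.0, 1.0, 0.0, 1.0], gamma=0.9))  # 1.629
```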

CTC Loss

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks by Graves et al.
Useful links:
  • https://distill.pub/2017/ctc/
  • https://stats.stackexchange.com/questions/320868/what-is-connectionist-temporal-classification-ctc
  • https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c