Sequence to Sequence Learning with Neural Networks

Using LSTMs for general sequence-to-sequence problems

Deep neural networks are powerful machine learning models that achieve excellent performance on difficult problems. If there exists a parameter setting of a large DNN that achieves good results, supervised backpropagation will find these parameters and solve the problem.

The Problem

Many important problems are best expressed with sequences whose lengths are not known a priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can be seen as mapping a sequence of words representing the question to a sequence of words representing the answer.

The Approach

The core idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then use another LSTM to extract the output sequence from that vector.

The second LSTM is essentially a recurrent neural network language model, except that it is conditioned on the input sequence.
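
As a rough illustration of this encoder-decoder idea, here is a minimal single-layer sketch in PyTorch; the embedding and hidden sizes are illustrative assumptions rather than the paper's configuration (the paper uses deep LSTMs and reverses the source sentence, which this sketch omits):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM compresses the source into a fixed-size state;
    decoder LSTM generates the target conditioned on that state."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Read the whole input sequence; keep only the final (h, c) state.
        _, state = self.encoder(self.src_emb(src_tokens))
        # The decoder acts as a conditional language model: it starts from the
        # encoder's final state and predicts the next target token at each step.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), state)
        return self.out(dec_out)  # logits over the target vocabulary
```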

Key Properties

  • The LSTM learns to map an input sequence of variable length into a fixed-dimensional vector representation
  • The encoder-decoder architecture separates input processing from output generation
  • Attention mechanisms (in later work) address limitations of fixed-size representations

Related Work

  • Kalchbrenner and Blunsom, "Recurrent continuous translation models" (EMNLP 2013)
  • Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation" (EMNLP 2014)

Beam Search Algorithm

A heuristic search algorithm for finding high-scoring sequences without exhaustive search

Overview

Beam search is a heuristic search algorithm commonly used in sequence-to-sequence models for tasks like machine translation, speech recognition, and text generation. It explores a graph by expanding the most promising nodes while keeping only a limited number of candidates at each level.

How It Works

Beam search uses breadth-first search to build its search tree. At each level:

  1. Generate all successors of the states at the current level
  2. Sort them in increasing order of heuristic cost
  3. Keep only the top \(\beta\) states (the beam width)
  4. Expand only those states in the next iteration

The greater the beam width, the fewer states are pruned. With \(\beta = 1\), beam search becomes greedy search. With \(\beta = \infty\), it becomes exhaustive breadth-first search.
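
A compact sketch of this loop for a left-to-right sequence model. Here `log_prob_fn` is a hypothetical scoring function (not defined above) that returns next-token log-probabilities for a given prefix; since summed log-probabilities act as a negated cost, the top-\(\beta\) candidates are the highest-scoring ones:

```python
import heapq

def beam_search(log_prob_fn, start_token, end_token, beam_width=4, max_len=50):
    """log_prob_fn(prefix) -> dict of {next_token: log_probability} (assumed interface)."""
    beams = [(0.0, [start_token])]            # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:          # keep finished hypotheses unchanged
                candidates.append((score, seq))
                continue
            for token, logp in log_prob_fn(seq).items():   # 1. generate successors
                candidates.append((score + logp, seq + [token]))
        # 2.-3. rank candidates and keep only the top beam_width (prune the rest)
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq[-1] == end_token for _, seq in beams):  # every hypothesis finished
            break
    return max(beams, key=lambda c: c[0])[1]  # best-scoring sequence found
```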

Why Use Beam Search?

Beam search keeps search tractable in large systems where there is not enough memory to store the entire search tree. It provides a balance between:

  • Greedy search: Fast but may miss optimal solutions
  • Exhaustive search: Optimal but computationally infeasible

Applications

  • Neural Machine Translation: Finding the most likely translation
  • Speech Recognition: Decoding audio to text
  • Image Captioning: Generating descriptions for images
  • Text Generation: Producing coherent sequences in LLMs

Limitations

Beam search is not guaranteed to find the optimal solution since it prunes the search space. The choice of beam width \(\beta\) is a trade-off between search quality and computational cost.

Multi-Armed Bandits

Balancing exploration and exploitation in sequential decision making

The Problem

Imagine you're in a casino with multiple slot machines (bandits), each with an unknown probability of payout. How do you maximize your total reward over time? You need to balance:

  • Exploration: Trying different machines to learn their payouts
  • Exploitation: Playing the machine you believe has the highest payout

Epsilon-Greedy

The simplest approach: with probability \(\epsilon\), explore randomly; otherwise, exploit the best known action.

  • Advantage: simple to implement
  • Drawback: explores uniformly at random, even among clearly suboptimal actions
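
A minimal sketch of \(\epsilon\)-greedy with sample-average value estimates (the class and variable names are illustrative):

```python
import random

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms     # number of pulls per arm
        self.values = [0.0] * n_arms   # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))                       # explore
        return max(range(len(self.values)), key=lambda a: self.values[a])   # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the sample mean for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```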

Upper Confidence Bounds (UCB)

UCB measures the potential of an action \(a\) by an upper confidence bound \(\hat{U}_{t}(a)\) on its estimated reward, chosen so that the true value lies below the bound with high probability:

\[ Q(a) \le \hat{Q}_{t}(a) + \hat{U}_{t}(a) \]

The UCB algorithm always selects the action that maximizes this upper confidence bound:

\[ a_{t}^{UCB} = \arg\max_{a \in A} \left( \hat{Q}_{t}(a) + \hat{U}_{t}(a) \right) \]

The confidence bound \(\hat{U}_{t}(a)\) typically decreases as we observe more samples of action \(a\), following the principle of "optimism in the face of uncertainty."
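
As a concrete instance, the classic UCB1 rule uses the Hoeffding-based bound \(\hat{U}_{t}(a) = \sqrt{2 \ln t / N_{t}(a)}\), where \(N_{t}(a)\) is the number of times action \(a\) has been played; this specific bound is the standard UCB1 choice rather than something fixed above. A minimal sketch:

```python
import math

def ucb1_select(values, counts, t):
    """values[a]: mean reward of arm a; counts[a]: pulls of arm a; t: total pulls so far."""
    # Play each arm once before the bound is well defined.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    # Optimism in the face of uncertainty: mean estimate plus confidence radius,
    # which shrinks as an arm is sampled more often.
    return max(range(len(values)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
```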

Thompson Sampling

A Bayesian approach that maintains a probability distribution over the expected reward of each action:

  1. Sample a reward estimate from each action's posterior distribution
  2. Select the action with the highest sampled value
  3. Update the posterior based on the observed reward

Thompson Sampling often achieves near-optimal regret bounds while being simple to implement.
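
A minimal sketch for Bernoulli rewards with Beta posteriors (the Beta(1, 1) priors are an illustrative assumption):

```python
import random

class BetaBernoulliTS:
    def __init__(self, n_arms):
        # Beta(1, 1) prior on each arm's success probability.
        self.alpha = [1] * n_arms
        self.beta = [1] * n_arms

    def select_arm(self):
        # 1.-2. Sample an estimate from each posterior and pick the largest.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        # 3. Conjugate posterior update for a Bernoulli reward in {0, 1}.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```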

Regret

We measure performance by regret: the difference between the expected reward we would have obtained by always playing the optimal action and the reward we actually received:

\[ R_T = T \cdot \mu^* - \sum_{t=1}^{T} r_t \]

where \(\mu^*\) is the expected reward of the best action.
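
A small worked example of this formula on a hypothetical run (the rewards and \(\mu^*\) below are made up for illustration):

```python
def cumulative_regret(rewards, mu_star):
    """R_T = T * mu_star - sum of observed rewards."""
    return len(rewards) * mu_star - sum(rewards)

# Hypothetical run of 5 pulls against a best arm with expected reward 0.9:
# 5 * 0.9 - (1 + 0 + 1 + 1 + 0) = 4.5 - 3 = 1.5
print(cumulative_regret([1, 0, 1, 1, 0], mu_star=0.9))
```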