Connectionist Temporal Classification (CTC)

Labelling Unsegmented Sequence Data with Recurrent Neural Networks (Graves et al., 2006)

The Problem

When training models for sequence-to-sequence tasks like speech recognition or handwriting recognition, we often face a challenge: the input and output sequences have different lengths, and we don't know the alignment between them.

For example, in speech recognition:

  • Input: Audio frames (e.g., 1000 frames)
  • Output: Text ("hello" - 5 characters)

Traditional approaches require frame-level alignments, which are expensive to obtain.

CTC Solution

CTC introduces a blank token (often denoted \(\epsilon\)) and defines a many-to-one collapsing map \(\mathcal{B}\) from frame-level paths to output sequences: first merge repeated symbols, then remove blanks. Multiple frame-level paths therefore map to the same output. The blank also makes repeated output characters representable (like the "ll" in "hello"), since a genuine repeat must be separated by a blank to survive the merge step.

For example, with \(T = 6\) input frames, these paths all map to "cat":

  • \(\epsilon c a a t \epsilon\)
  • \(c c \epsilon a t t\)
  • \(\epsilon \epsilon c a t t\)
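The collapse map \(\mathcal{B}\) is short enough to sketch directly (a minimal illustration; the function name `collapse` and the blank marker are ours, not the paper's):

```python
BLANK = "ε"  # illustrative blank symbol

def collapse(path):
    """Apply B: merge repeated symbols, then remove blanks."""
    out, prev = [], None
    for sym in path:
        # keep a symbol only if it differs from its predecessor and is not blank
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(collapse("εcaatε"))  # → cat
print(collapse("ccεatt"))  # → cat
```

Note that a blank between two identical labels keeps them distinct: `collapse("caεat")` yields "caat", not "cat". This is why repeated characters in the target require an intervening blank in every valid path.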

The CTC Loss

The CTC loss is the negative log probability of the correct output sequence, marginalized over all possible alignments:

\[ L_{CTC} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi | x) \]

where \(\mathcal{B}^{-1}(y)\) is the set of all paths that map to output \(y\).
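For a toy problem, the sum over \(\mathcal{B}^{-1}(y)\) can be evaluated by brute force, which makes the definition concrete (illustrative only — the path count grows as \(|\text{alphabet}|^T\), so practical implementations rely on dynamic programming; the names and the tiny alphabet here are our assumptions):

```python
import itertools
import math

BLANK = "ε"
ALPHABET = ["a", "b", BLANK]  # toy alphabet for illustration

def collapse(path):
    # B: merge repeated symbols, then remove blanks
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def ctc_loss_bruteforce(probs, target):
    # probs[t][symbol] = P(symbol at frame t); frames assumed independent
    total = 0.0
    for path in itertools.product(ALPHABET, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]
            total += p  # marginalize over all paths mapping to target
    return -math.log(total)

# Uniform predictions over 3 symbols at each of T = 3 frames:
probs = [{s: 1.0 / 3 for s in ALPHABET}] * 3
loss = ctc_loss_bruteforce(probs, "ab")
```

With \(T = 3\) and target "ab", exactly five of the 27 paths survive the collapse ("aab", "abb", "εab", "aεb", "abε"), so the loss is \(-\log(5/27)\).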

Key Properties

  • Alignment-free: No need for pre-segmented training data
  • Differentiable: The loss and its gradient can be computed efficiently with dynamic programming (the forward-backward algorithm), in \(O(T \cdot |y|)\) time rather than by enumerating paths
  • Monotonic: Assumes input-output alignment is monotonic (left-to-right)
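The forward half of that dynamic program is compact enough to sketch. The recursion runs over the target extended with blanks between and around every label, allowing stay, advance-by-one, and skip-over-blank transitions (a hedged sketch under the same per-frame probability format as before; variable names are ours):

```python
import math

BLANK = "ε"

def ctc_forward(probs, target):
    # probs[t][symbol] = P(symbol at frame t); target is a string
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]          # e.g. "ab" -> [ε, a, ε, b, ε]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    # paths must start at the leading blank or the first label
    alpha[0][0] = probs[0][BLANK]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                     # stay on same state
            if s >= 1:
                a += alpha[t - 1][s - 1]            # advance one state
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]            # skip a blank between distinct labels
            alpha[t][s] = a * probs[t][ext[s]]
    # paths may end on the final blank or the final label
    total = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(total)

probs = [{s: 1.0 / 3 for s in [BLANK, "a", "b"]}] * 3
loss = ctc_forward(probs, "ab")
```

On the uniform toy example this recursion recovers the same \(-\log(5/27)\) as direct enumeration, but its cost scales with \(T \cdot |y|\) instead of exponentially in \(T\). Production code would additionally work in log-space (or with per-frame rescaling) to avoid underflow on long inputs.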

Resources