Connectionist Temporal Classification (CTC)
17 Feb 2019 · Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (Graves et al., 2006)
The Problem
When training models for sequence-to-sequence tasks like speech recognition or handwriting recognition, we often face a challenge: the input and output sequences have different lengths, and we don't know the alignment between them.
For example, in speech recognition:
- Input: Audio frames (e.g., 1000 frames)
- Output: Text ("hello" - 5 characters)
Traditional approaches require frame-level alignments, which are expensive to obtain.
CTC Solution
CTC introduces a blank token (often denoted \(\epsilon\)) and defines a many-to-one collapse mapping \(\mathcal{B}\) from frame-level paths to output sequences: merge consecutive repeated symbols, then remove blanks. Many distinct frame-level paths therefore map to the same output.
For example, these all map to "cat":
- \(\epsilon c a a t \epsilon\)
- \(c c \epsilon a t t\)
- \(\epsilon \epsilon c a a t\)
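The collapse mapping \(\mathcal{B}\) is easy to state in code. A minimal sketch (the symbols and the `blank` marker are just illustrative choices):

```python
def collapse(path, blank="ε"):
    """Apply the CTC mapping B: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# The example paths all collapse to "cat":
paths = [
    ["ε", "c", "a", "a", "t", "ε"],
    ["c", "c", "ε", "a", "t", "t"],
    ["ε", "ε", "c", "a", "a", "t"],
]
print([collapse(p) for p in paths])  # ['cat', 'cat', 'cat']
```

Note that the order of operations matters: repeats are merged *before* blanks are removed, so a genuinely doubled letter (as in "hello") can only be produced with a blank between the two frames. This is the main reason the blank token exists.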
The CTC Loss
The CTC loss is the negative log probability of the correct output sequence, marginalized over all possible alignments:
\[ L_{\text{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi \mid x) \]
where \(\mathcal{B}^{-1}(y)\) is the set of all frame-level paths \(\pi\) that collapse to the output \(y\).
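Summing over \(\mathcal{B}^{-1}(y)\) naively is exponential in the number of frames, but the sum factorizes over a blank-extended label sequence and can be computed with the forward (alpha) recursion. A minimal sketch in plain Python, assuming a `T × V` matrix of per-frame log-probabilities as input (label ids and the `blank=0` convention are illustrative):

```python
import math

def ctc_loss(log_probs, target, blank=0):
    """Negative log P(target | x), marginalized over all alignments
    via the CTC forward (alpha) recursion.
    log_probs: T x V per-frame log-probabilities; target: label ids, no blanks."""
    # Blank-extended sequence: ε l1 ε l2 ε ... lN ε
    ext = [blank]
    for l in target:
        ext += [l, blank]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s]: log-prob of all prefix paths ending at ext[s] after frame t
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                       # stay on the same symbol
            if s > 0:
                a = logsumexp(a, alpha[s - 1]) # advance by one
            # Skip a blank, unless the same label would repeat
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logsumexp(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or the trailing blank
    return -logsumexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

For example, with two frames, a uniform distribution over {blank, "a"}, and target "a", the three valid paths (aε, εa, aa) each have probability 0.25, so the loss is \(-\log 0.75\). In practice this recursion runs in log space exactly as above, since the path probabilities underflow quickly for long inputs.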
Key Properties
- Alignment-free: No need for pre-segmented training data
- Differentiable: Uses dynamic programming (forward-backward algorithm) for efficient computation
- Monotonic: Assumes input-output alignment is monotonic (left-to-right)