Connectionist Temporal Classification (CTC)

Labelling Unsegmented Sequence Data with Recurrent Neural Networks (Graves et al., 2006)

The Problem

When training models for sequence-to-sequence tasks like speech recognition or handwriting recognition, we often face a challenge: the input and output sequences have different lengths, and we don't know the alignment between them.

For example, in speech recognition:

  • Input: Audio frames (e.g., 1000 frames)
  • Output: Text ("hello" - 5 characters)

Traditional approaches require frame-level alignments, which are expensive to obtain.

CTC Solution

CTC introduces a blank token (often denoted \(\epsilon\)) and defines a many-to-one collapsing map \(\mathcal{B}\) from frame-level paths to output sequences: first merge repeated symbols, then remove blanks. Multiple frame-level paths therefore map to the same output. The blank also makes repeated output characters representable (like the "ll" in "hello"), since a genuine repeat must be separated by a blank to survive the merge step.

For example, with \(T = 6\) input frames, these paths all map to "cat":

  • \(\epsilon c a a t \epsilon\)
  • \(c c \epsilon a t t\)
  • \(\epsilon \epsilon c a t t\)
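The collapse map \(\mathcal{B}\) is short enough to sketch directly (a minimal illustration; the function name `collapse` and the blank marker are ours, not the paper's):

```python
BLANK = "ε"  # illustrative blank symbol

def collapse(path):
    """Apply B: merge repeated symbols, then remove blanks."""
    out, prev = [], None
    for sym in path:
        # keep a symbol only if it differs from its predecessor and is not blank
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(collapse("εcaatε"))  # → cat
print(collapse("ccεatt"))  # → cat
```

Note that a blank between two identical labels keeps them distinct: `collapse("caεat")` yields "caat", not "cat". This is why repeated characters in the target require an intervening blank in every valid path.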

The CTC Loss

The CTC loss is the negative log probability of the correct output sequence, marginalized over all possible alignments:

\[ L_{CTC} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi | x) \]

where \(\mathcal{B}^{-1}(y)\) is the set of all paths that map to output \(y\).
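For a toy problem, the sum over \(\mathcal{B}^{-1}(y)\) can be evaluated by brute force, which makes the definition concrete (illustrative only — the path count grows as \(|\text{alphabet}|^T\), so practical implementations rely on dynamic programming; the names and the tiny alphabet here are our assumptions):

```python
import itertools
import math

BLANK = "ε"
ALPHABET = ["a", "b", BLANK]  # toy alphabet for illustration

def collapse(path):
    # B: merge repeated symbols, then remove blanks
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def ctc_loss_bruteforce(probs, target):
    # probs[t][symbol] = P(symbol at frame t); frames assumed independent
    total = 0.0
    for path in itertools.product(ALPHABET, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]
            total += p  # marginalize over all paths mapping to target
    return -math.log(total)

# Uniform predictions over 3 symbols at each of T = 3 frames:
probs = [{s: 1.0 / 3 for s in ALPHABET}] * 3
loss = ctc_loss_bruteforce(probs, "ab")
```

With \(T = 3\) and target "ab", exactly five of the 27 paths survive the collapse ("aab", "abb", "εab", "aεb", "abε"), so the loss is \(-\log(5/27)\).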

Key Properties

  • Alignment-free: No need for pre-segmented training data
  • Differentiable: The loss and its gradient can be computed efficiently with dynamic programming (the forward-backward algorithm), in \(O(T \cdot |y|)\) time rather than by enumerating paths
  • Monotonic: Assumes input-output alignment is monotonic (left-to-right)
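The forward half of that dynamic program is compact enough to sketch. The recursion runs over the target extended with blanks between and around every label, allowing stay, advance-by-one, and skip-over-blank transitions (a hedged sketch under the same per-frame probability format as before; variable names are ours):

```python
import math

BLANK = "ε"

def ctc_forward(probs, target):
    # probs[t][symbol] = P(symbol at frame t); target is a string
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]          # e.g. "ab" -> [ε, a, ε, b, ε]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    # paths must start at the leading blank or the first label
    alpha[0][0] = probs[0][BLANK]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                     # stay on same state
            if s >= 1:
                a += alpha[t - 1][s - 1]            # advance one state
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]            # skip a blank between distinct labels
            alpha[t][s] = a * probs[t][ext[s]]
    # paths may end on the final blank or the final label
    total = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(total)

probs = [{s: 1.0 / 3 for s in [BLANK, "a", "b"]}] * 3
loss = ctc_forward(probs, "ab")
```

On the uniform toy example this recursion recovers the same \(-\log(5/27)\) as direct enumeration, but its cost scales with \(T \cdot |y|\) instead of exponentially in \(T\). Production code would additionally work in log-space (or with per-frame rescaling) to avoid underflow on long inputs.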

Resources