Solving comma.ai's Camera Calibration Challenge

Achieving 7.77% error using optical flow and median aggregation

I worked on comma.ai's camera calibration challenge, which asks you to predict camera pitch and yaw angles from dashcam video. The goal is to determine how the camera is misaligned relative to the vehicle's direction of travel.

The Key Insight

The critical realization was that camera calibration is constant per video. Unlike per-frame predictions that need smoothing, I could aggregate optical flow data across all frames and use the median as a robust calibration estimate.

The Algorithm

The solution builds on the Focus of Expansion (FOE): when a camera translates forward, the optical flow radiates outward from a single point, the FOE. If the camera is aligned with the direction of travel, the FOE sits at the image center (the principal point); if it is misaligned, the FOE shifts by an amount determined by the pitch and yaw.

Per-frame pipeline (a code sketch follows the list):

  1. Shi-Tomasi corner detection (up to 3,000 features per frame)
  2. Lucas-Kanade pyramidal optical flow tracking
  3. Forward-backward validation to filter bad tracks
  4. RANSAC-based FOE estimation (1,000 iterations)
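A minimal sketch of these four steps with OpenCV follows. The corner-detector parameters and thresholds beyond the ones listed above (forward-backward tolerance, RANSAC inlier distance) are illustrative assumptions, not the exact values from the solution.

```python
import cv2
import numpy as np

def frame_foe(prev_gray, gray, rng=np.random.default_rng(0)):
    """One frame of the pipeline: corners -> LK flow -> FB check -> RANSAC FOE."""
    # 1. Shi-Tomasi corners (up to 3,000 per frame)
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=3000,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return None
    # 2. Pyramidal Lucas-Kanade flow, previous frame -> current frame
    p1, st1, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
    # 3. Forward-backward validation: track back and drop inconsistent points
    p0r, st2, _ = cv2.calcOpticalFlowPyrLK(gray, prev_gray, p1, None)
    fb_err = np.linalg.norm(p0 - p0r, axis=2).ravel()
    good = (st1.ravel() == 1) & (st2.ravel() == 1) & (fb_err < 1.0)
    p0, p1 = p0[good].reshape(-1, 2), p1[good].reshape(-1, 2)
    if len(p0) < 2:
        return None
    # 4. RANSAC FOE: every flow vector lies on a line through the FOE,
    # so intersect random pairs of flow lines and keep the best candidate
    d = p1 - p0
    best, best_inliers = None, -1
    for _ in range(1000):
        i, j = rng.choice(len(p0), size=2, replace=False)
        A = np.column_stack([d[i], -d[j]])
        if abs(np.linalg.det(A)) < 1e-9:
            continue  # near-parallel flow lines have no stable intersection
        t = np.linalg.solve(A, p0[j] - p0[i])[0]
        c = p0[i] + t * d[i]
        # Perpendicular distance from the candidate FOE to every flow line
        dist = np.abs(d[:, 0] * (c[1] - p0[:, 1]) - d[:, 1] * (c[0] - p0[:, 0]))
        dist /= np.linalg.norm(d, axis=1) + 1e-9
        inliers = int((dist < 2.0).sum())
        if inliers > best_inliers:
            best, best_inliers = c, inliers
    return best  # (x, y) FOE estimate in pixels, or None
```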

Video-level aggregation (sketched after the list):

  • Collect FOE estimates from all frames
  • Apply median pooling to handle outliers from turns, stops, and tracking failures
  • Convert pixel offset to pitch/yaw angles
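A sketch of this aggregation, assuming pinhole intrinsics \((f_x, f_y)\) and principal point \((c_x, c_y)\) from the camera matrix; the sign conventions here are my assumption:

```python
import numpy as np

def video_calibration(foes, fx, fy, cx, cy):
    """foes: (N, 2) array of per-frame FOE estimates in pixels."""
    foe_x, foe_y = np.median(foes, axis=0)  # median pooling handles outliers
    yaw = np.arctan2(foe_x - cx, fx)        # assumed: +yaw when FOE is right of center
    pitch = np.arctan2(cy - foe_y, fy)      # assumed: +pitch when FOE is above center
    return pitch, yaw                       # radians
```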

Implementation Details

  • ROI masking: exclude the sky (top 40% of the frame) and the hood (bottom 10%)
  • Flow magnitude filtering: only keep flows of magnitude \(\geq 7.0\) pixels, since short flows give unreliable directions (both filters are sketched below)
  • RANSAC with 1,000 iterations for robust FOE estimation
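A minimal sketch of the ROI and magnitude filters, assuming per-frame point arrays `p0`, `p1` from the tracker above and frame height `h`; the parameter names and defaults are mine:

```python
import numpy as np

def keep_flow(p0, p1, h, min_mag=7.0, sky_frac=0.40, hood_frac=0.10):
    """Boolean mask over flows: inside the ROI and long enough to trust."""
    y = p0[:, 1]
    in_roi = (y > sky_frac * h) & (y < (1.0 - hood_frac) * h)  # drop sky and hood
    mag = np.linalg.norm(p1 - p0, axis=1)                      # flow length in pixels
    return in_roi & (mag >= min_mag)
```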

Results

  • Overall MSE: 0.000118
  • Error score: 7.77% (target was <25%)
  • Improvement over baseline: 92.23% error reduction

Code: github.com/rishabhranawat/calib_challenge

How Does Model Distillation Impact Circuits?

Using cross-network patching to compare circuits between GPT2-Small and DistilGPT2

This research explores how neural circuits are preserved (or transformed) during model distillation. The central question: how comparable is a circuit identified in a transformer model to the corresponding circuit in its distilled version?

The Approach: Cross-Network Patching

I introduced a technique called cross-network patching, which extends traditional activation patching to compare circuits across different model architectures. By examining GPT2-Small (85M parameters, 12 layers) and DistilGPT2 (42M parameters, 6 layers), we can investigate how information is compressed during distillation.

Task: Pronoun Resolution

We focused on the "Choosing The Right Pronoun" task, testing each model's ability to select the contextually appropriate pronoun. The primary metric is the average difference in logits between the correct and incorrect pronoun predictions.
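As a concrete sketch of the metric (assuming a TransformerLens-style setup; the prompt and pronoun pair are illustrative):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

def logit_diff(prompt, correct=" she", incorrect=" he"):
    """Margin between the correct and incorrect pronoun at the next token."""
    logits = model(prompt)   # (batch, seq, d_vocab)
    last = logits[0, -1]     # next-token logits
    return (last[model.to_single_token(correct)]
            - last[model.to_single_token(incorrect)]).item()

print(logit_diff("So Sarah is a great friend, isn't"))
```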

Methodology

  1. Direct Logit Attribution: Identify layers, heads, and attention patterns crucial for pronoun resolution in both models
  2. Cross-Model Component Mapping: Map analogous components between GPT2-Small and DistilGPT2
  3. Cross-Network Patching: Patch activations between models and measure the impact on performance (sketched below)
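A minimal sketch of step 3 using TransformerLens hooks, wired up for one head pair from the mapping reported below (GPT2-Small L10H9 → DistilGPT2 L5H1); the hook names are TransformerLens's, everything else is illustrative:

```python
from transformer_lens import HookedTransformer

gpt2 = HookedTransformer.from_pretrained("gpt2")
distil = HookedTransformer.from_pretrained("distilgpt2")

def upswap(prompt, src_layer=10, src_head=9, dst_layer=5, dst_head=1):
    """Run DistilGPT2 with one head's output replaced by GPT2-Small's."""
    # Cache the per-head output (hook_z) of the source head in GPT2-Small
    _, cache = gpt2.run_with_cache(prompt)
    src_z = cache["z", src_layer][:, :, src_head, :]  # (batch, seq, d_head)

    def patch_z(z, hook):
        # Both models share the tokenizer and d_head=64, so shapes line up
        z[:, :, dst_head, :] = src_z
        return z

    return distil.run_with_hooks(
        prompt, fwd_hooks=[(f"blocks.{dst_layer}.attn.hook_z", patch_z)]
    )
```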

Key Findings

Component Mapping:

  • GPT2-Small layers \(\geq\) L10 map to DistilGPT2 layers \(\geq\) L4
  • GPT2-Small L10H9 ↔ DistilGPT2 L5H1
  • GPT2-Small L11H8 ↔ DistilGPT2 L5H8

Patching Results:

  • Upswap (replace distilled with original): Improves average logit difference (+68%) but reduces accuracy (-7.7%)
  • Downswap (replace original with distilled): No significant performance degradation
  • Attention patterns from specific impactful heads transfer better than expected

Conclusions

  • Circuit Preservation: The pronoun resolution circuit is largely preserved during distillation
  • Transferability: Attention patterns from impactful heads transfer seamlessly between models
  • Confidence vs Accuracy Trade-off: Upswapping increases model confidence but decreases accuracy

Future Directions

  • Use ACDC for automated circuit discovery
  • Apply path patching for deeper analysis
  • Extend to other language tasks
  • Develop distillation techniques that specifically preserve critical circuits

Code: github.com/rishabhranawat/cross-network-patching

This work builds on tools from Neel Nanda and Arthur Conmy, and extends Mathwin et al.'s work on pronoun identification circuits.

FaceNet: Face Recognition and Clustering

A Unified Embedding for Face Recognition and Clustering

Paper: FaceNet: A Unified Embedding for Face Recognition and Clustering
Authors: Florian Schroff, Dmitry Kalenichenko, and James Philbin (Google)
Area: Computer Vision, Clustering, Classification, Deep Learning
Year: 2015
Highlighted Paper

Background:

  • Euclidean space, distance and vector norms
  • Nearest neighbor/k-means/clustering
  • CNNs: Stanford's CS231N course notes, Deep learning book chapter 9
  • Important architectures for this paper:
    1. Zeiler & Fergus model
    2. GoogLeNet Inception model

Key Contributions:

  1. Learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. The method uses a CNN to directly optimize the embedding itself.
  2. To train, it uses triplets of roughly aligned matching/non-matching face patches generated with an online triplet mining method.

Why is this novel? Unlike previous approaches, where a final classification layer is used to predict the class, here we are learning an embedding which can then be reused for various tasks such as verification, recognition, and clustering.

Model Triplet Loss:

$$\sum_{i}^{N}\left[\lVert f(x_{i}^{a}) - f(x_{i}^{p}) \rVert_{2}^{2} - \lVert f(x_{i}^{a}) - f(x_{i}^{n}) \rVert_{2}^{2} + \alpha\right]_{+}$$

Here \(a\) is the anchor, \(p\) is a positive, \(n\) is a negative, \(\alpha\) is the margin, and \([\cdot]_{+} = \max(\cdot, 0)\). You essentially want the positive to be closer to the anchor than any negative, by at least the margin \(\alpha\).
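As a sketch, the loss is a few lines of PyTorch (the paper uses a margin of \(\alpha = 0.2\)):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge on squared L2 distances between embeddings, as in the formula above."""
    pos = (anchor - positive).pow(2).sum(dim=1)  # ||f(a) - f(p)||^2
    neg = (anchor - negative).pow(2).sum(dim=1)  # ||f(a) - f(n)||^2
    return F.relu(pos - neg + alpha).mean()
```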

Triplet Selection:
The paper focuses on online triplet generation: use large mini-batches, on the order of a few thousand exemplars, and compute the \(argmin\) and \(argmax\) only within that mini-batch (sketched below).
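A sketch of the in-batch selection in PyTorch, here the hard-negative half; the batch layout and function name are my assumptions:

```python
import torch

def hardest_negatives(embeddings, labels):
    """For each anchor, index of the closest embedding with a different label."""
    d = torch.cdist(embeddings, embeddings).pow(2)     # pairwise squared L2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity mask
    return d.masked_fill(same, float("inf")).argmin(dim=1)
```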

CNN Architectures:
The paper evaluates two existing architectures: the GoogLeNet-style Inception model and the Zeiler & Fergus model.

Some implementations:
tbmoon's PyTorch implementation
timesler's implementation

Datasets:
Labeled Faces in the Wild (LFW), courtesy of UMass
YouTube Faces DB