How Does Model Distillation Impact Circuits?

Using cross-network patching to compare circuits between GPT2-Small and DistilGPT2

This research explores how neural circuits are preserved (or transformed) during model distillation. The central question: how comparable is a circuit identified in a transformer model to the corresponding circuit in its distilled version?

The Approach: Cross-Network Patching

I introduced a technique called cross-network patching, which extends standard activation patching to compare circuits across models with different architectures. By examining GPT2-Small (≈85M non-embedding parameters, 12 layers) and DistilGPT2 (≈42M non-embedding parameters, 6 layers), we can investigate how information is compressed during distillation.
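As a point of reference, here is a minimal setup sketch, assuming the TransformerLens library (the `teacher`/`student` variable names are mine, not the repository's):

```python
from transformer_lens import HookedTransformer

teacher = HookedTransformer.from_pretrained("gpt2")        # GPT2-Small: 12 layers
student = HookedTransformer.from_pretrained("distilgpt2")  # DistilGPT2: 6 layers

# Both models share d_model=768, 12 attention heads per layer, and the GPT-2 tokenizer,
# which is what makes patching activations across them dimensionally straightforward.
print(teacher.cfg.n_layers, teacher.cfg.d_model, teacher.cfg.n_heads)  # 12 768 12
print(student.cfg.n_layers, student.cfg.d_model, student.cfg.n_heads)  # 6 768 12
```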

Task: Pronoun Resolution

We focused on the "Choosing The Right Pronoun" task, testing each model's ability to select contextually appropriate pronouns. The primary metric is the average difference in logits between correct and incorrect pronoun predictions.
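To make the metric concrete, here is a small sketch of the logit-difference computation, reusing the models loaded above (the prompt and pronoun tokens are illustrative placeholders, not the actual evaluation set):

```python
def pronoun_logit_diff(model, prompt: str, correct: str, incorrect: str) -> float:
    """Logit(correct pronoun) - logit(incorrect pronoun) at the final position."""
    logits = model(prompt, return_type="logits")   # [batch, pos, d_vocab]
    last = logits[0, -1]                           # next-token logits
    return (last[model.to_single_token(correct)]
            - last[model.to_single_token(incorrect)]).item()

# Illustrative prompt in the style of the pronoun-resolution task
print(pronoun_logit_diff(teacher, "So Anne is a very good cook, isn't", " she", " he"))
```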

Methodology

  1. Direct Logit Attribution: Identify layers, heads, and attention patterns crucial for pronoun resolution in both models
  2. Cross-Model Component Mapping: Map analogous components between GPT2-Small and DistilGPT2
  3. Cross-Network Patching: Patch activations between models and measure the impact on task performance (a sketch follows this list)
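One plausible implementation of step 3, sketched with TransformerLens hooks: it patches a single attention head's output (the "z" activation) from a donor model into a recipient model's forward pass. The function names, and the choice to patch at the head-output level, are my assumptions rather than the repository's exact code.

```python
from functools import partial
from transformer_lens import utils

def replace_head_output(z, hook, donor_z, recipient_head, donor_head):
    # z: [batch, pos, head_index, d_head] -- the recipient's head outputs at this layer
    z[:, :, recipient_head, :] = donor_z[:, :, donor_head, :]
    return z

def cross_network_patch(recipient, donor, prompt,
                        recipient_layer, recipient_head,
                        donor_layer, donor_head):
    """Run `recipient` on `prompt`, with one head's output spliced in from `donor`.

    This works because both models share d_head=64 and the same tokenizer,
    so the donor's cached activation aligns token-for-token with the recipient's run.
    """
    tokens = recipient.to_tokens(prompt)
    _, donor_cache = donor.run_with_cache(tokens)
    donor_z = donor_cache[utils.get_act_name("z", donor_layer)]
    hook_fn = partial(replace_head_output, donor_z=donor_z,
                      recipient_head=recipient_head, donor_head=donor_head)
    return recipient.run_with_hooks(
        tokens,
        fwd_hooks=[(utils.get_act_name("z", recipient_layer), hook_fn)],
    )
```

The logit-difference metric from earlier can then be evaluated on the patched logits to quantify each swapped component's effect.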

Key Findings

Component Mapping (encoded as a lookup in the sketch after this list):

  • GPT2-Small layers ≥ L10 map to DistilGPT2 layers ≥ L4
  • GPT2-Small L10H9 ↔ DistilGPT2 L5H1
  • GPT2-Small L11H8 ↔ DistilGPT2 L5H8
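Encoded as a small lookup (the dictionary itself is just my illustration; the indices come from the list above):

```python
# (GPT2-Small layer, head) -> (DistilGPT2 layer, head)
HEAD_MAP = {
    (10, 9): (5, 1),
    (11, 8): (5, 8),
}
```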

Patching Results (both swap directions are illustrated in the sketch after this list):

  • Upswap (replace distilled-model activations with the original model's): Improves the average logit difference (+68%) but reduces accuracy (-7.7%)
  • Downswap (replace original-model activations with the distilled model's): No significant performance degradation
  • Attention patterns from specific impactful heads transfer better than expected
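Concretely, the two swap directions correspond to the two directions of cross_network_patch sketched earlier, shown here for the L10H9 ↔ L5H1 pair (the prompt is again an illustrative placeholder):

```python
prompt = "So Anne is a very good cook, isn't"   # illustrative prompt

# Upswap: run DistilGPT2, but replace L5H1's output with GPT2-Small's L10H9 output
upswap_logits = cross_network_patch(recipient=student, donor=teacher, prompt=prompt,
                                    recipient_layer=5, recipient_head=1,
                                    donor_layer=10, donor_head=9)

# Downswap: run GPT2-Small, but replace L10H9's output with DistilGPT2's L5H1 output
downswap_logits = cross_network_patch(recipient=teacher, donor=student, prompt=prompt,
                                      recipient_layer=10, recipient_head=9,
                                      donor_layer=5, donor_head=1)
```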

Conclusions

  • Circuit Preservation: The pronoun resolution circuit is largely preserved during distillation
  • Transferability: Attention patterns from impactful heads transfer seamlessly between models
  • Confidence vs Accuracy Trade-off: Upswapping increases model confidence but decreases accuracy

Future Directions

  • Use ACDC for automated circuit discovery
  • Apply path patching for deeper analysis
  • Extend to other language tasks
  • Develop distillation techniques that specifically preserve critical circuits

Code: github.com/rishabhranawat/cross-network-patching

This work builds on tools from Neel Nanda and Arthur Conmy, and extends Mathwin et al.'s work on pronoun identification circuits.