How Does Model Distillation Impact Circuits?

Using cross-network patching to compare circuits between GPT2-Small and DistilGPT2

This research explores how neural circuits are preserved (or transformed) during model distillation. The central question: how comparable is a circuit identified in a transformer model to the corresponding circuit in its distilled version?

The Approach: Cross-Network Patching

I introduced a technique called cross-network patching, which extends standard activation patching to compare circuits across models with different architectures. By examining GPT2-Small (≈85M non-embedding parameters, 12 layers) and DistilGPT2 (≈42M non-embedding parameters, 6 layers), we can investigate how information is compressed during distillation.
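As a point of reference, here is a minimal setup sketch, assuming the TransformerLens library (the `teacher`/`student` variable names are mine, not the repository's):

```python
from transformer_lens import HookedTransformer

teacher = HookedTransformer.from_pretrained("gpt2")        # GPT2-Small: 12 layers
student = HookedTransformer.from_pretrained("distilgpt2")  # DistilGPT2: 6 layers

# Both models share d_model=768, 12 attention heads per layer, and the GPT-2 tokenizer,
# which is what makes patching activations across them dimensionally straightforward.
print(teacher.cfg.n_layers, teacher.cfg.d_model, teacher.cfg.n_heads)  # 12 768 12
print(student.cfg.n_layers, student.cfg.d_model, student.cfg.n_heads)  # 6 768 12
```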

Task: Pronoun Resolution

We focused on the "Choosing The Right Pronoun" task, testing each model's ability to select contextually appropriate pronouns. The primary metric is the average difference in logits between correct and incorrect pronoun predictions.
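To make the metric concrete, here is a small sketch of the logit-difference computation, reusing the models loaded above (the prompt and pronoun tokens are illustrative placeholders, not the actual evaluation set):

```python
def pronoun_logit_diff(model, prompt: str, correct: str, incorrect: str) -> float:
    """Logit(correct pronoun) - logit(incorrect pronoun) at the final position."""
    logits = model(prompt, return_type="logits")   # [batch, pos, d_vocab]
    last = logits[0, -1]                           # next-token logits
    return (last[model.to_single_token(correct)]
            - last[model.to_single_token(incorrect)]).item()

# Illustrative prompt in the style of the pronoun-resolution task
print(pronoun_logit_diff(teacher, "So Anne is a very good cook, isn't", " she", " he"))
```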

Methodology

  1. Direct Logit Attribution: Identify layers, heads, and attention patterns crucial for pronoun resolution in both models
  2. Cross-Model Component Mapping: Map analogous components between GPT2-Small and DistilGPT2
  3. Cross-Network Patching: Patch activations between models and measure the impact on task performance (a sketch follows this list)
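One plausible implementation of step 3, sketched with TransformerLens hooks: it patches a single attention head's output (the "z" activation) from a donor model into a recipient model's forward pass. The function names, and the choice to patch at the head-output level, are my assumptions rather than the repository's exact code.

```python
from functools import partial
from transformer_lens import utils

def replace_head_output(z, hook, donor_z, recipient_head, donor_head):
    # z: [batch, pos, head_index, d_head] -- the recipient's head outputs at this layer
    z[:, :, recipient_head, :] = donor_z[:, :, donor_head, :]
    return z

def cross_network_patch(recipient, donor, prompt,
                        recipient_layer, recipient_head,
                        donor_layer, donor_head):
    """Run `recipient` on `prompt`, with one head's output spliced in from `donor`.

    This works because both models share d_head=64 and the same tokenizer,
    so the donor's cached activation aligns token-for-token with the recipient's run.
    """
    tokens = recipient.to_tokens(prompt)
    _, donor_cache = donor.run_with_cache(tokens)
    donor_z = donor_cache[utils.get_act_name("z", donor_layer)]
    hook_fn = partial(replace_head_output, donor_z=donor_z,
                      recipient_head=recipient_head, donor_head=donor_head)
    return recipient.run_with_hooks(
        tokens,
        fwd_hooks=[(utils.get_act_name("z", recipient_layer), hook_fn)],
    )
```

The logit-difference metric from earlier can then be evaluated on the patched logits to quantify each swapped component's effect.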

Key Findings

Component Mapping (encoded as a lookup in the sketch after this list):

  • GPT2-Small layers ≥ L10 map to DistilGPT2 layers ≥ L4
  • GPT2-Small L10H9 ↔ DistilGPT2 L5H1
  • GPT2-Small L11H8 ↔ DistilGPT2 L5H8
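Encoded as a small lookup (the dictionary itself is just my illustration; the indices come from the list above):

```python
# (GPT2-Small layer, head) -> (DistilGPT2 layer, head)
HEAD_MAP = {
    (10, 9): (5, 1),
    (11, 8): (5, 8),
}
```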

Patching Results (both swap directions are illustrated in the sketch after this list):

  • Upswap (replace distilled-model activations with the original model's): Improves the average logit difference (+68%) but reduces accuracy (-7.7%)
  • Downswap (replace original-model activations with the distilled model's): No significant performance degradation
  • Attention patterns from specific impactful heads transfer better than expected
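Concretely, the two swap directions correspond to the two directions of cross_network_patch sketched earlier, shown here for the L10H9 ↔ L5H1 pair (the prompt is again an illustrative placeholder):

```python
prompt = "So Anne is a very good cook, isn't"   # illustrative prompt

# Upswap: run DistilGPT2, but replace L5H1's output with GPT2-Small's L10H9 output
upswap_logits = cross_network_patch(recipient=student, donor=teacher, prompt=prompt,
                                    recipient_layer=5, recipient_head=1,
                                    donor_layer=10, donor_head=9)

# Downswap: run GPT2-Small, but replace L10H9's output with DistilGPT2's L5H1 output
downswap_logits = cross_network_patch(recipient=teacher, donor=student, prompt=prompt,
                                      recipient_layer=10, recipient_head=9,
                                      donor_layer=5, donor_head=1)
```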

Conclusions

  • Circuit Preservation: The pronoun resolution circuit is largely preserved during distillation
  • Transferability: Attention patterns from impactful heads transfer seamlessly between models
  • Confidence vs Accuracy Trade-off: Upswapping increases model confidence but decreases accuracy

Future Directions

  • Use ACDC for automated circuit discovery
  • Apply path patching for deeper analysis
  • Extend to other language tasks
  • Develop distillation techniques that specifically preserve critical circuits

Code: github.com/rishabhranawat/cross-network-patching

This work builds on tools from Neel Nanda and Arthur Conmy, and extends Mathwin et al.'s work on pronoun identification circuits.