DataRater: Meta-Learned Dataset Curation
15 Jan 2025

Automatically learning which training data points are valuable
I recently open-sourced an implementation of DataRater, a meta-learning framework for automated dataset curation based on the paper accepted at NeurIPS 2025.
The Problem
Foundation model quality depends heavily on training data quality. Current approaches to dataset curation rely on:
- Manual tuning of coarse-grained mixtures of large data buckets
- Filtering by hand-crafted heuristics
Both approaches are coarse and labor-intensive, and neither scales to per-sample decisions over large corpora. What if we could learn which data points are valuable automatically?
The DataRater Approach
DataRater uses meta-learning to estimate the value of individual data points. The key insight is to use meta-gradients: differentiate performance on held-out validation data with respect to the data-selection decisions, so that the selection itself can be optimized by gradient descent.
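To make the meta-gradient idea concrete, here is a toy scalar version (my own illustration, not the paper's implementation): an inner parameter takes one gradient step on a weighted training point, and we measure how the validation loss changes as that weight changes, using a finite difference in place of analytic differentiation.

```python
def inner_step(theta, x_train, w, lr=0.1):
    # Inner loss: 0.5 * w * (theta - x_train)^2, so the gradient is w * (theta - x_train)
    return theta - lr * w * (theta - x_train)

def val_loss(theta, x_val):
    # Held-out validation loss for the updated inner parameter
    return 0.5 * (theta - x_val) ** 2

def meta_grad(theta, x_train, x_val, w, eps=1e-6):
    # Finite-difference estimate of d L_val(theta') / d w
    lo = val_loss(inner_step(theta, x_train, w - eps), x_val)
    hi = val_loss(inner_step(theta, x_train, w + eps), x_val)
    return (hi - lo) / (2 * eps)

# A training point that agrees with validation gets a negative meta-gradient
# (raising its weight lowers validation loss); a conflicting point gets a
# positive one, signaling that it should be downweighted.
g_good = meta_grad(theta=0.0, x_train=1.0, x_val=1.0, w=1.0)
g_bad = meta_grad(theta=0.0, x_train=-1.0, x_val=1.0, w=1.0)
```

In the full framework the same signal is computed by backpropagating through the inner model's update rather than by finite differences.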
The framework consists of three main components:
- Inner Models: Task-specific neural networks that train on data weighted by DataRater scores
- DataRater Model: A meta-learner that assigns quality scores to individual training samples
- Meta-Training Loop: An iterative process alternating between inner model training and DataRater optimization
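Structurally, the three components fit together roughly like this. The class and method names below are illustrative, not the repo's actual API:

```python
class InnerModel:
    """Task-specific model trained on data reweighted by DataRater scores."""
    def train_step(self, batch, weights):
        ...  # weighted gradient step on the batch

class DataRater:
    """Meta-learner mapping each training sample to a scalar quality score."""
    def score(self, batch):
        ...  # per-sample scores, later normalized into weights
    def meta_update(self, meta_gradient):
        ...  # placeholder; the real update uses the meta-gradient

def meta_training_loop(datarater, inner_models, train_batches, steps):
    """Alternate between weighted inner-model training and DataRater updates."""
    for _ in range(steps):
        for model in inner_models:
            batch = next(train_batches)
            weights = datarater.score(batch)
            model.train_step(batch, weights)
        # Meta-gradient: how validation loss changes w.r.t. the scores
        datarater.meta_update(meta_gradient=None)
```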
How It Works
During each meta-training iteration:
- DataRater assigns scores to training samples
- Scores are converted to sample weights using softmax normalization
- Inner models train on the reweighted data
- DataRater is optimized based on inner model performance on validation data
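The score-to-weight step above can be sketched in a few lines of NumPy (illustrative, not the repo's code): scores pass through a softmax so the weights are positive and sum to one, and the inner training loss becomes a weighted sum over the batch.

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 0.0, -2.0])        # DataRater scores for 3 samples
weights = softmax(scores)                  # normalized sample weights
per_sample_loss = np.array([0.5, 0.5, 0.5])
inner_loss = np.sum(weights * per_sample_loss)  # reweighted training loss
```

Higher-scoring samples dominate the batch loss, so low-quality samples contribute little to the inner model's gradient.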
The population management strategy maintains multiple inner models that are periodically refreshed, providing diverse gradients for DataRater optimization.
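A minimal version of that refresh logic might look like the following (names and the age-based policy are my assumptions, not necessarily the repo's): each inner model tracks how long it has trained, and models past a maximum age are re-initialized so the DataRater keeps seeing gradients from models at different stages of training.

```python
def refresh_population(models, ages, max_age, init_fn):
    """Re-initialize inner models that have trained for max_age steps."""
    for i, age in enumerate(ages):
        if age >= max_age:
            models[i] = init_fn()  # restart this inner model from scratch
            ages[i] = 0
        else:
            ages[i] = age + 1
    return models, ages
```

Staggering the refreshes this way means the population always mixes freshly initialized and partially trained models, which diversifies the meta-gradient signal.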
Results
Evaluating on corrupted MNIST data, filtering out the lowest-scoring 10% of samples according to DataRater achieved 97.32% test accuracy versus 97.08% for the unfiltered baseline. The system learns to identify and downweight corrupted or mislabeled samples without explicit supervision.
Key Benefits
- Fine-grained curation: Works at the individual data point level, not coarse buckets
- Improved compute efficiency: Train on higher-quality subsets of data
- Automated and scalable: No manual heuristic design required
Try It Out
The implementation is available at github.com/rishabhranawat/DataRater. Quick start:

```sh
pip install -r requirements.txt
sh runbin/mnist_v1.sh
```
The codebase is designed to be extensible: you can register custom datasets and models through factory functions.
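A common shape for such a factory registry is a decorator that maps string keys to constructors, so configs can select datasets by name. The sketch below is hypothetical; the repo's actual registration API may differ:

```python
DATASET_FACTORIES = {}

def register_dataset(name):
    """Decorator that records a dataset constructor under a string key."""
    def wrap(fn):
        DATASET_FACTORIES[name] = fn
        return fn
    return wrap

@register_dataset("my_corrupted_mnist")
def build_my_dataset():
    # Stand-in for constructing and returning a real dataset object
    return {"name": "my_corrupted_mnist"}

# A config can then instantiate the dataset purely by name:
dataset = DATASET_FACTORIES["my_corrupted_mnist"]()
```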