DataRater: Meta-Learned Dataset Curation
15 Jan 2025

Automatically learning which training data points are valuable
I recently open-sourced an implementation of DataRater, a meta-learning framework for automated dataset curation based on the paper accepted at NeurIPS 2025.
The Problem
Foundation model quality depends heavily on training data quality. Current approaches to dataset curation rely on:
- Manual tuning of coarse-grained mixtures of large data buckets
- Filtering by hand-crafted heuristics
Both approaches are coarse and labor-intensive, and neither scales to per-sample decisions over large corpora. What if we could learn which data points are valuable automatically?
The DataRater Approach
DataRater uses meta-learning to estimate the value of individual data points. The key insight is to use meta-gradients: differentiate performance on held-out validation data with respect to the data-selection decisions, so that the selection itself can be optimized by gradient descent.
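To make the meta-gradient idea concrete, here is a toy scalar version (my own illustration, not the paper's implementation): an inner parameter takes one gradient step on a weighted training point, and we measure how the validation loss changes as that weight changes, using a finite difference in place of analytic differentiation.

```python
def inner_step(theta, x_train, w, lr=0.1):
    # Inner loss: 0.5 * w * (theta - x_train)^2, so the gradient is w * (theta - x_train)
    return theta - lr * w * (theta - x_train)

def val_loss(theta, x_val):
    # Held-out validation loss for the updated inner parameter
    return 0.5 * (theta - x_val) ** 2

def meta_grad(theta, x_train, x_val, w, eps=1e-6):
    # Finite-difference estimate of d L_val(theta') / d w
    lo = val_loss(inner_step(theta, x_train, w - eps), x_val)
    hi = val_loss(inner_step(theta, x_train, w + eps), x_val)
    return (hi - lo) / (2 * eps)

# A training point that agrees with validation gets a negative meta-gradient
# (raising its weight lowers validation loss); a conflicting point gets a
# positive one, signaling that it should be downweighted.
g_good = meta_grad(theta=0.0, x_train=1.0, x_val=1.0, w=1.0)
g_bad = meta_grad(theta=0.0, x_train=-1.0, x_val=1.0, w=1.0)
```

In the full framework the same signal is computed by backpropagating through the inner model's update rather than by finite differences.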
The framework consists of three main components:
- Inner Models: Task-specific neural networks that train on data weighted by DataRater scores
- DataRater Model: A meta-learner that assigns quality scores to individual training samples
- Meta-Training Loop: An iterative process alternating between inner model training and DataRater optimization
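Structurally, the three components fit together roughly like this. The class and method names below are illustrative, not the repo's actual API:

```python
class InnerModel:
    """Task-specific model trained on data reweighted by DataRater scores."""
    def train_step(self, batch, weights):
        ...  # weighted gradient step on the batch

class DataRater:
    """Meta-learner mapping each training sample to a scalar quality score."""
    def score(self, batch):
        ...  # per-sample scores, later normalized into weights
    def meta_update(self, meta_gradient):
        ...  # placeholder; the real update uses the meta-gradient

def meta_training_loop(datarater, inner_models, train_batches, steps):
    """Alternate between weighted inner-model training and DataRater updates."""
    for _ in range(steps):
        for model in inner_models:
            batch = next(train_batches)
            weights = datarater.score(batch)
            model.train_step(batch, weights)
        # Meta-gradient: how validation loss changes w.r.t. the scores
        datarater.meta_update(meta_gradient=None)
```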
How It Works
During each meta-training iteration:
- DataRater assigns scores to training samples
- Scores are converted to sample weights using softmax normalization
- Inner models train on the reweighted data
- DataRater is optimized based on inner model performance on validation data
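The score-to-weight step above can be sketched in a few lines of NumPy (illustrative, not the repo's code): scores pass through a softmax so the weights are positive and sum to one, and the inner training loss becomes a weighted sum over the batch.

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 0.0, -2.0])        # DataRater scores for 3 samples
weights = softmax(scores)                  # normalized sample weights
per_sample_loss = np.array([0.5, 0.5, 0.5])
inner_loss = np.sum(weights * per_sample_loss)  # reweighted training loss
```

Higher-scoring samples dominate the batch loss, so low-quality samples contribute little to the inner model's gradient.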
The population management strategy maintains multiple inner models that are periodically refreshed, providing diverse gradients for DataRater optimization.
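A minimal version of that refresh logic might look like the following (names and the age-based policy are my assumptions, not necessarily the repo's): each inner model tracks how long it has trained, and models past a maximum age are re-initialized so the DataRater keeps seeing gradients from models at different stages of training.

```python
def refresh_population(models, ages, max_age, init_fn):
    """Re-initialize inner models that have trained for max_age steps."""
    for i, age in enumerate(ages):
        if age >= max_age:
            models[i] = init_fn()  # restart this inner model from scratch
            ages[i] = 0
        else:
            ages[i] = age + 1
    return models, ages
```

Staggering the refreshes this way means the population always mixes freshly initialized and partially trained models, which diversifies the meta-gradient signal.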
Results
Evaluating on corrupted MNIST data, filtering out the lowest-scoring 10% of samples according to DataRater achieved 97.32% test accuracy versus 97.08% for the unfiltered baseline. The system learns to identify and downweight corrupted or mislabeled samples without explicit supervision.
Key Benefits
- Fine-grained curation: Works at the individual data point level, not coarse buckets
- Improved compute efficiency: Train on higher-quality subsets of data
- Automated and scalable: No manual heuristic design required
Try It Out
The implementation is available at github.com/rishabhranawat/DataRater. Quick start:

```sh
pip install -r requirements.txt
sh runbin/mnist_v1.sh
```
The codebase is designed to be extensible: you can register custom datasets and models through factory functions.
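A common shape for such a factory registry is a decorator that maps string keys to constructors, so configs can select datasets by name. The sketch below is hypothetical; the repo's actual registration API may differ:

```python
DATASET_FACTORIES = {}

def register_dataset(name):
    """Decorator that records a dataset constructor under a string key."""
    def wrap(fn):
        DATASET_FACTORIES[name] = fn
        return fn
    return wrap

@register_dataset("my_corrupted_mnist")
def build_my_dataset():
    # Stand-in for constructing and returning a real dataset object
    return {"name": "my_corrupted_mnist"}

# A config can then instantiate the dataset purely by name:
dataset = DATASET_FACTORIES["my_corrupted_mnist"]()
```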