DataRater: Meta-Learned Dataset Curation

Automatically learning which training data points are valuable

I recently open-sourced an implementation of DataRater, a meta-learning framework for automated dataset curation, based on the paper of the same name accepted at NeurIPS 2025.

The Problem

Foundation model quality depends heavily on training data quality. Current approaches to dataset curation rely on:

  • Manual tuning of coarse-grained mixtures of large data buckets
  • Filtering by hand-crafted heuristics

These methods are coarse and labor-intensive, and they scale poorly. What if we could learn which individual data points are valuable automatically?

The DataRater Approach

DataRater uses meta-learning to estimate the value of individual data points. The key insight is to use meta-gradients: optimizing data selection to improve performance on held-out validation data.
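In symbols (my notation, paraphrasing the paper's bilevel setup rather than quoting it): the rater parameters are chosen so that an inner model trained on rater-weighted data performs well on held-out validation data,

$$
\eta^{*} = \arg\min_{\eta}\; \mathcal{L}_{\text{val}}\big(\theta^{*}(\eta)\big),
\qquad
\theta^{*}(\eta) = \arg\min_{\theta}\; \sum_{i} w_i(\eta)\, \ell\big(\theta;\, x_i\big),
$$

where $w_i(\eta)$ is the (softmax-normalized) weight the rater assigns to training sample $x_i$. The meta-gradient is the derivative of the outer validation loss with respect to $\eta$, taken through the inner training steps.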

The framework consists of three main components:

  1. Inner Models: Task-specific neural networks that train on data weighted by DataRater scores
  2. DataRater Model: A meta-learner that assigns quality scores to individual training samples
  3. Meta-Training Loop: An iterative process alternating between inner model training and DataRater optimization

How It Works

During each meta-training iteration:

  1. DataRater assigns scores to training samples
  2. Scores are converted to sample weights using softmax normalization
  3. Inner models train on the reweighted data
  4. DataRater is optimized based on inner model performance on validation data

The population management strategy maintains multiple inner models that are periodically refreshed, providing diverse gradients for DataRater optimization.
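A minimal sketch of that refresh policy, with illustrative details (population size, refresh interval, round-robin victim choice) that are my assumptions rather than the repo's exact strategy:

```python
K = 4              # number of inner models kept alive (assumed)
REFRESH_EVERY = 100  # re-initialize one model every N steps (assumed)

# Each dict stands in for an inner model; "steps_trained" tracks its age.
population = [{"idx": i, "steps_trained": 0} for i in range(K)]

def step_population(population, global_step):
    """Train every inner model one step, then maybe refresh one of them."""
    for model in population:
        model["steps_trained"] += 1  # stands in for one gradient step
    if global_step % REFRESH_EVERY == 0:
        # Round-robin re-initialization keeps the population staggered,
        # so the rater sees gradients from fresh AND well-trained models.
        victim = (global_step // REFRESH_EVERY) % K
        population[victim] = {"idx": victim, "steps_trained": 0}

for t in range(1, 301):
    step_population(population, t)
```

After 300 steps the four models sit at different training ages, which is exactly the gradient diversity the refresh is meant to provide.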

Results

Evaluating on corrupted MNIST data, filtering out the lowest-scoring 10% of samples by DataRater score achieved 97.32% test accuracy, versus 97.08% for baseline training on the unfiltered data. The system learns to identify and downweight corrupted or mislabeled samples without explicit supervision.
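The filtering step itself is simple once scores exist. A sketch with made-up scores (in the repo they come from the trained rater):

```python
# Hypothetical post-hoc filtering: drop the lowest-scoring 10% of samples,
# then retrain on the survivors. Scores below are illustrative only.
scores = [0.9, 0.2, 0.8, -1.5, 0.4, 0.7, 0.1, 0.6, -0.3, 0.5]

drop_n = max(1, int(len(scores) * 0.10))         # how many samples to discard
ranked = sorted(range(len(scores)), key=lambda i: scores[i])  # worst first
kept = sorted(ranked[drop_n:])                   # indices of surviving samples

# The inner model is then retrained from scratch on only `kept`.
```

With these toy scores, the single dropped sample is index 3 (score -1.5), the kind of outlier that in corrupted MNIST corresponds to a damaged or mislabeled image.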

Key Benefits

  • Fine-grained curation: Works at the individual data point level, not coarse buckets
  • Improved compute efficiency: Train on higher-quality subsets of data
  • Automated and scalable: No manual heuristic design required

Try It Out

The implementation is available at github.com/rishabhranawat/DataRater. Quick start:

pip install -r requirements.txt
sh runbin/mnist_v1.sh

The codebase is designed to be extensible: you can register custom datasets and models through factory functions.
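A common shape for such a registry looks like the following. The names here (`register_dataset`, `DATASET_FACTORIES`) are illustrative, not necessarily the repo's actual API; check the source for the real factory functions.

```python
# Hypothetical factory-registry sketch for plugging in custom datasets.
DATASET_FACTORIES = {}

def register_dataset(name):
    """Decorator that records a dataset-building function under `name`."""
    def decorator(build_fn):
        DATASET_FACTORIES[name] = build_fn
        return build_fn
    return decorator

@register_dataset("my_corrupted_mnist")
def build_my_dataset():
    # Construct and return your dataset object here; a dict stands in.
    return {"name": "my_corrupted_mnist", "num_samples": 60_000}

# The training script can then look datasets up by name from a config.
dataset = DATASET_FACTORIES["my_corrupted_mnist"]()
```

The same pattern extends to models: a second registry keyed by model name lets a config file select the inner architecture without code changes.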