DataRater: Meta-Learned Dataset Curation

Automatically learning which training data points are valuable

I recently open-sourced an implementation of DataRater, a meta-learning framework for automated dataset curation, based on a paper accepted at NeurIPS 2025.

The Problem

Foundation model quality depends heavily on training data quality. Current approaches to dataset curation rely on:

  • Manual tuning of coarse-grained mixtures of large data buckets
  • Filtering by hand-crafted heuristics

These methods are coarse and labor-intensive, and they don't scale well. What if we could learn which data points are valuable automatically?

The DataRater Approach

DataRater uses meta-learning to estimate the value of individual data points. The key insight is to use meta-gradients: optimizing data selection to improve performance on held-out validation data.

The framework consists of three main components:

  1. Inner Models: Task-specific neural networks that train on data weighted by DataRater scores
  2. DataRater Model: A meta-learner that assigns quality scores to individual training samples
  3. Meta-Training Loop: An iterative process alternating between inner model training and DataRater optimization

How It Works

During each meta-training iteration:

  1. DataRater assigns scores to training samples
  2. Scores are converted to sample weights using softmax normalization
  3. Inner models train on the reweighted data
  4. DataRater is optimized based on inner model performance on validation data
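
To make these steps concrete, here is a minimal, self-contained PyTorch sketch of one meta-step. It is illustrative rather than the repository's code: the linear inner model, linear rater, and all hyperparameters are stand-ins, but the mechanics (softmax-weighted loss, a differentiable inner update, and a meta-gradient from the validation loss back to the rater) follow the loop above.

import torch
import torch.nn.functional as F

# Toy data and models; purely illustrative, not the repo's setup.
torch.manual_seed(0)
d, n_train, n_val, n_classes = 20, 64, 32, 5
x_tr, y_tr = torch.randn(n_train, d), torch.randint(0, n_classes, (n_train,))
x_va, y_va = torch.randn(n_val, d), torch.randint(0, n_classes, (n_val,))

w = torch.zeros(d, n_classes, requires_grad=True)   # inner model (linear)
phi = torch.zeros(d, 1, requires_grad=True)         # DataRater (linear scorer)
meta_opt = torch.optim.Adam([phi], lr=1e-2)
inner_lr = 0.1

for step in range(100):
    # 1) Score training samples; 2) softmax-normalize scores into weights.
    weights = torch.softmax((x_tr @ phi).squeeze(-1), dim=0)

    # 3) One inner-model update on the reweighted loss, kept differentiable in phi.
    losses = F.cross_entropy(x_tr @ w, y_tr, reduction="none")
    grad_w = torch.autograd.grad((weights * losses).sum(), w, create_graph=True)[0]
    w_new = w - inner_lr * grad_w

    # 4) Meta-gradient: validation loss of the updated inner model drives phi.
    val_loss = F.cross_entropy(x_va @ w_new, y_va)
    meta_opt.zero_grad()
    val_loss.backward()
    meta_opt.step()

    # Commit the inner step and cut the graph before the next iteration.
    w = w_new.detach().requires_grad_(True)

The create_graph=True call is what lets the validation loss differentiate through the inner update back to the rater's parameters.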

The population management strategy maintains multiple inner models that are periodically refreshed, providing diverse gradients for DataRater optimization.
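
A hypothetical sketch of that refresh scheme (the population size and refresh interval here are assumptions, not values from the repo):

import torch.nn as nn

def make_inner_model():
    # Placeholder MNIST-sized classifier; the real architecture may differ.
    return nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

population = [make_inner_model() for _ in range(4)]

def maybe_refresh(population, step, every=500):
    # Periodically replace one member with a fresh initialization so the
    # DataRater keeps receiving gradients from models at varied training stages.
    if step > 0 and step % every == 0:
        idx = (step // every) % len(population)
        population[idx] = make_inner_model()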

Results

On corrupted MNIST data, filtering out the 10% of samples with the lowest DataRater scores achieved 97.32% test accuracy, versus 97.08% for baseline training. The system learns to identify and downweight corrupted or mislabeled samples without explicit supervision.

Key Benefits

  • Fine-grained curation: Works at the individual data point level, not coarse buckets
  • Improved compute efficiency: Train on higher-quality subsets of data
  • Automated and scalable: No manual heuristic design required

Try It Out

The implementation is available at github.com/rishabhranawat/DataRater. Quick start:

pip install -r requirements.txt
sh runbin/mnist_v1.sh

The codebase is designed to be extensible: you can register custom datasets and models through factory functions.
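
As a sense of the pattern, here is a generic registration sketch; the registry and decorator names are hypothetical, not necessarily the repo's actual API:

DATASETS = {}

def register_dataset(name):
    # Hypothetical registry decorator; consult the repo for the real interface.
    def decorator(factory):
        DATASETS[name] = factory
        return factory
    return decorator

@register_dataset("corrupted_mnist")
def corrupted_mnist(corruption_rate=0.1):
    # Build and return the dataset; body elided in this sketch.
    ...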

XQuiz: AI-Powered Learning from Your Twitter Feed

A Chrome extension that turns passive scrolling into active learning

I built XQuiz, a Chrome extension that generates AI-powered quiz questions based on the content you scroll past on Twitter/X. The idea is to enhance focus and information retention while browsing.

The Problem

We scroll through enormous amounts of information on social media, but how much do we actually retain? Most content passes through our attention without sticking. What if we could turn that passive consumption into active learning?

How It Works

  1. Content Tracking: The extension monitors tweets as you scroll through your feed
  2. AI Question Generation: Using Google's Gemini API, it generates quiz questions (multiple choice, true/false, fill-in-the-blank) based on what you've seen
  3. Assessment: Periodic quizzes test your retention of the content
  4. Analytics: Track your scores, accuracy, and answer streaks over time
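
The extension drives this from its JavaScript service worker; purely as an illustration of the generation step, here is an equivalent call in Python with the google-generativeai SDK (the prompt and model name are my own stand-ins, not the extension's):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Tweets the content script marked as "viewed" (placeholder text here).
tweets = ["Example tweet text the user scrolled past"]
prompt = (
    "Write one multiple-choice question testing recall of these tweets, "
    "with four options and the correct answer marked:\n" + "\n".join(tweets)
)

response = model.generate_content(prompt)
print(response.text)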

Features

  • Customizable frequency: Adjust how often quizzes appear
  • Attention timer: Set thresholds for when content counts as "viewed"
  • Video-hiding mode: Optional feature to reduce distractions
  • Quiz history: Review past questions and your performance

Technical Stack

  • Chrome Extension (Manifest V3)
  • Service worker backend for background processing
  • Content script for page monitoring
  • Side panel UI for quiz interface
  • Google Gemini API for question generation

Development Note

This project was built entirely with AI coding agents as companions, exploring how far AI-assisted development can go for a complete, functional product.

Check out the project: github.com/rishabhranawat/xquiz

DataMixer: A Library for Combining Datasets

Flexible data mixing for machine learning pretraining

I've been working on DataMixer, a Python library for combining multiple datasets using customizable algorithms. The goal is to provide a clean, modular framework for controlling dataset composition in ML training pipelines.

The Problem

When training large models, especially for multilingual or multi-domain applications, you often need to combine data from multiple sources. Key questions arise:

  • How much should each dataset contribute to training?
  • How do you balance between high-resource and low-resource domains?
  • How do you prevent catastrophic forgetting of underrepresented categories?

Design Philosophy

DataMixer separates combination strategies from sampling logic. This makes it easy to:

  • Swap in different mixing algorithms
  • Compare approaches on the same data
  • Add new methods without modifying core code
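
One hypothetical way to express that separation (the interface names are mine, not necessarily DataMixer's):

from abc import ABC, abstractmethod

class MixingAlgorithm(ABC):
    # Strategy: decides how much each dataset contributes.
    @abstractmethod
    def proportions(self, dataset_sizes: list[int], budget: int) -> list[float]:
        ...

class UniformMix(MixingAlgorithm):
    # Trivial strategy for illustration: an equal share per dataset.
    def proportions(self, dataset_sizes, budget):
        return [1.0 / len(dataset_sizes)] * len(dataset_sizes)

# The sampler consumes proportions without caring how they were chosen,
# so a new algorithm drops in without touching the sampling code.
def sample_counts(algorithm, dataset_sizes, budget):
    return [round(p * budget) for p in algorithm.proportions(dataset_sizes, budget)]

print(sample_counts(UniformMix(), [10_000, 500, 100], budget=6000))  # [2000, 2000, 2000]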

Supported Algorithms

The library currently implements two research-backed approaches:

UniMax (Chung et al., 2023)

  • Designed for multilingual pretraining
  • Provides fair and effective language sampling
  • Caps maximum epochs per language to prevent overfitting on small datasets
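
A compact sketch of the core allocation idea as I read the paper (my paraphrase, not DataMixer's implementation):

def unimax_allocation(corpus_sizes, budget, max_epochs=2):
    # Spread the budget as uniformly as possible across languages while
    # never exceeding max_epochs passes over any single corpus.
    caps = {lang: n * max_epochs for lang, n in corpus_sizes.items()}
    alloc, remaining = {}, budget
    for i, (lang, cap) in enumerate(sorted(caps.items(), key=lambda kv: kv[1])):
        share = remaining / (len(caps) - i)  # uniform share of what's left
        alloc[lang] = min(cap, share)
        remaining -= alloc[lang]
    return alloc

print(unimax_allocation({"en": 10_000, "sw": 500, "yo": 100}, budget=6000))
# {'yo': 200, 'sw': 1000, 'en': 4800.0}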

UtiliMax (Held et al., 2025)

  • Uses LLM-estimated utility scores
  • Optimizes pretraining data mixtures based on downstream task performance

Usage

from datamixer import DataMixer, UniMax

# dataset1..dataset3 are your source datasets, loaded beforehand
mixer = DataMixer(
    datasets=[dataset1, dataset2, dataset3],
    algorithm=UniMax(
        budget=6000,
        seed=42,
        max_epochs_per_language=2
    )
)

mixer.mix()  # compute mixing proportions for the configured algorithm
mixed_dataset = mixer.sample(
    datasets=[...],
    output_size=6000
)

# View dataset proportions
print(mixer.proportions)

Future Work

Planning to add:

  • More mixing algorithms from recent research
  • Integration with HuggingFace datasets
  • Visualization tools for dataset composition

Check out the project at github.com/rishabhranawat/DataMixer.