DataMixer: A Library for Combining Datasets
10 Jan 2025
Flexible data mixing for machine learning pretraining
I've been working on DataMixer, a Python library for combining multiple datasets using customizable algorithms. The goal is to provide a clean, modular framework for controlling dataset composition in ML training pipelines.
The Problem
When training large models, especially for multilingual or multi-domain applications, you often need to combine data from multiple sources. Key questions arise:
- How much should each dataset contribute to training?
- How do you balance between high-resource and low-resource domains?
- How do you prevent catastrophic forgetting of underrepresented categories?
Design Philosophy
DataMixer separates combination strategies from sampling logic (see the interface sketch after this list). This makes it easy to:
- Swap in different mixing algorithms
- Compare approaches on the same data
- Add new methods without modifying core code
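Concretely, every mixing strategy can be expressed against one small interface, so the sampler never needs to know which algorithm it is running. The sketch below is illustrative only: `MixingAlgorithm`, `allocate`, and `UniformMix` are hypothetical names, not DataMixer's actual API.

```python
from abc import ABC, abstractmethod

class MixingAlgorithm(ABC):
    """Hypothetical strategy interface: given per-dataset sizes and a
    sampling budget, decide how many examples to draw from each dataset."""

    @abstractmethod
    def allocate(self, sizes: list[int], budget: int) -> list[int]:
        ...

class UniformMix(MixingAlgorithm):
    """Simplest possible strategy: split the budget evenly."""

    def allocate(self, sizes: list[int], budget: int) -> list[int]:
        share = budget // len(sizes)
        return [min(share, n) for n in sizes]
```

Because the sampler depends only on this contract, adding a new method means writing one class, and comparing algorithms on the same data means passing a different object.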
Supported Algorithms
The library currently implements two research-backed approaches:
UniMax (Chung et al., 2023)
- Designed for multilingual pretraining
- Provides fair and effective language sampling
- Caps the maximum number of epochs per language to prevent overfitting on small datasets (sketched below)
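The core of UniMax is a budget-allocation step: spread a total budget as evenly as possible across languages, but never let any language exceed its epoch cap. Here is a minimal sketch of that allocation, my paraphrase of the paper's idea rather than DataMixer's internal code:

```python
def unimax_allocation(sizes, budget, max_epochs):
    """Spread `budget` examples across datasets as uniformly as possible,
    never letting any dataset exceed max_epochs passes over its data.
    (Illustrative paraphrase of the UniMax idea, not DataMixer's code.)"""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])  # smallest first
    alloc = [0] * len(sizes)
    remaining_budget, remaining = budget, len(sizes)
    for i in order:
        share = remaining_budget // remaining  # uniform share of what's left
        cap = sizes[i] * max_epochs            # epoch cap for this dataset
        alloc[i] = min(share, cap)
        remaining_budget -= alloc[i]
        remaining -= 1
    return alloc

print(unimax_allocation([100, 1000, 10000], budget=6000, max_epochs=2))
# -> [200, 2000, 3800]: the two small corpora hit their epoch caps,
#    and the surplus flows to the largest corpus.
```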
UtiliMax (Held et al., 2025)
- Uses LLM-estimated utility scores
- Optimizes pretraining data mixtures for downstream task performance (toy illustration below)
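I won't reproduce the full method here: UtiliMax estimates each dataset's utility and then optimizes the mixture against those estimates. As a deliberately simplified illustration of the utility-to-weights step only, the snippet below uses a temperature-scaled softmax, which is my stand-in for the paper's optimization, with made-up utility numbers:

```python
import numpy as np

def utility_weights(utilities, temperature=1.0):
    """Toy example: turn estimated utilities into mixing proportions
    via a softmax. UtiliMax itself solves a constrained optimization;
    this only shows the shape of the inputs and outputs."""
    u = np.asarray(utilities, dtype=float) / temperature
    w = np.exp(u - u.max())  # subtract max for numerical stability
    return w / w.sum()

# Hypothetical LLM-estimated utilities for three datasets
print(utility_weights([0.9, 0.4, 0.1]))  # higher utility -> larger share
```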
Usage
```python
from datamixer import DataMixer, UniMax

mixer = DataMixer(
    datasets=[dataset1, dataset2, dataset3],
    algorithm=UniMax(
        budget=6000,
        seed=42,
        max_epochs_per_language=2,
    ),
)

# Compute the mixture
mixer.mix()

# Draw a mixed dataset of the requested size
mixed_dataset = mixer.sample(
    datasets=[...],
    output_size=6000,
)

# View dataset proportions
print(mixer.proportions)
```
Future Work
I'm planning to add:
- More mixing algorithms from recent research
- Integration with HuggingFace datasets
- Visualization tools for dataset composition
Check out the project at github.com/rishabhranawat/DataMixer.