DataMixer: A Library for Combining Datasets

Flexible data mixing for machine learning pretraining

I've been working on DataMixer, a Python library for combining multiple datasets using customizable algorithms. The goal is to provide a clean, modular framework for controlling dataset composition in ML training pipelines.

The Problem

When training large models, especially for multilingual or multi-domain applications, you often need to combine data from multiple sources. Key questions arise:

  • How much should each dataset contribute to training?
  • How do you balance between high-resource and low-resource domains?
  • How do you prevent catastrophic forgetting of underrepresented categories?

Design Philosophy

DataMixer separates combination strategies from sampling logic. This makes it easy to:

  • Swap in different mixing algorithms
  • Compare approaches on the same data
  • Add new methods without modifying core code (see the sketch below)
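
One way to picture this separation: a mixing algorithm only decides how much of each dataset to use, and the mixer handles the actual sampling. Below is a minimal sketch of that idea; the MixingAlgorithm and UniformMix names are illustrative only and are not DataMixer's actual interface.

from typing import Protocol, Sequence

class MixingAlgorithm(Protocol):
    """Anything that turns dataset sizes into mixing proportions."""

    def compute_proportions(self, sizes: Sequence[int]) -> list[float]:
        ...

class UniformMix:
    """Trivial strategy: every dataset contributes equally."""

    def compute_proportions(self, sizes: Sequence[int]) -> list[float]:
        return [1.0 / len(sizes)] * len(sizes)

# Because the mixer depends only on the interface above, a new
# algorithm can be dropped in without touching the sampling code.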

Supported Algorithms

The library currently implements two research-backed approaches:

UniMax (Chung et al., 2023)

  • Designed for multilingual pretraining
  • Provides fair and effective language sampling
  • Caps maximum epochs per language to prevent overfitting on small datasets (sketched below)
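
For intuition, here is a rough sketch of a UniMax-style budget split: hand out the token budget as uniformly as possible across languages, but never give a language more than the epoch cap allows. This is my own simplified rendering for illustration, not DataMixer's internal implementation.

def unimax_allocation(corpus_sizes, budget, max_epochs):
    """Simplified UniMax-style budget split (illustrative only)."""
    # Visit the smallest corpora first; they are the ones most
    # likely to hit the epoch cap.
    order = sorted(range(len(corpus_sizes)), key=lambda i: corpus_sizes[i])
    allocation = [0.0] * len(corpus_sizes)
    remaining_budget = float(budget)
    remaining_langs = len(corpus_sizes)
    for i in order:
        uniform_share = remaining_budget / remaining_langs
        cap = corpus_sizes[i] * max_epochs
        allocation[i] = min(uniform_share, cap)
        remaining_budget -= allocation[i]
        remaining_langs -= 1
    total = sum(allocation)
    return [a / total for a in allocation]

# Three corpora of very different sizes, 6000-token budget:
# the smallest is capped at 2 epochs, the rest split the remainder.
print(unimax_allocation([100, 1000, 10000], budget=6000, max_epochs=2))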

UtiliMax (Held et al., 2025)

  • Uses LLM-estimated utility scores
  • Optimizes pretraining data mixtures based on downstream task performance (a simplified illustration follows)
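
UtiliMax itself solves an optimization problem over the utility estimates; the snippet below is only a toy stand-in that reweights datasets by utility score under an epoch cap, to convey the general idea. Every name and parameter here is assumed for illustration and is neither the paper's algorithm nor the library's API.

def utility_weighted_mix(corpus_sizes, utilities, budget, max_epochs):
    """Toy utility-weighted split; NOT the actual UtiliMax optimization."""
    # Weight each dataset by its estimated utility...
    total_utility = sum(utilities)
    allocation = [budget * u / total_utility for u in utilities]
    # ...then cap any dataset that would be repeated too many times.
    allocation = [min(a, size * max_epochs)
                  for a, size in zip(allocation, corpus_sizes)]
    total = sum(allocation)
    return [a / total for a in allocation]

# Hypothetical utility scores for the same three corpora as above.
print(utility_weighted_mix([100, 1000, 10000], [0.9, 0.5, 0.2],
                           budget=6000, max_epochs=2))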

Usage

from datamixer import DataMixer, UniMax

mixer = DataMixer(
    datasets=[dataset1, dataset2, dataset3],
    algorithm=UniMax(
        budget=6000,
        seed=42,
        max_epochs_per_language=2
    )
)

# Compute the mixing proportions with the configured algorithm
mixer.mix()

# Draw a mixed dataset of the requested size
mixed_dataset = mixer.sample(
    datasets=[...],
    output_size=6000
)

# View dataset proportions
print(mixer.proportions)
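
Because the algorithm is just a constructor argument, it is easy to compare mixes under different settings on the same data. A quick sketch, reusing only the constructor and proportions attribute shown above:

# Compare how the mix shifts as the epoch cap changes.
for cap in (1, 2, 4):
    mixer = DataMixer(
        datasets=[dataset1, dataset2, dataset3],
        algorithm=UniMax(budget=6000, seed=42, max_epochs_per_language=cap)
    )
    mixer.mix()
    print(cap, mixer.proportions)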

Future Work

I'm planning to add:

  • More mixing algorithms from recent research
  • Integration with HuggingFace datasets
  • Visualization tools for dataset composition

Check out the project at github.com/rishabhranawat/DataMixer.