DataMixer: A Library for Combining Datasets
10 Jan 2025
Flexible data mixing for machine learning pretraining
I've been working on DataMixer, a Python library for combining multiple datasets using customizable algorithms. The goal is to provide a clean, modular framework for controlling dataset composition in ML training pipelines.
The Problem
When training large models, especially for multilingual or multi-domain applications, you often need to combine data from multiple sources. Key questions arise:
- How much should each dataset contribute to training?
- How do you balance between high-resource and low-resource domains?
- How do you prevent catastrophic forgetting of underrepresented categories?
Design Philosophy
DataMixer separates combination strategies from sampling logic (see the interface sketch after this list). This makes it easy to:
- Swap in different mixing algorithms
- Compare approaches on the same data
- Add new methods without modifying core code
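Concretely, every mixing strategy can be expressed against one small interface, so the sampler never needs to know which algorithm it is running. The sketch below is illustrative only: `MixingAlgorithm`, `allocate`, and `UniformMix` are hypothetical names, not DataMixer's actual API.

```python
from abc import ABC, abstractmethod

class MixingAlgorithm(ABC):
    """Hypothetical strategy interface: given per-dataset sizes and a
    sampling budget, decide how many examples to draw from each dataset."""

    @abstractmethod
    def allocate(self, sizes: list[int], budget: int) -> list[int]:
        ...

class UniformMix(MixingAlgorithm):
    """Simplest possible strategy: split the budget evenly."""

    def allocate(self, sizes: list[int], budget: int) -> list[int]:
        share = budget // len(sizes)
        return [min(share, n) for n in sizes]
```

Because the sampler depends only on this contract, adding a new method means writing one class, and comparing algorithms on the same data means passing a different object.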
Supported Algorithms
The library currently implements two research-backed approaches:
UniMax (Chung et al., 2023)
- Designed for multilingual pretraining
- Provides fair and effective language sampling
- Caps the maximum number of epochs per language to prevent overfitting on small datasets (sketched below)
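The core of UniMax is a budget-allocation step: spread a total budget as evenly as possible across languages, but never let any language exceed its epoch cap. Here is a minimal sketch of that allocation, my paraphrase of the paper's idea rather than DataMixer's internal code:

```python
def unimax_allocation(sizes, budget, max_epochs):
    """Spread `budget` examples across datasets as uniformly as possible,
    never letting any dataset exceed max_epochs passes over its data.
    (Illustrative paraphrase of the UniMax idea, not DataMixer's code.)"""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])  # smallest first
    alloc = [0] * len(sizes)
    remaining_budget, remaining = budget, len(sizes)
    for i in order:
        share = remaining_budget // remaining  # uniform share of what's left
        cap = sizes[i] * max_epochs            # epoch cap for this dataset
        alloc[i] = min(share, cap)
        remaining_budget -= alloc[i]
        remaining -= 1
    return alloc

print(unimax_allocation([100, 1000, 10000], budget=6000, max_epochs=2))
# -> [200, 2000, 3800]: the two small corpora hit their epoch caps,
#    and the surplus flows to the largest corpus.
```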
UtiliMax (Held et al., 2025)
- Uses LLM-estimated utility scores
- Optimizes pretraining data mixtures for downstream task performance (toy illustration below)
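I won't reproduce the full method here: UtiliMax estimates each dataset's utility and then optimizes the mixture against those estimates. As a deliberately simplified illustration of the utility-to-weights step only, the snippet below uses a temperature-scaled softmax, which is my stand-in for the paper's optimization, with made-up utility numbers:

```python
import numpy as np

def utility_weights(utilities, temperature=1.0):
    """Toy example: turn estimated utilities into mixing proportions
    via a softmax. UtiliMax itself solves a constrained optimization;
    this only shows the shape of the inputs and outputs."""
    u = np.asarray(utilities, dtype=float) / temperature
    w = np.exp(u - u.max())  # subtract max for numerical stability
    return w / w.sum()

# Hypothetical LLM-estimated utilities for three datasets
print(utility_weights([0.9, 0.4, 0.1]))  # higher utility -> larger share
```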
Usage
```python
from datamixer import DataMixer, UniMax

mixer = DataMixer(
    datasets=[dataset1, dataset2, dataset3],
    algorithm=UniMax(
        budget=6000,
        seed=42,
        max_epochs_per_language=2,
    ),
)

# Compute the mixture
mixer.mix()

# Draw a mixed dataset of the requested size
mixed_dataset = mixer.sample(
    datasets=[...],
    output_size=6000,
)

# View dataset proportions
print(mixer.proportions)
```
Future Work
I'm planning to add:
- More mixing algorithms from recent research
- Integration with HuggingFace datasets
- Visualization tools for dataset composition
Check out the project at github.com/rishabhranawat/DataMixer.