The Pile (dataset)

The Pile
Size: 886.03 GB
Type: Open-source
Language: English
Creator(s): EleutherAI
Date of release: December 31, 2020
Main application(s): Training large language models

The Pile is an 886 GB, diverse, open-source dataset of English text created to train large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It comprises 22 component sub-datasets. As of 2024, The Pile and Common Crawl were the two main datasets used to train AI models.

Copyright disputes over the use of The Pile escalated in 2023, prompting EleutherAI to begin removing some datasets. In 2025, EleutherAI partnered with various organizations to release Common Pile v0.1, a large curated training dataset intended to avoid those copyright issues.