Introducing SimpleStories
12 Mar 2024
Lennart Finke, Juan Diego Rodriguez, Thomas Dooms, Mat Allen, Thomas Marshall, Noa Nabeshima, Dan Braun
Motivation
To interpret deep neural networks, one must answer two questions at once: what to look for, and how to look for it. We know neither what a deep model’s internal mechanisms are nor how they are represented. This stems from the lack of structure in most datasets, which contain random scraps of text from the internet. Hence, there is an increasing need for structured datasets that can guide interpretability research toward concrete goals.
One milestone toward this is TinyStories, a dataset containing millions of children’s stories. Tiny language models (~10 million parameters) trained on this dataset can generate coherent and creative stories. The combination of tiny models and a constrained problem space forms a perfect testbed for understanding these models.
Unfortunately, this dataset suffers from a few issues:
- Stories were generated with older models, leading to formulaic and/or incoherent plots.
- Some stories are ‘corrupted’, containing large amounts of nonsense characters.
- The vocabulary isn’t constrained enough, leading to complex or even misspelled words.
Furthermore, there are ample opportunities for improvement:
- Fine-grained labels about aspects of each story (tone, moral, topic) that can serve as supervision for probing/finetuning.
- A strong focus on simplicity, while retaining diversity.
Keeping this in mind, we created SimpleStories.
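As a minimal sketch of how per-story labels could feed supervised probing, the snippet below groups stories by a label key so each class could train its own probe. The field names ("story", "topic", "tone") and the example records are illustrative assumptions, not the dataset's confirmed schema.

```python
# Hypothetical records mimicking labeled SimpleStories entries.
# Field names are assumptions for illustration, not the real schema.
sample = [
    {"story": "The little fox shared his apple.", "topic": "friendship", "tone": "warm"},
    {"story": "A storm scared the tiny mouse.", "topic": "nature", "tone": "tense"},
    {"story": "Two birds built a nest together.", "topic": "friendship", "tone": "warm"},
]

def group_by_label(records, label_key):
    """Group story texts by a label value, e.g. for training one probe per class."""
    groups = {}
    for record in records:
        groups.setdefault(record[label_key], []).append(record["story"])
    return groups

by_topic = group_by_label(sample, "topic")
```

The same helper works for any label key ("tone", "moral", …), which is the point of having several fine-grained label dimensions per story.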
How to cite
@inproceedings{finke2025tiny,
  title={[Tiny] Parameterized Synthetic Text Generation with SimpleStories},
  author={Lennart Finke and Juan Diego Rodriguez and Thomas Dooms and Mat Allen and Thomas Marshall and Noa Nabeshima and Dan Braun},
  booktitle={Will Synthetic Data Finally Solve the Data Access Problem?},
  year={2025},
  url={https://openreview.net/forum?id=JO8CtTXOsH}
}