Introducing SimpleStories

Lennart Finke, Chandan Sreedhara, Thomas Dooms, Mat Allen, Emerald Zhang, Juan Diego Rodriguez, Noa Nabeshima, Thomas Marshall, Dan Braun

Paper · Models

To interpret deep neural networks, one needs to answer two questions simultaneously: what to look for, and how to look for it. We know neither what a deep model’s internal mechanisms are nor how they are represented. This stems in part from the lack of structure in most training datasets, which consist of random scraps of data from the internet. Hence, there is an increasing need for structured datasets that can guide interpretability research toward concrete goals.

One milestone in this direction is TinyStories, a dataset containing millions of children’s stories. Tiny language models (~10 million parameters) trained on this dataset can generate coherent and creative stories. The combination of tiny models and a constrained problem space makes a perfect testbed for understanding these models.
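For concreteness, here is a minimal sketch of how one might inspect TinyStories with the Hugging Face `datasets` library, assuming the public release under the `roneneldan/TinyStories` hub ID:

```python
# Minimal sketch: load TinyStories from the Hugging Face Hub and look at
# one story. Assumes the `datasets` library is installed and that the
# dataset exposes its stories in a "text" column.
from datasets import load_dataset

stories = load_dataset("roneneldan/TinyStories", split="train")
print(f"{len(stories):,} stories in the training split")  # on the order of millions
print(stories[0]["text"])                                 # one short children's story
```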

Unfortunately, TinyStories suffers from a few issues, and there are ample opportunities for improvement.

With these shortcomings in mind, we created SimpleStories.
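As a first taste, here is a minimal sketch of loading the dataset. The hub ID `SimpleStories/SimpleStories` and the `story` column name are assumptions; see the links above for the canonical locations of the dataset and models.

```python
# Minimal sketch: load SimpleStories and print a sample. The hub ID
# "SimpleStories/SimpleStories" and the "story" column are assumptions;
# consult the dataset card for the actual schema.
from datasets import load_dataset

dataset = load_dataset("SimpleStories/SimpleStories", split="train")
print(dataset[0]["story"])
```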