TinyStories: A Tiny Dataset with Big Impact
A big leap towards feasible Small Language Models (SLMs).
✨ Time of LLMs
Nowadays, LLMs are the hype, especially with ChatGPT, Bing Chat and Bard released one right after the other. Many products and services built around this technology have popped up, forming an entirely new ecosystem. A new career in “prompt engineering” has emerged, along with far too much social media advice on writing prompts and a new “AI” product for almost everything you can imagine.
All of this hype stems from GPT-3 and GPT-4 (Generative Pre-trained Transformers), the recently developed Large Language Models (LLMs). These models, however, require tremendous amounts of data to learn from, and training them can cost several million dollars and demand supercomputer-scale computing resources.
Note: This is a very brief and high-level overview of the paper, discussing the primary idea and contribution. I’d highly recommend reading the paper for yourself to gain a better understanding.
✨ The Paper
The paper starts from the observation that small language models (SLMs) trained on standard corpora struggle to generate English text coherently and consistently. It introduces TinyStories, a synthetic dataset of short stories that use only words a typical 3 to 4-year-old can understand. The stories are generated by GPT-3.5 and GPT-4, hence the “synthetic” label. This dataset is then used to train much smaller language models, with fewer than 10 million total parameters, or with much simpler architectures containing only one transformer block. These models were able to generate fluent, consistent, multi-paragraph stories with almost perfect grammar, and they also demonstrated some reasoning abilities.
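To get a feel for just how small that is, here is a minimal, purely illustrative sketch that counts the parameters of a one-block transformer using Hugging Face's GPT2Config. The hidden size, head count, context length, and vocabulary size below are my own assumptions, not the paper's configuration (the released TinyStories models are based on a GPT-Neo-style architecture); the point is only the scale.

```python
# Illustrative only: a one-block causal transformer in the <10M-parameter regime.
# The hyperparameters here are assumptions for illustration, not the exact
# configuration used in the TinyStories paper.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=8_000,   # assumed small vocabulary for child-level English
    n_positions=512,    # assumed context length
    n_embd=256,         # assumed hidden size
    n_layer=1,          # a single transformer block, as the paper highlights
    n_head=8,           # assumed number of attention heads
)

model = GPT2LMHeadModel(config)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.1f}M")  # comes out well under 10M
```

Even with generous settings, such a model lands at a few million parameters, which is the regime the paper works in.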
They note that language is rich and diverse, and that language models have demonstrated numerous abilities such as summarization, arithmetic, translation and commonsense reasoning as they are scaled up in size and trained on enormous amounts of data.
Another capability they considered was instruction following. To test a model’s ability to consider and follow instructions, a secondary dataset, TinyStories-Instruct, was developed in which instructions are prepended to each story. These instructions are of 4 types (a rough sketch of such a training example follows the list):
- A list of words to be included in the story
- A sentence that should appear somewhere in the story
- A list of features
- A short summary
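To make this concrete, here is a hedged sketch of how an instruction-prefixed training example might be assembled. The field names, wording, and formatting are my own guesses for illustration; the paper’s exact template may differ.

```python
# Hypothetical illustration of a TinyStories-Instruct-style training example.
# The exact prompt format used in the paper may differ; this only shows the idea
# of prepending instructions (words, sentence, features, summary) to a story.
instructions = {
    "words": ["ball", "happy", "garden"],           # words that must appear in the story
    "sentence": "Tom smiled at his new red ball.",  # a sentence that must appear
    "features": ["dialogue", "happy ending"],       # high-level story features
    "summary": "A boy finds a ball and makes a friend.",
}

story = (
    "Tom was playing in the garden when he found a shiny red ball. "
    '"Do you want to play?" asked a girl named Mia. Tom smiled at his new red ball. '
    "They played together all day and Tom felt very happy."
)

# Prepend the instructions to the story, TinyStories-Instruct style.
training_example = (
    f"Words: {', '.join(instructions['words'])}\n"
    f"Sentence: {instructions['sentence']}\n"
    f"Features: {', '.join(instructions['features'])}\n"
    f"Summary: {instructions['summary']}\n"
    f"Story: {story}"
)
print(training_example)
```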
In addition to this, the paper introduces a new evaluation framework called GPT-Eval, which leverages GPT-4 to grade the smaller models as if the generated content were short stories written by students and graded by a teacher. They claim this method overcomes the flaws of standard benchmarks, which require the model’s output to be structured. It also provides a multi-dimensional score, grading different capabilities such as grammar, creativity, and consistency.
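As a rough sketch of the idea (not the paper’s exact rubric or prompt), the grading step could look something like the following, where GPT-4 plays the teacher. The prompt wording and score format below are my assumptions.

```python
# Hypothetical sketch of a GPT-Eval-style grading call.
# The prompt wording, rubric, and output format are assumptions for illustration,
# not the exact evaluation prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

story_completion = "...text generated by the small model..."  # placeholder

grading_prompt = (
    "The following is a short story written by a student. Acting as a teacher, "
    "grade it on grammar, creativity, and consistency, each on a scale of 1-10, "
    "and briefly justify each score.\n\n"
    f"Story:\n{story_completion}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": grading_prompt}],
)
print(response.choices[0].message.content)
```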
✨ Thoughts
As I previously mentioned, this is the era of language models. LLMs require a ton of data, computing resources and money. This limits research around them, as independent researchers cannot experiment with the underlying tech for themselves, and any experiment has to lean heavily on theoretical study. SLMs, meanwhile, are not given much consideration because their limited capacity constrains what they can do.
I believe this is a trend with any new technology: we sacrifice efficiency and optimization to first prove it feasible, and only afterwards do studies focus on striking a balance between efficiency and performance. Language models needed to be big and complex to prove their feasibility, which has now been demonstrated beyond doubt. Now is the time for studies on optimization and efficiency to emerge.
We can understand the approach of TinyStories at a very high level. We have some data and we train a model on that data, so there are two components to consider — the data and the model. In the case of language models, simple models can handle simple data but not complex data. So the option is to reduce the complexity of the data, and TinyStories does this by imitating the language used by young kids.
With TinyStories, building feasible SLMs is now possible, which will surely fuel a lot of research in this field, and we’ll see some incredible optimizations happening in language models.
✨ Footnote
Hey there, hope you liked the blog post. This was just a brief overview of TinyStories; the paper dives deep into the construction and evaluation of the dataset and the resulting models, and it’s well worth a full read.
Consider following me on Medium, Twitter and other platforms to read more about new AI developments and UI/UX concepts.