Why do we have weird LLM sizes like 3B, 7B, 13B, 33B, 70B? Who invented those?

The question of why we have odd-looking language model sizes like 3B, 7B and so on comes up regularly in the LocalLLaMA subreddit. The short answer is that a model's parameter count falls out of its architectural configuration: the number of transformer layers, the hidden dimension, the number of attention heads, the feed-forward width and the vocabulary size. Labs pick configurations that land near convenient round numbers, fit their training and inference hardware budgets, and give them a ladder of sizes to study scaling behaviour. The specific 7B/13B/33B/65B ladder (and later 70B) was popularized by Meta's LLaMA models, which in turn roughly follow the family of sizes laid out in OpenAI's GPT-3 paper (125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13B, 175B).
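To make that concrete, here is a rough back-of-the-envelope sketch in Python. The hyperparameters are the ones reported for the LLaMA-1 family, but the counting function is my own simplification (no bias terms, untied input/output embeddings, rotary position embeddings contributing no parameters), so treat the outputs as approximations rather than official figures:

```python
def llama_style_param_count(n_layers, d_model, d_ffn, vocab_size, tied_embeddings=False):
    """Rough parameter count for a LLaMA-style decoder-only transformer (no bias terms)."""
    attn = 4 * d_model * d_model                 # Wq, Wk, Wv, Wo projections
    ffn = 3 * d_model * d_ffn                    # SwiGLU feed-forward: gate, up, down
    norms = 2 * d_model                          # two RMSNorm weight vectors per block
    per_block = attn + ffn + norms
    embed = vocab_size * d_model                 # token embedding matrix
    head = 0 if tied_embeddings else vocab_size * d_model  # output (unembedding) matrix
    return n_layers * per_block + embed + head + d_model   # + final RMSNorm

# Reported LLaMA-1 configurations (layers, hidden dim, FFN dim), vocab size 32,000
for layers, dim, ffn_dim in [(32, 4096, 11008), (40, 5120, 13824),
                             (60, 6656, 17920), (80, 8192, 22016)]:
    n = llama_style_param_count(layers, dim, ffn_dim, 32000)
    print(f"{layers} layers x {dim} dim -> {n / 1e9:.1f}B parameters")
# -> roughly 6.7B, 13.0B, 32.5B, 65.3B
```

The headline sizes are just these totals rounded to a friendly number: 6.7B is marketed as "7B", and 32.5B appears as either "33B" or "30B" depending on who is doing the rounding.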

3B is not a GPT-2 size: the largest GPT-2 model released by OpenAI has 1.5 billion parameters, and the roughly-3B bracket traces more to the 2.7B entry in GPT-3's model family than to GPT-2. GPT-2 itself is a large transformer-based language model developed by OpenAI that obtained state-of-the-art results on many natural language processing tasks when it was released. It is a deep neural network whose parameters let it learn the statistics of words and phrases from large amounts of text data.

Nor is 7B the size of Google's BERT model: BERT is indeed from Google, but its largest standard version (BERT-Large) has about 340 million parameters. BERT is a pre-training approach for natural language processing (NLP) based on the transformer architecture (encoder-only, unlike GPT's decoder-only design), and it became widely adopted by researchers because it can be applied to a variety of NLP tasks with minimal task-specific fine-tuning. The 7B size that dominates local-model discussion instead comes from the smallest LLaMA model, which has roughly 6.7 billion parameters and uses the same depth and width (32 layers, 4096-dimensional hidden state) as GPT-3's 6.7B variant, rounded up in the name.

These are just a few of the model sizes currently available. In general, a model's parameter count grows with the data and compute budget used to train it, and scaling-law work such as the Chinchilla paper ties the two together, which is one reason model families are released at several sizes rather than just one. Architectures are also constantly being improved, leading to new sizes and capabilities; for instance, XLNet, developed by researchers at CMU and Google, surpassed BERT on a number of NLP benchmarks.
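As a small illustration of how the data budget interacts with model size, here is a sketch using the roughly-20-tokens-per-parameter heuristic popularized by the Chinchilla paper. The training token counts in the comments are the ones reported in the LLaMA paper; everything else is plain arithmetic:

```python
# Chinchilla-style heuristic: compute-optimal training uses ~20 tokens per parameter.
TOKENS_PER_PARAM = 20

for name, params, trained_on in [
    ("LLaMA-7B", 6.7e9, 1.0e12),    # paper reports ~1T training tokens
    ("LLaMA-65B", 65.2e9, 1.4e12),  # paper reports ~1.4T training tokens
]:
    optimal = params * TOKENS_PER_PARAM
    print(f"{name}: ~{optimal / 1e9:.0f}B tokens would be compute-optimal; "
          f"actually trained on {trained_on / 1e12:.1f}T")
```

LLaMA deliberately trains its smaller models well past the compute-optimal point so that they perform better at inference time, which is another reason sizes like 7B and 13B stuck around: they are cheap to run on consumer hardware.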
