Did you know that only 4% of data is publicly available to the entire LLM market?

Jun 11, 2024

We estimate the stock of human-generated public text at around 300 trillion tokens. If trends continue, language models will fully utilize this stock between 2026 and 2032, or even earlier if models are intensely overtrained.
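To make the arithmetic behind such a projection concrete, here is a minimal sketch in Python. It assumes simple exponential growth in training-dataset size, which is a simplification and not necessarily the method behind the estimate above. Only the 300 trillion token stock comes from the text; `START_DATASET_TOKENS` and `ANNUAL_GROWTH` are hypothetical placeholder values chosen purely for illustration.

```python
import math


def years_until_exhaustion(stock_tokens: float,
                           dataset_tokens: float,
                           annual_growth: float) -> float:
    """Years until a dataset that grows by a constant factor each year
    reaches the total stock of available tokens.

    Solves dataset_tokens * annual_growth**t = stock_tokens for t.
    """
    if stock_tokens <= 0 or dataset_tokens <= 0:
        raise ValueError("token counts must be positive")
    if annual_growth <= 1.0:
        raise ValueError("annual_growth must exceed 1.0 for the dataset to grow")
    if dataset_tokens >= stock_tokens:
        return 0.0  # the stock is already fully used
    return math.log(stock_tokens / dataset_tokens) / math.log(annual_growth)


# Stock of human-generated public text, as stated in the text.
STOCK_TOKENS = 300e12

# HYPOTHETICAL placeholders, not figures from the source:
START_DATASET_TOKENS = 10e12  # assumed size of the largest training set today
ANNUAL_GROWTH = 2.0           # assumed yearly growth factor of dataset size

years = years_until_exhaustion(STOCK_TOKENS, START_DATASET_TOKENS, ANNUAL_GROWTH)
print(f"Stock reached after about {years:.1f} years under these assumptions")
```

Because the result depends on the logarithm of the ratio between stock and current dataset size, a faster growth factor or a larger starting dataset pulls the exhaustion date forward, which is consistent with the note that overtraining could bring it earlier.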