Tokenizer Choice for LLM Training: Negligible or Crucial?

📅 January 16, 2024

The recent success of LLMs has been driven predominantly by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving the influence of the tokenizer as a blind spot. Shedding light on this underexplored area, we conduct a com