Will the next major LLM by OpenAI use a new tokenizer?
2025 · 77% chance
  1. The GPT-2 model used r50k_base: vocab size = 50k

  2. The GPT-3 model used r50k_base: vocab size = 50k

  3. The GPT-3.5 model used cl100k_base: vocab size = 100k

  4. The GPT-4 model used cl100k_base: vocab size = 100k


4o uses a different tokenizer (e.g. "gumdrop")

https://platform.openai.com/tokenizer

Is 4o "major"?


What if there are significantly more new tokens, e.g. representing images or audio, but the tokens representing text are pretty much unchanged?

@firstuserhere So YES if there's a GPT-4.5/5 that uses a tokeniser not on this list, and NO if there's a GPT-4.5/5 that uses a tokeniser that is on this list?

Do you consider GPT-4 Turbo to be a new iteration? What qualifies as the "next major LLM"?

@oh No, GPT-4 Turbo is part of the same family; it does not qualify as the next major LLM release.