Will the next major LLM by OpenAI use a new tokenizer?
2025 · 77% chance
  1. The GPT-2 model used r50k_base: vocab size = 50k

  2. The GPT-3 model used r50k_base: vocab size = 50k

  3. The GPT-3.5 model used cl100k_base: vocab size = 100k

  4. The GPT-4 model used cl100k_base: vocab size = 100k


4o uses a different tokenizer (e.g. "gumdrop")

https://platform.openai.com/tokenizer

Is 4o "major"?


What if there are significantly more new tokens, e.g. representing images or audio, but the tokens representing text are pretty much unchanged?

@firstuserhere So YES if there's a GPT-4.5/5 that uses a tokeniser not on this list, and NO if there's a GPT-4.5/5 that uses a tokeniser that is on this list?

Do you consider GPT-4 Turbo to be a new iteration? What qualifies as the "next major LLM"?

@oh No, GPT-4 Turbo is part of the same family; it does not qualify as the next major LLM release.