Will Llama 3-multimodal be natively mixed-multimodal? (VQ-VAE+next token prediction)
50% chance
Vision-language models currently follow two common paradigms.
The first is the LLaVA approach, where a CLIP-like vision encoder is attached to an LLM through a projection layer.
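A minimal sketch of that first approach, with hypothetical module names and dimensions (CLIP ViT-L/14 patch features of width 1024 projected into a 4096-wide LLM embedding space); placeholder tensors stand in for the real encoder and tokenizer:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.5 uses a small MLP projector; a single linear layer also works as a sketch.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds):              # (batch, num_patches, vision_dim)
        return self.proj(patch_embeds)            # (batch, num_patches, llm_dim)

# Placeholder tensors stand in for the real vision encoder and text embedder.
patch_embeds = torch.randn(1, 576, 1024)          # CLIP ViT-L/14 @ 336px -> 24x24 = 576 patches
text_embeds = torch.randn(1, 32, 4096)            # embedded text prompt tokens
visual_tokens = VisionProjector()(patch_embeds)   # map patches into the LLM embedding space
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LLM as a visual prefix
```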
The second is the Gemini/LVM approach, where a VQ-VAE compresses images into discrete tokens and the model is then trained with plain autoregressive next-token prediction. It is suspected that GPT-4o is also trained this way, which would explain why it can generate images with excellent text rendering.
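And a minimal sketch of the second approach, with hypothetical vocabulary sizes, codebook, and grid shapes: the VQ-VAE codebook ids are offset past the text vocabulary so one decoder-only transformer can do ordinary next-token prediction over an interleaved text-and-image sequence.

```python
import torch

TEXT_VOCAB = 32_000        # ordinary text token ids: 0 .. 31_999 (hypothetical)
IMAGE_CODEBOOK = 8_192     # VQ-VAE codebook size (hypothetical)
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK

def quantize(latents, codebook):
    """Map continuous VQ-VAE encoder latents to nearest-codebook-entry ids."""
    # latents: (batch, h*w, d); codebook: (K, d)
    dists = torch.cdist(latents, codebook.unsqueeze(0))   # (batch, h*w, K)
    return dists.argmin(dim=-1)                           # (batch, h*w) discrete image ids

# Placeholder tensors stand in for a trained VQ-VAE encoder and its codebook.
codebook = torch.randn(IMAGE_CODEBOOK, 256)
latents = torch.randn(1, 32 * 32, 256)                    # 32x32 latent grid per image
image_ids = quantize(latents, codebook) + TEXT_VOCAB      # shift into the shared vocabulary
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))          # tokenized caption

# A single decoder-only transformer is trained on the interleaved sequence with
# the usual shifted cross-entropy loss, treating image tokens like any other token.
seq = torch.cat([text_ids, image_ids], dim=1)
inputs, targets = seq[:, :-1], seq[:, 1:]
# logits = transformer(inputs)                            # (1, seq_len - 1, VOCAB); model omitted
# loss = torch.nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
```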
Note that Meta has just announced Chameleon: Mixed-Modal Early-Fusion Foundation Models.
Will Llama 3 multimodal (or Llama 3 vision) be trained with the second approach?
Related questions
Top 3 Multimodal Vision2Language Model by EOY 2024? (by Organization/Company)
Will OpenAI's next major LLM release support video input? (55% chance)
Will Meta release a Llama 3 405B multi-modal open source before the end of 2024? (8% chance)
Will Llama 4 use mixture of experts? (63% chance)
By 2030 will we have video-to-video where an LLM can continue any video prompt in any way you like? (76% chance)
Will a Mamba 7b model trained on 2 trillion tokens outperform Llama2-13B? (66% chance)
Will OpenAI announce a multi-modal AI capable of any input-output modality combination by end of 2025? ($1000M subsidy) (85% chance)
Will Llama 4 be the best LLM in the chatbot arena? (16% chance)
Will a SOTA open-sourced LLM forecasting system make major use of quasilinguistic neural reps (QNRs) before 2027? (19% chance)
Will Llama-3 (or next open Meta model) be obviously good in its first-order effects on the world? (88% chance)