Will any widely used LLM be pre-trained with abstract synthetic data before 2030?

For the purposes of this question, abstract synthetic data refers to data generated by an algorithm that itself can be stored in less than 100 MB, such as an algorithm that randomly generates programs and runs them.
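As an illustration only (not part of the resolution criteria), a generator in this spirit can be just a few kilobytes of code. The sketch below uses made-up helper names (`random_program`, `make_sample`) and samples tiny straight-line arithmetic programs, executes them, and emits (program, result) text pairs suitable for next-token-prediction pre-training:

```python
# Hypothetical sketch of an "abstract synthetic data" generator:
# sample tiny random programs, run them, and emit text samples.
# The generator's own source is far below the 100 MB bound.
import random

OPS = ["+", "-", "*"]

def random_program(rng, n_lines=4):
    """Build a tiny straight-line program over integer variables v0..vN."""
    lines = ["v0 = %d" % rng.randint(0, 9)]
    for i in range(1, n_lines):
        a = rng.randrange(i)          # reuse an earlier variable
        op = rng.choice(OPS)
        c = rng.randint(1, 9)
        lines.append("v%d = v%d %s %d" % (i, a, op, c))
    return "\n".join(lines), n_lines - 1

def make_sample(rng):
    src, last = random_program(rng)
    env = {}
    exec(src, {}, env)                # run the generated program
    return "%s\n# v%d == %d" % (src, last, env["v%d" % last])

rng = random.Random(0)
print(make_sample(rng))
```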

Motivation:

Neural network models can learn the same task through different methods, which require very different amounts of task-specific data:

  • Pre-training: 10^7-10^8 samples

  • Fine-tuning: 500-50000 samples

  • Few-shot learning: 5-10 samples

  1. Initially, most models were directly pre-trained on the required task, such as digit classification.

  2. Later, models were pre-trained on more general but still directly useful tasks, such as classifying images into thousands of classes via supervised learning, and then fine-tuned on the required task.

  3. Currently, models are pre-trained on seemingly less useful tasks, like next-token prediction, and then fine-tuned on more useful tasks, such as question answering. The final task can also be viewed as few-shot or zero-shot learned.

  4. In the future, models might be pre-trained on completely abstract tasks, such as predicting the initial state of a Turing machine from its output (a hypothetical sketch of such a task follows this list). This approach could let them learn tasks requiring longer context and deeper reasoning, while the data stays cheap to produce because no real-world data collection is needed. They could then learn about the real world through fine-tuning and/or few-shot learning.
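As a purely hypothetical sketch of point 4 (the task and helper names below, such as `random_machine` and `make_sample`, are illustrative and not specified by the question), one could sample a random machine and a random initial tape, run it for a fixed number of steps, and train the model to recover the initial tape from the final one:

```python
# Hypothetical sample generator for an abstract "invert the machine" task:
# input to the model = final tape, prediction target = initial tape.
import random

def random_machine(rng, n_states=3, symbols="01"):
    """Random transition table: (state, symbol) -> (write, move, next_state)."""
    table = {}
    for s in range(n_states):
        for c in symbols:
            table[(s, c)] = (rng.choice(symbols), rng.choice([-1, 1]),
                             rng.randrange(n_states))
    return table

def run(table, tape, steps=20):
    tape = list(tape)
    state, head = 0, 0
    for _ in range(steps):
        write, move, state = table[(state, tape[head])]
        tape[head] = write
        head = max(0, min(len(tape) - 1, head + move))  # keep the head on the tape
    return "".join(tape)

def make_sample(rng, tape_len=12):
    initial = "".join(rng.choice("01") for _ in range(tape_len))
    final = run(random_machine(rng), initial)
    return final, initial   # (model input, prediction target)

print(make_sample(random.Random(0)))
```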

State-of-the-art models for in-context tabular data prediction, such as TabPFN, are already trained on fully synthetic data.
