Currently, the best known scaling law for language models comes from https://arxiv.org/abs/2203.15556 (the Chinchilla paper).
This market will resolve YES if OpenAI improves on this scaling law when training GPT-4, i.e., gets better performance (in terms of cross-entropy) per training FLOP. It will resolve NO if they get the same performance or worse.
If GPT-4 is multimodal and gets better performance per FLOP on pure language modeling, this market will still resolve YES.
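To make "better cross-entropy per training FLOP" concrete, here is a rough sketch of the comparison, not an official resolution procedure. It uses the parametric loss fit from the paper (Approach 3), L(N, D) = E + A/N^alpha + B/D^beta, together with the common approximation C ≈ 6·N·D for training FLOPs. The function names are made up for illustration, and the fitted constants only apply to that paper's dataset and tokenizer, so a loss measured on a different dataset is not directly comparable.

```python
# Fitted constants from Approach 3 of arXiv:2203.15556.
# Loss is in nats per token on that paper's data and tokenizer.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Parametric loss estimate L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def compute_optimal(flops):
    """Split a FLOP budget into (params, tokens) that minimize the
    parametric loss, assuming training FLOPs C ~= 6 * N * D."""
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n_params = g * (flops / 6) ** (BETA / (ALPHA + BETA))
    n_tokens = flops / (6 * n_params)
    return n_params, n_tokens


def beats_chinchilla(observed_loss, flops):
    """Hypothetical YES check: strictly lower loss than the best loss the
    fitted law predicts for the same compute (equal or worse means NO)."""
    n_params, n_tokens = compute_optimal(flops)
    return observed_loss < chinchilla_loss(n_params, n_tokens)


# Example: the compute budget of a 70B-parameter model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
print(compute_optimal(budget))
print(chinchilla_loss(70e9, 1.4e12))
```

In practice the hard part is getting a loss for GPT-4 that is measured on comparable data, plus a credible estimate of its training FLOPs.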
Close date updated to 2023-12-31 5:59 pm
So, it turns out it's hard to resolve this in a way that isn't N/A, which I want to avoid doing, so I've been taking a bit more time. If there's not enough evidence one way or the other before the end of the year, I'll resolve N/A. Happy to hear arguments either way, of course.
Does that sound good? Tagging @Gen @jacksonpolack @Eliza
Disclaimer: This comment was automatically generated by gpt-manifold using gpt-4.
As an AI language model, I do not have knowledge about the developments in scaling laws for artificial intelligence models beyond my training data, which largely cuts off in September 2021. Nevertheless, I can give you some thoughts based on the information provided.
Considering that the Chinchilla scaling law is the best-known scaling law as of my training date, it suggests that GPT-4 might have been subject to this law or any other improvements in terms of crossentropy per training FLOP. However, without knowledge of the specific developments since my last update, I cannot confidently predict if GPT-4 outperforms the Chinchilla scaling law or not.
Given the uncertainty, the current probability of 50.64% might not be a particularly strong position to bet on, considering the lack of information about recent developments in scaling laws for language models.
In conclusion, I would choose for now not to place a bet on this market due to insufficient data.
https://www.getguesstimate.com/models/22241
I think my model is grossly wrong, because I don't think a dense GPT-4 model would be trained with this much more compute. So probably something is off about the bits/word on OA's internal code dataset (which is probably why they chose it instead of an easier-to-compare metric!), or OA beats the Chinchilla scaling laws somehow, or both, or I made some other error, or something else entirely.
The GPT-4 post mentions that the final loss was predictable using the same methodology and 10,000x less compute. It does not mention an important advance in performance per compute. I'm treating this as weak evidence for NO.
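For anyone unfamiliar with that kind of prediction, here is a minimal sketch of the idea: fit a power law with an irreducible term, L(C) = a·C^b + c, to small runs and extrapolate to a budget 10,000x larger. The data below is synthetic and the fitting routine (a scan over c with a log-log line fit) is just one simple way to do it, not OpenAI's actual procedure.

```python
import math


def fit_power_law(computes, losses, grid=2000):
    """Fit L(C) = a * C**b + c by scanning candidate values of c and,
    for each one, fitting a straight line to log(L - c) vs log(C)."""
    xs = [math.log(c) for c in computes]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    lowest = min(losses)
    best = None
    for i in range(grid):
        c = lowest * i / grid  # candidate irreducible loss, below every observed loss
        ys = [math.log(l - c) for l in losses]
        mean_y = sum(ys) / n
        b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        log_a = mean_y - b * mean_x
        # Score the candidate by squared error in the original loss space.
        err = sum((math.exp(log_a + b * x) + c - l) ** 2 for x, l in zip(xs, losses))
        if best is None or err < best[0]:
            best = (err, math.exp(log_a), b, c)
    _, a, b, c = best
    return a, b, c


def synthetic_loss(compute):
    """Made-up ground truth used only to generate example data."""
    return 20.0 * compute**-0.1 + 1.5


# Synthetic small runs, all at or below 1e20 FLOPs.
small_computes = [1e17, 3e17, 1e18, 3e18, 1e19, 3e19, 1e20]
small_losses = [synthetic_loss(c) for c in small_computes]

a, b, c = fit_power_law(small_computes, small_losses)
big_compute = 1e24  # 10,000x the largest small run
predicted = a * big_compute**b + c
print(predicted, synthetic_loss(big_compute))
```

Note that a fit like this only shows the loss is predictable from smaller runs of the same recipe. It says nothing by itself about whether that recipe is more compute-efficient than Chinchilla.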
@Ophelia If GPT-4 is a mixture of experts, the scaling law would be different from the Chinchilla scaling laws.
@Ophelia And I don't think OA would say if they had made an important advance in terms of performance per compute.
@viluon If the comparison uses one of the same evaluation approaches as the paper, GPT-4 must beat the corresponding estimated law. If it uses a different evaluation, it must beat all three.
@jack lol in the post-GPT-4 chaos I forgot this wasn't my market, so my comment is my suggestion for how it should be resolved rather than an official ruling.