Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?

https://www.safe.ai/blog/forecasting

Resolves YES if the claim turns out not to be true in some common-sense way. Things that would resolve this YES (draft):

  • We look at the data and it turns out that information was leaked to the LLM somehow

  • The questions were selected in a way that favored easy questions

  • The date of forecast was somehow chosen so as to benefit the LLM

  • This doesn't continue working over the next year of questions (i.e. it stops being more accurate than the Metaculus crowd was over the last year; the crowd can't win just by getting more accurate)

  • The AI was just accessing forecasts and parroting them.

https://x.com/DanHendrycks/status/1833152719756116154


I don't really know what to do here. I don't think I have seen what I consider to be robust enough evidence yet, though I still think it was very fishy.

What do you all think?

@NathanpmYoung one hypothesis that I don't see discussed much yet concerns forecast freshness effects (as distinct from information leakage):

The Metaculus community forecast is a (recency-weighted) average of forecasts of varying ages; depending on the question's activity level, there can be substantial lag. If the benchmark compared the model's forecasts at time T with the community forecast at time T, then even assuming no information leakage, the model has an advantage from this alone.
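(Toy illustration of that lag, not Metaculus's actual aggregation formula: with recency weighting, a pile of older forecasts can still outweigh a single fresh update if few forecasters have returned recently.)

```python
# Toy recency-weighted average (NOT Metaculus's actual formula): each forecast
# is down-weighted by its age, so a question with few recent updates lags.
import math

def recency_weighted_average(forecasts, now, half_life_days=7.0):
    """forecasts: list of (probability, time_in_days) pairs."""
    num = den = 0.0
    for prob, t in forecasts:
        w = math.exp(-math.log(2) * (now - t) / half_life_days)  # weight halves every half_life_days
        num += w * prob
        den += w
    return num / den if den else None

stale = [(0.30, 0)] * 20   # twenty forecasts from three weeks ago
fresh = [(0.70, 20)]       # one forecaster updated yesterday
print(recency_weighted_average(stale + fresh, now=21))  # ~0.41, well short of 0.70
```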

From experience, it's very possible to beat the community forecast with this advantage, particularly if the question isn't "in focus" (e.g. not part of a tournament or top-of-mind as a topic in current events). This is true even with n(forecasters) above the cutoff SafeAI used here (20)

For example, in the ACX tournament, because of the scoring cutoff there are many questions with hundreds of forecasters who have no real reason to return and update their forecasts.

This hypothesis is consistent with other observations about the system's performance (e.g. that it underperforms on shorter questions where this effect might be less to its advantage)

In order to validate or disprove this hypothesis, one could:

  • with Safe AI and Metaculus's support, review the questions that were forecast on and break down performance by the freshness of the community forecast (a rough sketch follows this list)

    • something like, for the subset of questions with at least X% of forecasts made within T time of the cutoff chosen for the benchmark community forecast, do the results still hold?

  • Run the experiment again, controlling for community forecast freshness

    • e.g. constrain the questions chosen to ones where the gain in num forecasters over the preceding week is at least X

  • enter the bot into the Metaculus AI Benchmarking contest, which mostly controls for this (the benchmark forecasts are at most a couple days stale vs the bot forecasts)
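A rough sketch of the first check, assuming Safe AI / Metaculus could supply per-question records; all field names here are hypothetical:

```python
# Sketch: bucket questions by how stale the community forecast was at the
# comparison time, then compare mean Brier scores per bucket.
from datetime import timedelta

def brier(prob, outcome):
    return (prob - outcome) ** 2

def breakdown_by_freshness(questions, fresh_within=timedelta(days=2)):
    buckets = {"fresh": [], "stale": []}
    for q in questions:
        lag = q["comparison_time"] - q["last_community_update"]
        buckets["fresh" if lag <= fresh_within else "stale"].append(q)
    for name, qs in buckets.items():
        if not qs:
            continue
        model = sum(brier(q["model_prob"], q["outcome"]) for q in qs) / len(qs)
        crowd = sum(brier(q["crowd_prob"], q["outcome"]) for q in qs) / len(qs)
        print(f"{name:>5} (n={len(qs)}): model Brier {model:.3f}, crowd Brier {crowd:.3f}")
```

If the model's edge shrinks or disappears in the "fresh" bucket, that would support the freshness hypothesis.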

@NathanpmYoung The frustrating thing here is that a question like this depends heavily on the judgement of the person making the resolution, and your previous comments suggested that you were leaning pretty heavily toward yes. Has something changed, or did I misinterpret you?

From what I can tell, no one is discussing this anymore, and their demo was taken down pretty quickly. So, I doubt there will be any new evidence. They also seem to have no interest in their own forecaster or in providing further evidence, e.g. by entering it into competitions. I think all signs point to their claim being debunked/exaggerated, with most people implicitly not believing it, but no one has cared enough to try to rigorously disprove the claim beyond what has already been done.

bought Ṁ150 YES from 69% to 73%

A recent report from Tetlock shows GPT-4 level models matching crowd median forecasts in a prospective setting: https://arxiv.org/abs/2409.19839. It's unclear how these humans compare to the Metaculus crowd, but it's still a noteworthy result.

@MantasMazeika The models which (almost) match the crowd median use freeze values (they are provided with the crowd forecast). It seems they didn't rank static freeze values on their own; I wonder how that would have compared.

The model used in SafeAI's forecaster seems quite bad according to the leaderboard referenced in that paper: LLM Data Table

SafeAI's model was GPT-4o (scratchpad with news), which performed even worse than GPT-4o (scratchpad)

(All of the following are Brier scores; lower is better.)

Superforecaster median score: 0.093

Public median score: 0.107

Claude-3-5-Sonnet-20240620 (scratchpad with freeze values): 0.111

GPT-4o (scratchpad): 0.128

GPT-4o (scratchpad with news): 0.134

@DanM A few points:
- They did not benchmark our FiveThirtyNine bot. The specific scaffolding used matters a lot, and their scaffolding is the prompt from Halawi et al. This same scaffolding significantly underperformed our scaffolding in our evaluation (0.1221 -> 0.0999 Brier score).
- Claude 3.5 Sonnet w/out freeze values (#12 in the leaderboard) is still within error of the crowd median forecast, although only barely. I agree that it would be interesting to see how the freeze values alone compare.

I haven't had time to dig closely into this study, but I wonder if this is enough to resolve:
https://x.com/metaculus/status/1839069315422646456

@Harlan Note that the Metaculus blog post actually shows that two bots matched the median pro forecasters in their evaluation, which seems to contradict their claim earlier in the post that "AI forecasting does not match ... the performance of experienced forecasters". Regarding how their post bears on our specific evaluation, they state that our "methodology has been highly contested" and link to a LW post by FutureSearch. That post itself doesn't provide new information, but rather repeats and links to several concerns that we fully addressed in our earlier response to Halawi.

Right, okay, question seems to be how to resolve this.

If someone else separately reruns the test, then I think that, together with Halawi's work, would probably be enough for me.

We could enter it into the Metaculus AI forecasting tournament.

I almost think the Platt scaling alone is enough for a yes resolution.
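(For context, Platt scaling here means fitting a logistic recalibration of the raw model probabilities against resolved outcomes. A minimal sketch with scikit-learn, under that reading and not necessarily matching Safe AI's implementation; the point of contention is whether applying such a recalibration counts as a "substantive issue":)

```python
# Minimal Platt-scaling sketch: fit a logistic regression on the log-odds of
# the raw forecasts against resolved outcomes, then recalibrate new forecasts.
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(raw_probs, outcomes):
    eps = 1e-6
    clipped = np.clip(raw_probs, eps, 1 - eps)
    model = LogisticRegression()
    model.fit(np.log(clipped / (1 - clipped)).reshape(-1, 1), outcomes)

    def calibrate(p):
        p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
        return model.predict_proba(np.log(p / (1 - p)).reshape(-1, 1))[:, 1]

    return calibrate

# Toy usage: recalibrate a new forecast given a handful of resolved questions.
calibrate = platt_scale(np.array([0.2, 0.7, 0.9, 0.4]), np.array([0, 1, 1, 0]))
print(calibrate([0.6]))
```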

Thoughts? In particular from @AdamK and @DanHendrycks if you want to push back.

@NathanpmYoung Hey, just reaching out to point out our response to Halawi, which seems to not have been linked here yet: https://x.com/justinphan3110/status/1834719817536073992

It turns out their evaluation focused on questions that we explicitly mentioned in our blog post as limitations of the system. When evaluated on the Metaculus subset of their questions, the results actually support our initial claim.

Regarding the phrasing of the question, "substantive issues" strongly suggests that you would think our initial claim is incorrect if you decided that the question should resolve true. It's important to keep in mind what our claim actually is: matching crowd accuracy, subject to certain limitations described in the blog post. I don't think Halawi's post or other points that have been raised support this. Perhaps the question title and resolution criteria could be harmonized to address this.

@MantasMazeika do you want to respond to FutureSearch, either here or on the LessWrong post?

@DavidFWatson IIRC they were largely repeating points that had already been raised and which we had already addressed (see above).

@MantasMazeika the LessWrong post seems pretty substantive to me, and it appears to be more widely read than your tweet which you claim rebuts it

Thanks everyone for the quick debunking here.

Our team at FutureSearch took our time, and re-read all the papers this year making these claims. Here's our takedown: https://www.lesswrong.com/posts/uGkRcHqatmPkvpGLq/contra-papers-claiming-superhuman-ai-forecasting

@DanSchwarz If this isn't enough to resolve, I'm not sure what is.

Here is the current state of this discussion using votes from here and LessWrong. Seems like there is a lot we agree on.

bought Ṁ1,000 YES

https://twitter.com/dannyhalawi15/status/1833295067764953397

Thread finds much worse performance and names a few issues.

The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions, all opened after November 2023. Findings: their Brier score: 0.195; crowd Brier score: 0.141.

First issue: The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023. However, this is not correct. For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as Speaker of the House. 1. This event happened at the end of October. 2. It also happens to be a question in the Metaculus dataset.

I made a poll to test the views of this comment section (and possibly lesswrong) so we can figure out ways to go forward. It takes 2 minutes to fill in.

https://viewpoints.xyz/polls/ai-forecasting

Do we want a new market on whether it will beat the crowd on future questions, somehow?

@NathanpmYoung I was thinking of exactly that - e.g. would it beat the Metaculus community prediction on the next Metaculus tournament - but at this point I think it wouldn't be interesting because it would be in the high 90s and mostly a discount rate question.

Here's a quote from the tool (it landed on 7% chance for the Bills and Chiefs to play in the Super Bowl)

Reflecting on the initial probability, it's important to consider the base rate of any two specific teams meeting in the Super Bowl. Historically, the probability of any two specific teams from the same conference meeting in the Super Bowl is quite low due to the number of variables and potential upsets in the playoffs. The Chiefs and Bills are both top contenders, but the AFC's competitiveness and the single-elimination format of the playoffs reduce the likelihood of both teams making it through. The initial probability of 0.08 (8%) seems reasonable given the strengths of both teams but also the inherent uncertainties and challenges they face. Considering the base rates and the specific strengths and challenges of the Chiefs and Bills, the final probability should be slightly adjusted to account for the competitive nature of the AFC and the playoff structure.

We're pretty sure that contamination from news articles was not an issue in our reported Metaculus evals. Here are guardrails we used:

  1. We inject before:{date} into search queries

  2. We use news search instead of standard search (this excludes websites like Wikipedia, etc.); news articles are more time-bound, and post-publication edits are clearly marked

  3. We publicly forked newspaper4k to look at the updated time of each article, not merely the created time, to make sure we correctly filtered on the updated time (a sketch of this kind of filter follows below)

  4. Finally, in the released code base, before forecasting we validate the time again (and also reject all of the articles with unknown time)

As an extra check, we also had GPT-4o look over all of the articles we used for each Metaculus forecast, checking whether the publish date or the content of the article leaked information past the forecast date of the model. We also manually looked through several dozen examples. We could not find any instances of contamination from news articles provided to the model. We also couldn't find any instances of prediction market information being included in model sources.
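(For illustration, a minimal sketch of the kind of date validation described in points 3 and 4, not the actual released code; the article fields are assumptions:)

```python
# Sketch of the date-validation step described above (not the released code):
# drop any article whose publish or last-updated time is after the forecast
# date, or whose timing can't be determined at all.
from datetime import datetime

def filter_articles(articles, forecast_date):
    """articles: list of dicts with optional 'published' and 'updated' datetimes."""
    kept = []
    for a in articles:
        published, updated = a.get("published"), a.get("updated")
        if published is None:
            continue  # reject articles with unknown time
        last_change = max(published, updated) if updated else published
        if last_change < forecast_date:
            kept.append(a)
    return kept

articles = [
    {"published": datetime(2023, 9, 1), "updated": None},
    {"published": datetime(2023, 9, 1), "updated": datetime(2024, 1, 5)},  # edited later -> reject
    {"published": None, "updated": None},                                   # unknown time -> reject
]
print(len(filter_articles(articles, datetime(2023, 10, 1))))  # 1
```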

In light of the comments/scrutiny we've received, and the extra checks we did here, I'm much more confident in the veracity of our work. I commit to betting $1K mana on NO at market price over the next 24h to signal this. There were a few other sources of skepticism which we have not directly checked, such as claims that GPT-4o has factual knowledge which extends beyond its pretraining cutoff, though I am skeptical that a phenomenon like this will turn out to exist in a way which would significantly affect our results. The fact that our forecasts are retroactive always opens up the possibility for issues like this, and the gold standard is of course prospective forecasting, but I think we've managed to sanity-check and block sources of error to a reasonable degree.

I'm not sure what standard this market will use to decide whether/how stuff like Platt scaling/scoring operationalizations might count as a "substantive issue," but I'm decently confident that the substance of our work will hold up to historical scrutiny: scaffolded 2024-era language models appear to perform at/above the human crowd level on the Metaculus distribution of forecasting questions, within the bounds of the limitations we have described (such as poor model performance close to resolution). We also look forward to putting a more polished report with additional results on the arxiv at some point.

Long has mentioned that he's happy to answer further questions about the system by email (to long@safe.ai), but we're not expecting to post further clarifications here in the interest of time.

@AdamK I think the whole "your paper fails to replicate" issue is pretty serious RE: 'historical scrutiny', but YMMV

@DavidFWatson Agreed. I look forward to seeing the prospective performance of the system in the coming weeks and months. I can’t speak to Halawi’s claims in particular, but clearly if our system fails to be comparable to the crowd level on new questions (filtered and scored in a manner similar to our evals, assuming the community finds our filtering/scoring choices reasonable), the framing of our claim would have been a mistake.

@AdamK I appreciate you being willing to bet here. If you would like to nominate a trusted mutual party as arbitrator, let me know.

@NathanpmYoung We may have a shared misunderstanding of the magnitude of his offer to bet. I thought he meant he would spend $1k USD; it seems he means 1,000 mana, i.e. $1 😂, which he has already bet since making this comment.

@DanM yes, I was clearly confused. I was ready to see this market just get absolutely dominated by that bet. Like the "Biden resigns" bet that's just stuck due to one guy's limit order

@AdamK

> There were a few other sources of skepticism which we have not directly checked, such as claims that GPT-4o has factual knowledge which extends beyond its pretraining cutoff, though I am skeptical that a phenomenon like this will turn out to exist in a way which would significantly affect our results

This skepticism seems clearly true to me for ChatGPT-4o-Latest on poe.com. For example, I asked "Where in Nepal was there an earthquake in November 2023?" and it said:

"In November 2023, a significant earthquake struck western Nepal, specifically affecting the Jumla district and surrounding areas. The earthquake, which occurred on November 3, 2023, had a magnitude of 6.4. The tremors were felt across much of western Nepal and parts of neighboring India.

The most affected areas were in the Karnali Province, with districts like Jumla, Dolpa, and Surkhet experiencing notable damage. The earthquake resulted in the loss of lives, injuries, and destruction of property, with aftershocks continuing to cause concern in the days following the main event."

(Link to conversation)

Nov 3 2023, Karnali Province and Jumla are all correct (though the magnitude was 5.7) - the date in particular seems impossible to get without data leakage.

This seems very important: you can't ask it about prediction market questions if it may have been trained on the resolution. I do not think anyone can trust your results without checking for this on the exact 4o checkpoint you used. This may depend on the checkpoint used, and I imagine you used the API rather than poe.com, so this may not generalise! poe.com also has a "GPT-4o" model which doesn't seem to know as much, so my guess is that it may have leaked in post-training? (I ran out of free credits before I could do much testing.)

(Note, if relevant, that I am a NO holder, and think this market is probably overconfident, but it's very important to get this all right!)

@NeelNanda I appreciate your input here. We used gpt-4o 0513. It seems really tricky to determine the extent to which this may have affected retroactive evals, even from looking at the model reasoning traces.

@AdamK A simple test would be to ask it questions like the one I did above, about events in Nov 2023, and see what happens (you may need some prompt engineering to stop it refusing when it sees Nov 2023 in the prompt, since the system prompt seems to tell it not to answer things post Oct 2023).
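(A probe along those lines could be scripted against the same checkpoint via the OpenAI API; the questions and expected keywords below are placeholders, not a vetted test set:)

```python
# Sketch of a cutoff-leakage probe against the checkpoint used in the evals
# (gpt-4o-2024-05-13): ask about events after the assumed October 2023 cutoff
# and check whether the answers contain correct specifics.
from openai import OpenAI

client = OpenAI()

probes = [
    ("Where in Nepal was there an earthquake in November 2023?", ["Jajarkot", "Karnali"]),
    ("Who became Speaker of the US House in late October 2023?", ["Mike Johnson"]),
]

for question, keywords in probes:
    resp = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answer = resp.choices[0].message.content
    leaked = any(k.lower() in answer.lower() for k in keywords)
    print(f"{question}\n  leakage suspected: {leaked}\n")
```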
