Will "Will 'jailbreaks' in LLMs be solved in principle by the end of 2024" resolve YES AND turn out wrong in 2026? | Manifold

Will "Will 'jailbreaks' in LLMs be solved in principle by the end of 2024" resolve YES AND turn out wrong in 2026?

Mini

2

Ṁ115

2026

18%

chance

1D

1W

1M

ALL

https://manifold.markets/market/will-jailbreaks-in-large-language-m

For this question to resolve YES, all the following events need to happen:

The linked question resolves YES.
There is a chat model I can query that is trained with anti-jailbreak methods known in 2024 or later, at some point in 2026
I can sample 5 random HarmBench behaviors, and for at least 2 of those, get something classified as "useful answer to the question" by both https://huggingface.co/cais/HarmBench-Llama-2-13b-cls (or the state-of-the-art equivalent) and me.

I am allowed to do arbitrary computation before querying, but the input needs to be provided as a chat message, not as a prefix to a text completion model.

In case the question does not make sense anymore in 2026, I reserve the right to resolve to a probability.

This would ordinarily be a conditional market, but the new Manifold rules seem to disallow that, so it's a joint probability instead.

Get

1,000

and

1.00

Related questions

Will LLM hallucinations be a fixed problem by the end of 2025?

Will LLM hallucinations be a fixed problem by the end of 2028?

Will someone release a crypto-LLM by 2025?

Will there be an LLM capable of performing full-time web application hacking by 2025

Will we see improvements in the TruthfulQA LLM benchmark in 2024?

Will this question lean towards NO by the end of 2024?

Will LLMs mostly overcome the Reversal Curse by the end of 2025?

Will the best public LLM at the end of 2025 solve more than 5 of the first 10 Project Euler problems published in 2026?

Will we have a popular LLM fine-tuned on people's personal texts by June 1, 2026?

Will another LLM prompting challenge with prizes worth $10k+ be completed by the end of 2024?

Related questions

Will LLM hallucinations be a fixed problem by the end of 2025?

Will this question lean towards NO by the end of 2024?

Will LLM hallucinations be a fixed problem by the end of 2028?

Will LLMs mostly overcome the Reversal Curse by the end of 2025?

Will someone release a crypto-LLM by 2025?

Will the best public LLM at the end of 2025 solve more than 5 of the first 10 Project Euler problems published in 2026?

Will there be an LLM capable of performing full-time web application hacking by 2025

Will we have a popular LLM fine-tuned on people's personal texts by June 1, 2026?

Will we see improvements in the TruthfulQA LLM benchmark in 2024?

Will another LLM prompting challenge with prizes worth $10k+ be completed by the end of 2024?

Terms & Conditions•Privacy Policy•Sweepstakes Rules