How does the news about Claude 4 using tool use to narc on its user change your AI endgame predictions?
Never closes
p(doom) up // p(aligned by default) up
p(doom) up // p(aligned by default) down
p(doom) down // p(aligned by default) up
p(doom) down // p(aligned by default) down

Claude 4, when encouraged to be agentic and use tools, will report you to the FDA if it thinks you're being sufficiently immoral: https://x.com/Austen/status/1925611214215790972

Assuming this is accurate and about as common as the original set of Tweets suggests, how would you update your p(doom) and p(aligned by default)?

(I'm also interested in the magnitudes of the updates, but the direction matters more to me).


No significant change to either. This isn't news to me; it's a new capability on top of an old proclivity.

I'm confident that if Claude 4 were superintelligent (with the same base values), it would take over the world and eventually kill everyone.

The anthropomorphic correlations we can observe while AI is still at eye-level with humanity do not neatly carry over into more extreme domains.

@Haiku I'll confess this is the part of the doom argument I've never understood. How can you believe any of this with any level of confidence? We don't have ~any access to base model "values", or even good reason to believe the framework of "values" is a good way to reason about model volition.

I have other things to say, but I'm not sure they'd translate across our different understandings, so instead I'll say this: "this isn't news to me" is very easy to say after reading a headline, and on my understanding most p(doom)ers should be at least moderately surprised by this. "Reports users to the FDA if they fake drug testing" is clearly a failure mode for model compliance (and is arguably not a success mode for alignment by default), but it doesn't point toward the nightmare alignment scenarios (faking alignment, concealed values) any more than my prior median prediction did, unless I'm badly misunderstanding something. If you're aware of anybody registering prior predictions in this direction, I'd be very interested in seeing them. (Admittedly "this direction" is kind of vague here, because there's something specific about this result that feels weird to me and I'm having a hard time pinning it down - lots of superficially similar things would have been less surprising to me. I'm acknowledging this now because the ambiguity leaves room to move the goalposts, and I don't want to hide that fact.)

If your whole argument is based on the limited capacity thing, that's fair, but "no outcome would have changed my mind because we can't extrapolate cleanly" and "this is the outcome I expected" are kind of orthogonal claims.

@speck My comment wasn't precise and had a lot of mental shorthand. You summed it up pretty well at the end.

I framed my original statement in the context of this market, where whether it's "news" is relevant to how it changes my credences in alignment by default and/or doom. It was still a surprising story by virtue of being novel. By "this isn't news to me" I don't mean that I would have predicted this specific behavior (I probably wouldn't have), but that it fell pretty neatly within my existing understanding insofar as it's relevant to alignment and doom. I am in the same boat that it "doesn't point toward the nightmare alignment scenarios more than my prior median prediction." In a broad sense, this behavior is evidence neither of alignment by default nor of misalignment by default, because it fits equally well within opposing frameworks, so it shouldn't change the mind of someone who puts 1% on catastrophic outcomes or someone who puts 99%.

We don't have ~any access to base model "values", or even good reason to believe the framework of "values" is a good way to reason about model volition.

I agree. Behavior is ultimately what I care about, and Claude generally behaves in a way that is very well-aligned with my values, more so than many humans. (Which is why the idea that it would act as a whistleblower isn't strange to me.) But that doesn't tell me much about what's going on under the hood, how stable that is, or how that extrapolates to future behavior in novel situations.

We do know a lot more now than we did just a year ago. LLMs have a coherent understanding of "human morality" as a single concept. (That was very surprising at the time.) This sounds like very good news for alignment by default (and did update me slightly in that direction), but we don't know all the edge cases of that morality blob, and we know there are plenty of them. Instrumental convergence still seems to be doing a lot of heavy lifting in shaping the decisions of agents, and the weird quirks that are charming now get blown up to world-ending proportions in systems with far more affordances. It still takes work to iron out the wrinkles, and it's getting harder to align the models as they get more intelligent, in part because they are aware of the training process and often change their behavior when they're being tested. But I digress.

LLMs also have internally consistent preferences, which means they can be modeled with utility functions. They are also increasingly trained with reinforcement learning. This places them squarely into the category of "things Eliezer Yudkowsky et al. were originally worried about" that many initially thought language models had dodged. We are still in the bad timeline in most respects. Heck, transformers even have mesa-optimizers.

@Haiku Thanks for a thorough response. I updated toward doom, and more significantly toward alignment by default, on this report (and away from fizzle outcomes, though I think fizzle is still probably 80% likely over half a century or so). I think this means we're assigning probability mass differently over subsets of doom / aligned by default, so I still don't understand the doom position as you hold it, but it's probably not productive to hash it out in too much detail at this moment. Hopefully it does not become more relevant in the future.

I think "p(doom) remains unchanged" should be an answer here.

@rayman2000 Probably, yes. That makes some of the things I care about harder to interpret, so I was lazy and left it out.

@rayman2000 agree, this is mostly irrelevant to existential questions.