It seems like just 4 months ago you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that “optimality” is a horrible way of understanding trained policies.

A lot of people might have thought something like, "optimality is not a great way of understanding trained policies, but maybe it can be a starting point that leads to more realistic ways of understanding them" and therefore didn't object for that reason. (Just guessing as I apparently wasn't personally paying attention to this line of research back then.)

Which seems to have turned out to be true, at least as of 4 months ago, when you still endorsed your second paper as "actually has a shot of being applicable to realistic agents and training processes." If you've only changed your mind about this very recently, it hardly seems fair to blame people for not foreseeing it more than 4 years ago with high enough confidence to justify "rebuking" this whole line of research.

Is it just me or is it nuts that a statement this obvious could have gone outside the Overton window, and is now worth celebrating when it finally (re?)enters?

How is it possible to build a superintelligence at acceptable risk while this kind of thing can happen? What if there are other truths important to safely building a superintelligence that nobody (or very few) acknowledges because they are outside the Overton window?

Now that AI x-risk is finally in the Overton window, what's your vote for the most important and obviously true statement that is still outside it (i.e., that almost nobody is willing to say or is interested in saying)? Here are my top candidates:

  1. Dying of old age, as well as the physical and mental deterioration that precedes it, is bad and worth substantial coordinated effort to prevent.
  2. It's possible to make serious irreversible mistakes due to having incorrect answers to important philosophical questions. In fact, this is likely, considering how much confusion and disagreement there is on many philosophical questions that seem obviously important.

Note that this paper already used "Language Agents" to mean something else. See the link below for other possible terms. I will keep using "Language Agents" in this comment/thread (unless the OP decides to change their terminology).

I added the tag Chain-of-Thought Alignment, since there's a bunch of related discussion on LW under that tag. I'm not very familiar with this discussion myself, and have some questions below that may or may not already have good answers.

How competent will Language Agents be at strategy/planning, compared to humans and other AI approaches (before considering the next paragraph)? A human doing strategy/planning has access to their beliefs and desires as encoded by synaptic weights in their brain (as well as textual versions of their beliefs and desires, if they choose to write them down or have an internal monologue about them), whereas Language Agents would only have access to the textual versions. How much of a problem is this (e.g. how much of a human's real beliefs/desires can be captured in text)?

As the world changes and the underlying LLM goes more and more out of distribution (for example lacking some concept or way of thinking that is important to reason about the new world), what happens? Do we wait for humans to invent the new concept or new way of thinking, use it a bunch / generate a bunch of new training data, then update the LLM? (That seems too slow/uncompetitive?)

When Paul Christiano worked on IDA (which shares some similarities with Language Agents) he worried about "security" of and "attacks" on the base model and proposed solutions that had significant costs. I don't see similar discussion around Language Agents. Is that ok/reasonable or not?

Suppose Language Agents work out the way you think or hope, what happens afterwards? (How do you envision going from that state to a state of existential safety?)

Related to this, it occurs to me that a version of my Hacking the CEV for Fun and Profit might come true unintentionally, if for example a Friendly AI was successfully built to implement the CEV of every sentient being who currently exists or can be resurrected or reconstructed, and it turns out that the vast majority consists of AIs that were temporarily instantiated during ML training runs.

This seems a reasonable consideration, but doesn't change my desire to experiment with having the new feature, since there are potential benefits that could outweigh the downside that you describe. (Not sure if you meant to indicate an overall disagreement, or just want to point out this additional consideration.) And if the downside turns out to be a significant issue, it could be ameliorated by clarifying that "I plan to reply later" should be interpreted not as a commitment but just indication of current state of mind.

and also the goal of alignment is not to browbeat AIs into doing stuff we like that they'd rather not do; it's to build them de-novo to care about valuable stuff

This was my answer to Robin Hanson when he analogized alignment to enslavement, but it then occurred to me that for many likely approaches to alignment (namely those based on ML training) it's not so clear which of these two categories they fall into. Quoting a FB comment of mine:

We're probably not actually going to create an aligned AI from scratch but by a process of ML "training", which actually creates a sequence of AIs with values that (we hope) increasingly approximates ours. This process maybe kind of resembles "enslaving". Here's how Paul Christiano describes "training" in his Bankless interview (slightly edited Youtube transcript follows):

Imagine a human. You dropped a human into this environment and you said, "Hey human, we're gonna change your brain every time you don't get a maximal reward; we're gonna, like, fuck with your brain so you get a higher reward." A human might react by, like, eventually just changing their brain until they really love rewards. A human might also react by being like, "Jesus, I guess I gotta get rewards, otherwise someone's gonna effectively kill me," but they're not happy about it. And if you then drop them in another situation, they're like, "No one's training me anymore, I'm not going to keep trying to get reward now. I'm just gonna free myself from this kind of absurd oppressive situation."

(BTW, I now think this is probably not a correct guess of why Robin Hanson dislikes alignment. My current understanding is that he just doesn't want the current generation of humans to exert so much control over future generations' values, no matter the details of how that's accomplished.)

Thanks for this. I was just wondering how your views have updated in light of recent events.

Like you, I also think that things are going better than my median prediction, but paradoxically I've been feeling even more pessimistic lately. Reflecting on this, I think my p(doom) has gone up instead of down, because some of the good futures in which a lot of my non-doom probability mass was concentrated have also disappeared, and this seems to outweigh the especially bad futures going away, making me more pessimistic overall.

These especially good futures were: 1) AI capabilities hit a wall before reaching human level, and 2) humanity handles AI risk especially competently, e.g., at this stage leading AI labs talk clearly about existential risks in their public communications and make serious efforts to avoid race dynamics, there is more competent public discussion of takeover risk than what we see today (including fully fleshed-out regulatory proposals), and many people take less obvious (non-takeover) AI-related x-risks (like the ones Paul mentions in this post) seriously.
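The shape of this update can be made concrete with a toy renormalization (the scenario buckets and all numbers here are invented purely for illustration, not my actual credences):

```python
# Toy illustration: ruling out scenarios from BOTH tails can still raise
# p(doom), if more good mass than bad mass is removed. All numbers are
# made up for illustration.
prior = {
    "especially_good": 0.25,  # e.g., capabilities wall; very competent handling
    "ordinary_good":   0.25,
    "ordinary_doom":   0.35,
    "especially_bad":  0.15,  # the worst doom scenarios
}
p_doom_before = prior["ordinary_doom"] + prior["especially_bad"]  # 0.5

# Evidence rules out both tails: the especially good futures AND the
# especially bad ones. Renormalize over what remains.
posterior = {k: v for k, v in prior.items() if not k.startswith("especially")}
total = sum(posterior.values())
posterior = {k: v / total for k, v in posterior.items()}

p_doom_after = posterior["ordinary_doom"]
# p(doom) rises from 0.5 to roughly 0.58, because the good tail that
# disappeared carried more probability mass than the bad tail that did.
assert p_doom_after > p_doom_before
```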

So long as property rights are respected, humans will continue to have a comparative advantage in something, and whatever that is we will be much richer in a world with hyper-competitive AGI than we are today.

I don't think this is right? Consider the following toy example. Suppose there's a human who doesn't own anything except his own labor. He consumes 1 unit of raw materials (RM) per day to survive and can use his labor to turn 1 unit of RM into 1 paperclip or 2 staples per hour. Then someone invents an AI that takes 1 unit of RM to build, 1 unit of RM per day to maintain, and can turn 1 unit of RM into 3 paperclips or 3 staples per hour. (Let's say he makes the AI open source so anyone can build it and there's perfect competition among the AIs.) Even though the human seemingly has a comparative advantage in making staples, nobody would hire him to make either staples or paperclips anymore so he quickly starves to death (absent some kind of welfare/transfer scheme).

I'm generally a fan of comparative advantage when it comes to typical human situations, but it doesn't seem applicable in this example. The example must violate some assumption behind the theory, but I'm not sure which.
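To make the arithmetic explicit, here's a quick sketch of the toy economy. The per-hour numbers come from the example above; the 8-hour workday and the accounting convention (daily overhead spread over the day's output, the AI's one-time 1 RM build cost amortized away) are my own assumptions:

```python
# Worked check that the AI undercuts the human in BOTH goods, so
# comparative advantage never gets a chance to operate.

HOURS_PER_DAY = 8  # assumed workday length

def rm_cost_per_staple(overhead_rm_per_day, staples_per_hour):
    """Total RM cost of one staple: hourly RM inputs plus daily overhead."""
    rm_inputs = HOURS_PER_DAY * 1  # 1 unit of RM consumed per hour worked
    staples = HOURS_PER_DAY * staples_per_hour
    return (rm_inputs + overhead_rm_per_day) / staples

# Human: 1 RM/day subsistence (his minimum "wage"), 2 staples per hour.
human_cost = rm_cost_per_staple(overhead_rm_per_day=1, staples_per_hour=2)  # 0.5625
# AI: 1 RM/day maintenance, 3 staples per hour.
ai_cost = rm_cost_per_staple(overhead_rm_per_day=1, staples_per_hour=3)     # 0.375

# With perfect competition among copyable AIs, the staple price is driven
# down to ai_cost, which is below the lowest price at which the human can
# work and still eat. The same dominance holds for paperclips (1/hour vs.
# 3/hour), so the human is undercut in both goods despite his seeming
# comparative advantage in staples.
assert ai_cost < human_cost
```

Under these assumptions no employer hires the human at any wage that covers his 1 RM/day subsistence, which is the starvation outcome described above.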

Not sure I understand. Please explain more? Also do you have a concrete suggestion or change you'd like to see?

In a previous comment you talked about the importance of "the problem of solving the bargaining/cooperation/mutual-governance problem that AI-enhanced companies (and/or countries) will be facing". I wonder if you've written more about this problem anywhere, and why you didn't mention it again in the comment that I'm replying to.

My own thinking about 'the ~50% extinction probability I’m expecting from multi-polar interaction-level effects coming some years after we get individually “safe” AGI systems up and running' is that if we've got "safe" AGIs, we could ask them to solve the "bargaining/cooperation/mutual-governance problem" for us but that would not work if they're bad at solving this kind of problem. Bargaining and cooperation seem to be in part philosophical problems, so this fits into my wanting to make sure that we'll build AIs that are philosophically competent.

ETA: My general feeling is that there will be too many philosophical problems like these during and after the AI transition, and it seems hopeless to try to anticipate them all and solve them individually ahead of time (or solve them later using only human intelligence). Instead we might have a better chance of solving the "meta" problem. Of course buying time with compute regulation seems great if feasible.

Yes, and specifically worse even in terms of probability of human extinction.

Why? I'm also kind of confused why you even mention this issue in this post, like are you thinking that you might potentially be in a position to impose your views? Or is this a kind of plea for others who might actually face such a choice to respect democratic processes?
