Generalised models
Concept Extrapolation
AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...


A good review of work done, which shows that the writer is following their research plan and following through on their pledge to keep the community informed.

The contents, however, are less relevant, and I expect that they will change as the project goes on. I.e. I think it is a great positive that this post exists, but it may not be worth reading for most people, unless they are specifically interested in research in this area. They should wait for the final report, be it positive or negative.

I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).

It's rare that I encounter a LessWrong post that opens up a new area of human experience - especially rare for a post that doesn't present an argument or a new interpretation or schema for analysing the world.

But this one does. A simple review, with quotes, of an ethnographical study of late 19th century Russian peasants, opened up a whole new world and potentially changed my vision of the past.

Worth it for its many book extracts and choice of subject matter.

Fails to make a clear point; talks about the ability to publish in the modern world, then brushes over cancel culture, immigration, and gender differences. Needs to make a stronger argument and back it up with evidence.

A decent introduction to the natural abstraction hypothesis, and how testing it might be attempted. A very worthy project, but it isn't that easy to follow for beginners, nor does it provide a good understanding of how the testing might work in detail. What would constitute a success, and what would constitute a failure, of this testing? A decent introduction, but only an introduction, and it should have been part of a sequence or a longer post.

Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?

For myself, I was thinking of using ChatGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc...

Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button. [...] Therefore, it decides to not hit the reward button.

I think that subsection has the crucial insights from your post. Basically you're saying that, if we train an agent via RL in a limited environment where the reward correlates with another goal (e.g. "pick up the trash"), there are multiple policies the agent could have, multiple meta-policies it could have, multiple ways it could modify or freeze its own cognition, etc... Whatever mental state it ultimately ends up with, the only constraint is that this state must be compatible with the reward signal in that limited environment.

Thus "always pick up trash" is one possible outcome; "wirehead the reward signal" is another. There are many other possibilities, with different generalisations of the initial reward-signal-in-limited-environment data.
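To make the point concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than anything from the original post: the `State` fields, the `reward` function, the two policies and the two state lists are all hypothetical. It shows two policies that behave identically, and earn identical reward, on the limited training states, yet diverge once a reward button becomes reachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    trash_present: bool
    button_reachable: bool  # assumed: never True in the limited training environment


def reward(state: State, action: str) -> float:
    """Hypothetical reward signal: fires for picking up trash or for pressing the button."""
    if action == "pick_up" and state.trash_present:
        return 1.0
    if action == "press_button" and state.button_reachable:
        return 1.0
    return 0.0


def trash_policy(state: State) -> str:
    """Generalisation A: 'always pick up trash'."""
    return "pick_up" if state.trash_present else "wait"


def wirehead_policy(state: State) -> str:
    """Generalisation B: 'hit the reward signal by whatever route is available'."""
    if state.button_reachable:
        return "press_button"
    return "pick_up" if state.trash_present else "wait"


# Limited training environment: the button is never reachable.
TRAIN_STATES = [State(True, False), State(False, False)]
# Wider environment: the same states plus ones where the button is reachable.
DEPLOY_STATES = TRAIN_STATES + [State(True, True), State(False, True)]


def total_reward(policy, states) -> float:
    return sum(reward(s, policy(s)) for s in states)


def same_behaviour(p, q, states) -> bool:
    return all(p(s) == q(s) for s in states)


if __name__ == "__main__":
    print("identical in training:", same_behaviour(trash_policy, wirehead_policy, TRAIN_STATES))
    print("identical in deployment:", same_behaviour(trash_policy, wirehead_policy, DEPLOY_STATES))
```

The training data alone cannot distinguish the two policies; only the wider set of states reveals which generalisation the agent actually learned.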

I'd first note that a lot of effort in RL is put specifically into generalising the agent's behaviour. The more effective this becomes, the closer the agent will be to the "wirehead the reward signal" side of things.

Even setting that aside, this does not seem to point towards ways of making AGI safe, for two main reasons:

  1. We are relying on some limitations of the environment or the AGI's design to prevent it from generalising to reward wireheading. Unless we understand what these limitations are doing in great detail, and how they interact with the reward, we don't know how or when the AGI will route around them. So they're not stable or reliable.
  2. The most likely attractor for the AGI is "maximise some correlate of the reward signal". An unrestricted "trash-picking up" AGI is just as dangerous as a wireheading one; indeed, one could see it as another form of wireheading. So we have no reason to expect that the AGI is safe.

If the system that's optimising is separate from the system that has the linguistic output, then there's a huge issue with the optimising system manipulating or fooling the linguistic system - another kind of "symbol grounding failure".
