Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.


Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger


I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I'm making an upwards update on these points having been understood by at least some thinkers, although I've also made a lot of downward updates for other reasons.

(Huh, I never saw this -- maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)

I really appreciate these thoughts!

But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.

I would say "that isn't how on-policy RL works; it doesn't just intelligently find increasingly high-reinforcement policies; which reinforcement events get 'exploited' depends on the exploration policy." (You seem to guess that this is my response in the next sub-bullets.)

While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment. 

shrug, too good to be true isn't a causal reason for it to not work, of course, and I don't see something suspicious in the correlations. Effective learning algorithms may indeed have nice properties we want, especially if some humans have those same nice properties due to their own effective learning algorithms! 

The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the "playing the training game" scenario.

This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that "the model has been shaped to maximize its reward by any means necessary"? This is likewise not weak evidence for playing the training game. 

Thanks for the reply. This comment is mostly me disagreeing with you.[1] But I really wish someone had said the following things to me before I spent thousands of hours thinking about optimal policies. 

I agree that learning a goal from the training-compatible set is a strong assumption that might not hold. 

My point is not just that this post has made a strong assumption which may not hold. My point is rather that these results are not predictive because the assumption won't hold. The assumptions are already known to not be good approximations of trained policies, in at least some prototypical RL situations. I also think that there is not good a priori reason to have expected "training-compatible" "goals" to be learned. According to me, "learning and optimizing a reward function" is both muddy communication and doesn't actually seem to happen in practice. 

This post assumes a standard RL setup and is not intended to apply to LLMs

I don't see any formal assumption which excludes LLM finetuning. Which assumption do you think should exclude them?

I agree that reward functions are not the best way to refer to possible goals. This post builds on the formalism in the power-seeking paper which is based on reward functions, so it was easiest to stick with this terminology.

This point is, in large part, my fault. As I argued in my original comment, this terminology makes readers actively worse at reasoning about realistic trained systems. I regret each of the thousands of hours I spent on the power-seeking work, and sometimes fantasize about retracting one or both papers.

I can talk about utility functions instead (which would be equivalent to value functions in this case)

I disagree that these are equivalent, and expect the policy and value function to come apart in practice. Indeed, that was observed in the original goal misgeneralization paper (3.3, actor-critic inconsistency). 

Anyways, we can talk about utility functions, but then we're going to lose claim to predictiveness, no? Why should we assume that the network will internally represent a scalar function over observations, consistent with a historical training signal's scalar values (and let's not get into nonstationary reward), such that the network will maximize discounted sum return of this internally represented function? That seems highly improbable to me, and I don't think reality will be "basically that" either.

I think it is pretty clear in the post that I'm not talking about reinforcement functions and the training reward is not the optimization target, but I could clarify this further if needed.

I agree that you don't assume the network will optimize the training reward. But that's not the critique I intended to communicate. The post wrote (emphasis added):

Suppose an agent is trained using reinforcement learning with reward function . We assume that the agent learns a goal during the training process: a set of internal representations of favored and disfavored outcomes. For simplicity, we assume this is equivalent to learning a reward function, which is not necessarily the same as the training reward function . We consider the set of reward functions that are consistent with the training rewards received by the agent, in the sense that agent's behavior on the training data is optimal for these reward functions. We call this the training-compatible goal set, and we expect that the agent is most likely to learn a reward function from this set.

This is talking about the reward/reinforcement function , no? And assuming that the policy will be optimal on training? As I currently understand it, this post makes unsupported and probably-wrong claims/assumptions about the role and effect of the reinforcement function. (EG assuming that using a reinforcement function on the network, means that the network learns an internally represented reinforcement function which it maximizes and which is consistent with historically observed reinforcements)

I think utility functions are still the best formalism we have to represent goals, and I don't have a clear sense of the alternative you are proposing.

To be clear, I'm not proposing an alternative formalism. None of my comment intended to make positive shard theory claims. Whether or not we know of an alternative formalism, I currently feel confident that your results are not predictive and furthermore cast RL in an unrealistic light. This is inconvenient since I don't have a better formalism to suggest, but I think it's still true.

  1. ^

    ETA: For the record, I upvoted both of your replies to me in this thread, and appreciate your engagement and effort. 

I still expect instrumental convergence from agentic systems with shard-encoded goals, but think this post doesn't offer any valid argument for that conclusion. 

I don't think these results cover the shard case. I don't think reward functions are good ways of describing goals in settings I care about. I also think that realistic goal pursuit need not look like "maximize time-discounted sum of a scalar quantity of world state." 

My point is not that instrumental convergence is wrong, or that shard theory makes different predictions. I just think that these results are not predictive of trained systems. 

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

You can use ChatGPT without helping train future models:

What if I want to keep my history on but disable model training?

...you can opt out from our use of your data to improve our services by filling out this form. Once you submit the form, new conversations will not be used to train our models.

Beforehand I was very confident that vector additions would work here, even though I knew that the fully connected additions didn't work. Before showing him the results, but after showing the results for the fully connected network, I asked TurnTrout for his prediction. He gave 85% that the additions would work.

I want to clarify that I had skimmed the original results and concluded that they "worked" in that 3-1 vectors got e.g. 1s to be classified as 3s. (This is not trivial, since not all 1 activations are the same!) However, those results "didn't work" in that they destroyed performance on non-1 images. 

I thought I was making predictions on whether 3-1 vectors get 1s to be classified as 3s by this residual network. I guess I'm going to mark my prediction here as "ambiguous", in that case.

I think that capacity would be really nice. I think our results are maybe a very very rough initial version of that capacity. I want to caution that we should be very careful about making inferences about what concepts are actually used by the model. From a footnote:

Of course, there need not be a "wedding" feature direction in GPT-2-XL. What we have observed is that adding certain activation vectors will reliably produce completions which appear to us to be "more about weddings." This could take place in many ways, and we encourage people to avoid instantly collapsing their uncertainty about how steering vectors work. 

Load More