Instructor at Center for Applied Rationality
Gwern, I'm curious whether you would guess that something like mesa-optimization, broadly construed, is happening in GPT-3?
This post primarily argues that a phenomenon is evidence for [learned models being likely to encode search algorithms]
I do mention interpreting the described results "as tentative evidence" about mesa-optimization at the end of the post, and this interpretation was why I wrote the post; fwiw, my impression remains that this interpretation is correct. But the large majority of the post is just me repeating or paraphrasing claims made by DeepMind researchers, rather than making claims myself; I wrote it this way intentionally, since I didn't feel I had sufficient domain knowledge to assess the researchers' claims well myself.
I feel confused about why, given your model of the situation, the researchers were surprised that this phenomenon occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just "weren't very familiar with AI." Looking at the author list, and at their publications (1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. While most of the eight co-authors are neuroscientists by training, three have CS degrees (one of whom is Demis Hassabis), and all but one have co-authored previous ML papers. It's hard for me to imagine that their surprise was due simply to their lacking basic knowledge about RL.
And this OpenAI paper (whose authors I think you would describe as familiar with ML), which the summary of Wang et al. on the DeepMind website describes as "closely related work," and which appears to me to describe a very similar setup, describes their result in similar terms:
We structure the agent as a recurrent neural network, which receives past rewards, actions, and termination flags as inputs in addition to the normally received observations. Furthermore, its internal state is preserved across episodes, so that it has the capacity to perform learning in its own hidden activations. The learned agent thus also acts as the learning algorithm, and can adapt to the task at hand when deployed.
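To make the quoted setup concrete, here is a minimal sketch of the data flow it describes, as I understand it. It is not the paper's implementation: the class and function names, the two-armed bandit, the layer sizes, and the random untrained weights are all assumptions chosen for illustration. It shows only the two structural features the quote names: the previous action, reward, and termination flag are fed back in alongside the observation, and the hidden state is carried across episode boundaries within a trial.

```python
import numpy as np


class RecurrentMetaAgent:
    """Untrained toy RNN policy (hypothetical); illustrates data flow only."""

    def __init__(self, obs_dim, n_actions, hidden_dim=16, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_actions = n_actions
        # Input = observation + one-hot previous action + previous reward + done flag.
        in_dim = obs_dim + n_actions + 2
        self.W_in = self.rng.normal(0.0, 0.1, (hidden_dim, in_dim))
        self.W_h = self.rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.W_out = self.rng.normal(0.0, 0.1, (n_actions, hidden_dim))
        self.h = np.zeros(hidden_dim)

    def reset_trial(self):
        # Called once per trial (i.e. per new task), not once per episode.
        self.h = np.zeros_like(self.h)

    def build_input(self, obs, prev_action, prev_reward, prev_done):
        one_hot = np.zeros(self.n_actions)
        if prev_action is not None:
            one_hot[prev_action] = 1.0
        return np.concatenate([obs, one_hot, [prev_reward, float(prev_done)]])

    def act(self, obs, prev_action, prev_reward, prev_done):
        x = self.build_input(obs, prev_action, prev_reward, prev_done)
        # Any within-trial adaptation has to live in this hidden-state update,
        # because the weights are fixed at deployment time.
        self.h = np.tanh(self.W_in @ x + self.W_h @ self.h)
        logits = self.W_out @ self.h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(self.rng.choice(self.n_actions, p=probs))


def run_trial(agent, arm_probs, n_episodes=3, steps_per_episode=5, seed=1):
    """Run several episodes on one bandit task without resetting hidden state."""
    env_rng = np.random.default_rng(seed)
    agent.reset_trial()
    obs = np.zeros(1)  # a bandit has no informative observation
    prev_action, prev_reward, prev_done = None, 0.0, False
    history = []
    for episode in range(n_episodes):
        for t in range(steps_per_episode):
            action = agent.act(obs, prev_action, prev_reward, prev_done)
            reward = float(env_rng.random() < arm_probs[action])
            done = t == steps_per_episode - 1
            history.append((episode, action, reward, done))
            prev_action, prev_reward, prev_done = action, reward, done
        # Deliberately no reset here: state persists across episodes.
    return history


if __name__ == "__main__":
    agent = RecurrentMetaAgent(obs_dim=1, n_actions=2)
    for step in run_trial(agent, arm_probs=[0.2, 0.8]):
        print(step)
```

Because the weights here are random, this agent does not learn anything; the sketch only shows where a learned procedure could live if an outer RL algorithm trained the weights across many sampled tasks.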
The OpenAI authors also seem to me to think they can gather evidence about the structure of the algorithm simply by looking at its behavior. Given a similar series of experiments (mostly bandit tasks, but also a maze solver), they conclude:
the dynamics of the recurrent network come to implement a learning algorithm entirely separate from the one used to train the network weights... the procedure the recurrent network implements is itself a full-fledged reinforcement learning algorithm, which negotiates the exploration-exploitation tradeoff and improves the agent’s policy based on reward outcomes... this learned RL procedure can differ starkly from the algorithm used to train the network’s weights.
They then run an experiment designed specifically to distinguish whether meta-RL was giving rise to a model-free system, or “a model-based system which learns an internal model of the environment and evaluates the value of actions at the time of decision-making through look-ahead planning,” and suggest the evidence implies the latter. This sounds like a description of search to me—do you think I'm confused?
I get the impression from your comments that you think it's naive to describe this result as "learning algorithms spontaneously emerge." You describe the lack of LW/AF pushback against that description as "a community-wide failure," and mention updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”
But my impression is that OpenAI describes their similar result in basically the same way. Do you think my impression is wrong? Or e.g. that their description is also misleading?
I've been feeling very confused lately about how people talk about "search," and have started joking that I'm a search panpsychist. Lots of interesting phenomena look like piles of thermostats when viewed from the wrong angle, and I worry the conventional lens is deceptively narrow.
That said, when I condition on (what I understand to be) the conventional understanding, it's difficult for me to imagine how e.g. the maze-solver described in the OpenAI paper reliably and quickly locates the exit to new mazes, without doing something reasonably describable as searching for them.
And it seems to me that Wang et al. should be taken as evidence that "learning algorithms producing other search-performing learning algorithms" is convergently useful/likely to be a common feature of future systems, even if you don't think that's what happened in their paper, assuming you assign some credence to their hypothesis that this is what's going on in PFC, and to the hypothesis that search occurs in PFC.
If the primary difference between the DeepMind and OpenAI meta-RL architectures and the PFC/DA architecture is scale, then I think there's good reason to suspect that something much like mesa-optimization will emerge in future meta-RL systems, even if it hasn't yet. That is, I interpret this result as evidence for the hypothesis that highly competent general-ish learners might tend to exhibit this feature, since (among other reasons) it increased my credence that it is already exhibited by the only existing member of that reference class.
Upthread, Evan mentions agreeing that this result is "not new evidence in favor of mesa-optimization." But he also mentions that Risks from Learned Optimization references these two papers, describing them as "the closest to producing mesa-optimizers of any existing machine learning research." I feel confused about how to reconcile these two claims. I didn't realize these papers were mentioned in Risks from Learned Optimization, but if I had, I think I would have been even more inclined to post this/try to ensure people knew about the results, since my (perhaps naive, perhaps not understanding ways this is disanalogous) prior is that the closest existing example to this problem might provide evidence about its nature or likelihood.
In college, people would sometimes discuss mu-eliciting questions like, "What does it mean to be human?"
I came across this line in a paper tonight and laughed out loud, imagining it as an answer:
"Maximizing this objective is equivalent to minimizing the cumulative pseudo-regret."
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
That said, I feel confused by a number of your arguments, so I'm working on a reply. Before I post it, I'd be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.
I currently understand you to be making four main claims:
Does this summary feel like it reasonably characterizes your objection?
That gwern essay was helpful, and I didn't know about it; thanks.
The scenario I had in mind was one where death occurs as a result of damage caused by low food consumption, rather than by suicide.
One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I'd expect that a failure mode leading to systematically low DA rewards would usually be corrected
I'm not sure such divorce is all that rare. For example, anorexia sometimes causes people to find food anti-rewarding (repulsive/inedible, even when they're dying and don't wish to), and I can imagine that being because PFC actually somehow alters DA's reward function.
That said, I do share the hunch that something like a "divorce resistance" trick occurs and is helpful. I took Kaj and Steve to be gesturing at something similar elsewhere in the thread. But I notice feeling confused about how exactly this trick works. Does it scale...?
I have the intuition that it doesn't, and that as the systems increase in power, divorce will occur more easily. That is, I have the intuition that if PFC were trying, so to speak, to divorce itself from DA supervision, it could probably find some easy-ish way to succeed, e.g. by reconfiguring itself to hide activity from DA, or to send reward-eliciting signals to DA regardless of what goal it was pursuing.
I think it makes more sense to operationalize "catastrophic" here as "leading to systematically low DA reward"
Thanks—I feel pretty convinced that this operationalization makes more sense than the one I proposed.
That's a really interesting point, and I hadn't considered it. Thanks!