Former AI safety research engineer, now AI governance researcher at OpenAI. Blog:


Shaping safer goals
AGI safety from first principles

Okay. I'm not sure what to tell you. This lands for me like "I don't understand how you think turning on the burner is related to the process of cooking the soup." Um… it just is? I already described the mechanisms, so I think the communication gap is somewhere I don't see.

I think you interpreted this as incredulity, whereas I meant it as "I don't understand the specific links" (e.g. is recognizing the illusion most of the work, or only a small part? What stops you from healing traumas without recognizing the illusion? etc). I've edited to clarify.

I don't understand how specifically you think the process of recognizing the illusion is related to the process of healing traumas. But I also object to ideas like "you need to orient towards your fear as an illusion first and only then do useful work", for roughly the same reasons (in particular, the way it's all-or-nothing). So I'll edit my original comment to clarify that this is a more central/less strawmanny objection.

I think the fact that the people most scared of AGI risk are also the ones most scared of not developing AGI should raise some eyebrows.

Very nice observation.

I think there's a bunch of useful stuff here. In particular, I think that decisions driven by deep-rooted fear are often very counterproductive, and that many rationalists often have "emergency mobilization systems" running in ways which aren't conducive to good long-term decision-making. I also think that paying attention to bodily responses is a great tool for helping fix this (and in fact was helpful for me in defusing annoyance when reading this post). But I want to push back on the way in which it's framed in various places as an all-or-nothing: exit the game, or keep playing. Get sober, or stay drunk. Hallucination, not real fear.

In fact, you can do good and important work while also gradually coming to terms with your emotions, trying to get more grounded, and noticing when you're making decisions driven by visceral fear and taking steps to fix that. Indeed, I expect that almost all good and important work throughout history has been done by people who are at various stages throughout that process, rather than people who first dealt with their traumas and only then turned to the work. (EDIT: in a later comment, Valentine says he doesn't endorse the claim that people should deal with traumas before doing the work, but does endorse the claim that people should recognize the illusion before doing the work. So better to focus on the latter (I disagree with both).)

(This seems more true for concrete research, and somewhat (but less) true for thinking about high-level strategy. In general it seems that rationalists spend way too much of their time thinking about high-level strategic considerations, and I agree with some of Valentine's reasoning about why this happens. Instead I'd endorse people trying to be much more focused on making progress in a few concrete areas, rather than trying to track everything which they think might be relevant to AI risk. E.g. acceleration is probably bad, but it's fundamentally a second-order effect, and the energy focused on all but the biggest individual instances of acceleration would probably be better used to focus on first-order effects.)

In other words, I want to offer people the affordance to take on board the (many) useful parts of Valentine's post without needing to buy into the overall frame in which your current concerns are just a game, and your fear is just a manifestation of trauma.

(Relatedly, from my vantage point it seems that "you need to do the trauma processing first and only then do useful work" is a harmful self-propagating meme in much the same way as "you need to track and control every variable in order for AI to go well". Both identify a single dominant consideration which requires your full focus and takes precedence over all others. However, I still think that the former is directionally correct for most rationalists, just as the latter is directionally correct for most non-rationalists.)

I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.

Apart from nanotech, what are the main examples or arguments you would cite in favor of these claims?

Separately, how close is your conception of nanotech to "atomically precise manufacturing", which seems like Drexler's preferred framing right now?

I agree that "buying time" isn't a very useful category. Some thoughts on the things which seem to fall under the "buying time" category:

  • Evaluations
    • I think people should mostly consider this as a subcategory of technical alignment work, in particular the work of understanding models. The best evaluations will include work that's pretty continuous with ML research more generally, like fine-tuning on novel tasks, developing new prompting techniques, and application of interpretability techniques.
  • Governance work, some subcategories of which include:
    • Lab coordination: work on this should mainly be done in close consultation with people already working at big AI labs, in order to understand the relevant constraints and opportunities
    • Policy work: see standard resources on this
    • Various strands of technical work which are useful for the above
  • Outreach
    • One way to contribute to outreach is doing logistics for outreach programs (like the AGI safety fundamentals course)
    • Another way is to directly engage with ML researchers
    • Both of these seem very different from "buying time" - or at least "outreach to persuade people to become alignment researchers" doesn't seem very different from "outreach to buy time somehow"

I don't understand what you mean by "previous reward functions".

I can't tell if you're being uncharitable or if there's a way bigger inferential gap than I think, but I do literally just mean... reward functions used previously. Like, people did reinforcement learning before RLHF. They used reward functions for StarCraft and for Go and for Atari and for all sorts of random other things. In more complex environments, they used curiosity and empowerment reward functions. And none of these are the type of reward function that would withstand much optimization pressure (except insofar as they only applied to domains simple enough that it's hard to actually achieve "bad outcomes").

Cool, makes sense.

I don't see any principled distinction between RLHF and other standard reinforcement-learning setups.

I think we disagree on how "principled" a method needs to be in order to constitute progress. RLHF gives rewards which can withstand more optimization before producing unintended outcomes than previous reward functions. Insofar as that's a key metric we care about, it counts as progress. I'd guess we'd both agree that better RLHF and also techniques like debate will further increase the amount of optimization our reward functions can withstand, and then the main crux is whether that's anywhere near the ballpark of the amount of optimization they'll need to withstand in order to automate most alignment research.

When people 10 years ago started discussing the outer alignment problem (though with slightly different names), reinforcement learning was the classical example that was used to demonstrate why the outer alignment problem is a problem in the first place.

Got any sources for this? Feels pretty different if the problem was framed as "we can't write down a reward function which captures human values" versus "we can't specify rewards correctly in any way". And in general it's surprisingly tough to track down the places where Yudkowsky (or others?) said all these things.

Good question. I imagine the first head mostly being trained on existing data (e.g. text, videos) but then when it comes to data gathered by the network itself, my default story is that it'd be trained to output predictions conditional on actions, so that it's not duplicating the learning done by the action head. But this is all fairly speculative and either seems reasonable.
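To make the "predictions conditional on actions" idea a bit more concrete, here is a rough forward-pass sketch of one way such a network could be laid out. Everything in it is an assumption for illustration: the class name `TwoHeadedNet`, the shared trunk, the one-hot action encoding, and the layer sizes are all hypothetical, and no training is shown. It only illustrates that the prediction head takes the action as an input, while the action head does not.

```python
import math
import random


def init_layer(rng, n_out, n_in):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Apply a dense layer to the vector x (a list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class TwoHeadedNet:
    """Hypothetical layout: a shared trunk feeding a prediction head and an action head."""

    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.n_actions = n_actions
        self.trunk = init_layer(rng, hidden_dim, obs_dim)
        # Assumption: the prediction head sees trunk features plus a one-hot action,
        # so its output is a prediction conditional on that action.
        self.pred_head = init_layer(rng, obs_dim, hidden_dim + n_actions)
        # The action head sees only the trunk features.
        self.action_head = init_layer(rng, n_actions, hidden_dim)

    def features(self, obs):
        return [math.tanh(v) for v in linear(*self.trunk, obs)]

    def predict_next(self, obs, action):
        """Predicted next observation, given the current observation and an action."""
        one_hot = [1.0 if i == action else 0.0 for i in range(self.n_actions)]
        return linear(*self.pred_head, self.features(obs) + one_hot)

    def action_probs(self, obs):
        """Softmax distribution over actions from the action head."""
        logits = linear(*self.action_head, self.features(obs))
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


net = TwoHeadedNet(obs_dim=4, hidden_dim=8, n_actions=3)
obs = [0.1, -0.2, 0.3, 0.0]
probs = net.action_probs(obs)
pred_if_action_0 = net.predict_next(obs, action=0)
pred_if_action_1 = net.predict_next(obs, action=1)
```

Because the action is an input to the prediction head, the same observation yields different predictions for different actions, and choosing which action to take is left entirely to the action head. For existing data with no recorded action (text, videos), this sketch would need some placeholder action input; how to handle that is left open here, as in the comment above.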
