Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See for a summary of my research and sorted list of writing. Email: Also on Twitter, Mastodon, Threads. Physicist by training.


Intro to Brain-Like-AGI Safety


I think in much, much simpler animals, valence is a literal specific signal in the brain, basically the collective spiking activity of a population of dopamine neurons. In mammals, that’s still sorta-close-to-true, but I would need to add a whole bunch of caveats and footnotes to that, for reasons hinted at in §1.5.6–1.5.7.

(I have a bunch of idiosyncratic opinions about what exactly the basal ganglia is doing and how, but I don’t want to get into it here, sorry!)

I reject both the “first” and the “second” thing you mention. I’m much closer to “valence is pretty straightforwardly encoded by spikes going down specific known axons”.

Separately, I might or might not agree with “the neural bases of emotions are widely distributed”, depending on how we define the word “emotions” (and also how we define “neural bases”, I suppose!), see here.

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.

I dunno, I wrote “invalid (or at least, open to question)”. I don’t think that’s too strong. Like, just because it’s “open to question”, doesn’t mean that, upon questioning it, we won’t decide it’s fine. I.e., it’s not that the conclusion is necessarily wrong, it’s that the original argument for it is flawed.

Of course I agree that the morning paper thing would probably be fine for humans, unless the paper somehow triggered an existential crisis, or I tried a highly addictive substance while reading it, etc.  :)

Some relevant context is: I don’t think it’s realistic to assume that, in the future, AI models will be only slightly fine-tuned in a deployment-specific way. I think the relevant comparison is more like “can your values change over the course of years”, not “can your values change after reading the morning paper?”

Why do I think that? Well, let’s imagine a world where you could instantly clone an adult human. One might naively think that there would be no more on-the-job learning ever. Instead (one might think), if you want a person to help with chemical manufacture, you open the catalog to find a human who already knows chemical manufacturing, and order a clone of them; and if you want a person to design widgets, you go to a different catalog page, and order a clone of a human widget design expert; and so on.

But I think that’s wrong.

I claim there would be lots of demand to clone a generalist—a person who is generally smart and conscientious and can get things done, but not specifically an expert in metallurgy or whatever the domain is. And then, this generalist would be tasked with figuring out whatever domains and skills they didn’t already have.

Why do I think that? Because there are just too many possible specialties, and especially combinations of specialties, for a pre-screened clone-able human to already exist in each of them. Like, think about startup founders. They’re learning how to do dozens of things. Why don’t they outsource their office supply questions to an office supply expert, and their hiring questions to a hiring expert, etc.? Well, they do to some extent, but there are coordination costs, and more importantly the experts would lack all the context necessary to understand what the ultimate goals are. What are the chances that there’s a pre-screened clone-able human that knows about the specific combination of things that a particular application needs (rural Florida zoning laws AND anti-lock brakes AND hurricane preparedness AND …)?

So instead I expect that future AIs will eventually do massive amounts of figuring-things-out in a nearly infinite variety of domains, and moreover that the figuring out will never end. (Just as the startup founder never stops needing to learn new things, in order to succeed.) So I don’t like plans where the AI is tested in a standardized way, and then it’s assumed that it won’t change much in whatever one of infinitely many real-world deployment niches it winds up in.

I find your text confusing. Let’s go step by step.

  • AlphaZero-chess has a very simple reward function: +1 for getting checkmate, -1 for opponent checkmate, 0 for draw.
  • A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function): its judgments of which board positions are good or bad contain millions of bits of complexity.
  • If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
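To make the first two bullets concrete, here is a minimal sketch in Python. It is not AlphaZero: I'm swapping in tic-tac-toe for chess and exhaustive minimax search for self-play training, and every name in it (`reward`, `value`, `values`) is made up for illustration. The point is only the size mismatch: the reward function is a few lines, while the table of "which positions are good or bad" derived from it has thousands of entries.

```python
# Toy stand-in for the AlphaZero bullets: tic-tac-toe + exhaustive minimax.
# Assumption: this is NOT how AlphaZero works; it only illustrates that a
# tiny reward function can induce a much larger table of position values.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def reward(board):
    """The entire reward function, from X's perspective."""
    w = winner(board)
    return 1 if w == "X" else -1 if w == "O" else 0


values = {}  # position -> game-theoretic value for X (the "preferences")


def value(board, player):
    """Minimax value of `board` with `player` to move, memoized in `values`."""
    if board in values:
        return values[board]
    if winner(board) is not None or "." not in board:
        v = reward(board)  # terminal position: the reward is the value
    else:
        nxt = "O" if player == "X" else "X"
        children = [value(board[:i] + player + board[i + 1:], nxt)
                    for i, cell in enumerate(board) if cell == "."]
        v = max(children) if player == "X" else min(children)
    values[board] = v
    return v


value("." * 9, "X")
print("entries in the value table:", len(values))
```

Replacing `reward` with its negation and rerunning would produce a table with the same positions but opposite judgments, which is the toy version of the third bullet.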

By analogy:

  • The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
  • A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least).
  • If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.

Do you agree with all that?

If so, then there’s no getting around that getting the right innate reward function is extremely important, right?

So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you.  :)

You correctly mention that not all AI risk is solved by AI control being easy, because AI misuse can still be a huge factor

It’s odd that you understood me as talking about misuse. Well, I guess I’m not sure how you’re using the term “misuse”. If Person X doesn’t follow best practices when training an AI, and they wind up creating an out-of-control misaligned AI that eventually causes human extinction, and if Person X didn’t want human extinction (as most people don’t), then I wouldn’t call that “misuse”. Would you? I would call it a “catastrophic accident” or something like that. I did mention in the OP that some people think human extinction is perfectly fine, and I guess if Person X is one of those people, then it would be misuse. So I suppose I brought up both accidents and misuse.

Misuse focused policy probably looks less technical, and more normal, for example Know Your Customer laws or hashing could be extremely important if we're worried about misuse of AI for say bioterrorism.

People who I think are highly prone to not following best practices to keep AI under control, even if such best practices exist, include people like Yann LeCun, Larry Page, Rich Sutton, and Jürgen Schmidhuber, who are either opposed to AI alignment on principle, or are so bad at thinking about the topic of AI x-risk that they spout complete and utter nonsense. (example). That’s not a problem solvable by Know Your Customer laws, right? These people (and many more) are not the customers, they are among the ones doing state-of-the-art AI R&D.

In general, the more people are technically capable of making an out-of-control AI agent, the more likely that one of them actually will, even if best practices exist to keep AI under control. People like to experiment with new approaches, etc., right? And I expect the number of such people to go up and up, as algorithms improve etc. See here.

If KYC laws aren’t the answer, what is? I don’t know. I’m not advocating for any particular policy here.

I agree with some of this, but I'd say Story 1 applies only very weakly, and that the majority/supermajority of value learning is online, for example via the self-learning/within lifetime-RL algorithms you describe, without relying on the prior. In essence, I agree with the claim that the genes need to impose a prior, which prevents pure blank-slatism from working. I disagree with the claim that this means that genetics need to impose a very strong prior without relying on the self-learning algorithms you describe for capabilities.

You keep talking about “prior” but not mentioning “reward function”. I’m not sure why. For human children, do you think that there isn’t a reward function? Or there is a reward function but it’s not important? Or do you take the word “prior” to include reward function as a special case?

If it’s the latter, then I dispute that this is an appropriate use of the word “prior”. For example, you can train AlphaZero to be superhumanly skilled at winning at Go, or if you flip the reward function then you’ll train AlphaZero to be superhumanly skilled at losing at Go. The behavior is wildly different, but is the “prior” different? I would say no. It’s the same neural net architecture, with the same initialization and same regularization. After 0 bits of training data, the behavior is identical in each case. So we should say it’s the same “prior”, right?
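Here is a toy sketch of that point in Python, under loud assumptions: it is a three-action bandit with a tabular epsilon-greedy learner, not Go and not AlphaZero, and all the names (`make_prior`, `train`, `reward`, `flipped_reward`) are hypothetical. The "prior" here is just the seeded initialization plus the learning rule, which is identical in both runs; the only thing that differs is the sign of the reward.

```python
import random


def make_prior(seed, n_actions=3):
    """Initial action-value estimates; stands in for 'same architecture, same init'."""
    rng = random.Random(seed)
    return [rng.uniform(-0.01, 0.01) for _ in range(n_actions)]


def greedy(q):
    """Action the learner currently prefers."""
    return max(range(len(q)), key=lambda a: q[a])


def train(q, reward_fn, steps=2000, lr=0.1, eps=0.2, seed=0):
    """Epsilon-greedy tabular learning; returns a new value table."""
    rng = random.Random(seed)
    q = list(q)
    for _ in range(steps):
        explore = rng.random() < eps
        a = rng.randrange(len(q)) if explore else greedy(q)
        q[a] += lr * (reward_fn(a) - q[a])
    return q


def reward(a):
    """Toy reward: action 0 'loses', action 1 'draws', action 2 'wins'."""
    return {0: -1.0, 1: 0.0, 2: 1.0}[a]


def flipped_reward(a):
    """Same environment, sign of the reward flipped."""
    return -reward(a)


prior_a = make_prior(seed=42)
prior_b = make_prior(seed=42)

# After 0 bits of training data, the two learners are indistinguishable.
print("same prior:", prior_a == prior_b)

win_seeker = train(prior_a, reward)
loss_seeker = train(prior_b, flipped_reward)
print("preferred action with reward:        ", greedy(win_seeker))
print("preferred action with flipped reward:", greedy(loss_seeker))
```

If one wanted to fold the reward function into the word “prior”, the two runs above would have to count as having different priors despite starting from bit-identical tables, which is the usage I’m objecting to.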

(As I mentioned in the OP, on my models, there is a human innate reward function, and it’s absolutely critical to human prosocial behavior, and unfortunately nobody knows what that reward function is.)

Neither this post nor the essay it’s responding to is about policy proposals. So why talk about it? Two points:

  • As a general principle, if there are two groups who wildly disagree about the facts on the ground, but nevertheless (coincidentally) agree about what policies they favor, then I say they should still probably try to resolve their disagreements if possible, because it’s generally good to have accurate beliefs. For example, what if both of them are wrong? And the coincidence may not always hold anyway.
  • It’s not true that the only choice on offer is “Should we ever build ASI? Yes or no?” In fact, that choice (per se) is not on offer at all. What there is, is a gazillion conceivable laws that could be passed, all of which have a wide and idiosyncratic array of intended and unintended consequences. Beyond that, there are a gazillion individual decisions that need to be made, like what careers to pursue, what to donate to, whether to publish or not publish particular things, whether to pursue or not pursue particular lines of research, etc. etc. I find it extraordinarily unlikely that, if Person A thinks p(doom)=99% and Person B thinks p(doom)=1%, then they’re going to agree on all these gazillions of questions. (And empirically, it seems to be clearly not the case that the p(doom)=1% people and the p(doom)=99% people agree on questions of policy.)

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

I disagree that that distinction matters:

I think technical discussions of AI safety depend on the AI-algorithm-as-a-whole; I think “does the algorithm have such-and-such component” is not that helpful a question.

So for example, here’s a nightmare-scenario that I think about often:

  • (step 1) Someone reads a bunch of discussions about LLM x-risk
  • (step 2) They come down on the side of “LLM x-risk is low”, and therefore (they think) it would be great if TAI is an LLM as opposed to some other type of AI
  • (step 3) So then they think to themselves: Gee, how do we make LLMs more powerful? Aha, they find a clever way to build an AI that combines LLMs with open-ended real-world online reinforcement learning or whatever.

Even if (step 2) is OK (which I don’t want to argue about here), I am very opposed to (step 3), particularly the omission of the essential part where they should have said “Hey wait a minute, I had reasons for thinking that LLM x-risk is low, but do those reasons apply to this AI, which is not an LLM of the sort that I'm used to, but rather it’s a combination of LLM + open-ended real-world online reinforcement learning or whatever?” I want that person to step back and take a fresh look at every aspect of their preexisting beliefs about AI safety / control / alignment from the ground up, as soon as any aspect of the AI architecture and training approach changes, even if there’s still an LLM involved.  :)

I've written up an opinionated take on someone else's technical alignment agenda about three times, and each of those took me something like 100 hours. That was just to clearly state why I disagreed with it; forget about resolving our differences :)

For what it’s worth, I am not doing (and have never done) any research remotely similar to your text “maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively”.

I have a concise and self-contained summary of my main research project here (Section 2).

Update: I kinda regret this comment. I think when I wrote it I didn’t realize quite how popular the “Let’s figure out what Q* is!!” game is right now. It’s everywhere, nonstop.

It still annoys me as much as ever that so many people in the world are playing the “Let’s figure out what Q* is!!” game. But as a policy, I don’t ordinarily complain about extremely widespread phenomena where my complaint has no hope of changing anything. Not a good use of my time. I don’t want to be King Canute yelling at the tides. I un-downvoted. Whatever.
