Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Twitter: @steve47285. Physicist by training.


Intro to Brain-Like-AGI Safety


One-size-fits-all introductions are hard; different people are going to have different backgrounds and preconceptions which call for different resources.

But to answer your question, if I had to pick one, in the absence of any specific information about who it’s for, I think I’d go with Ben Hilton’s 80,000 Hours problem profile (August 2022).

You can do that using LeechBlock.

Wouldn’t that be “Optimizing for the output of a grader which evaluates plans”, where one of the items on the grading rubric is “This plan is in-distribution”?

I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that.

But it sounds like this will be the topic of Alex’s next essay.

So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from ‘Optimizing for the output of a grader which evaluates plans’, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P

Is there a reason you used the term “grader” instead of the AFAICT-more-traditional term “critic”? No big deal, I’m just curious.

I thought you could call a self-driving taxi without a backup driver, right now, but only in Phoenix and maybe one or two other cities?

if the AGI outputs the textbook of the future on alignment, I'd say we could understand that sufficiently well to be sure that our AI will be aligned/corrigible

It’s at least not obvious to me that we would be able to tell apart “the textbook of the future on alignment” from “an artifact that purports to be the textbook of the future on alignment, but is in fact a manipulative trick”. At least, I think we wouldn’t be able to tell them apart in the limit of superintelligent AI.

I mean, it’s not like I’ve never found an argument compelling and then realized much later on that it was wrong. Hasn’t everyone?

For the latter (toy environments) part, I think we need to distinguish a couple of possible types of argument:

  1. Alignment is a property of a training procedure. I.e., the goal is to find a training procedure that will reliably build aligned models, in whatever environment we run it in. We run that training procedure in sandbox environments, and it always builds aligned models. Next, we run that same training procedure (from scratch) in the real world, and we should expect it to likewise build an aligned model.
  2. Alignment is a property of a particular trained model. So we train a model in a sandbox, and verify that it’s aligned (somehow), and then use that very same trained model in the real world.

And also:

  • A. We’re going to have strong theoretical reasons to expect alignment, and we’re going to use sandbox testing to validate those theories.
  • B. We’re going to have an unprincipled approach that might or might not create aligned models, and we’re going to use sandbox testing to explore / tweak specific trained models and/or explore / tweak the training approach.

I think Nate is talking about 2 & B, and you’re talking about 1 & (not sure about A vs B).

I think that 2 is fraught because “aligned” has a different meaning in a sandbox versus the real world. In the sandbox, an “aligned” model would be trying to help / empower / whatever the sandbox inhabitants, and in the real world, an “aligned” model would be trying to help / empower / whatever “humanity”.

I think that 1 is potentially fraught too, at least in the absence of A, in that it’s conceivable that we’d find a training procedure that will reliably build aligned models when run in sandboxes while reliably building misaligned models when run in the real world.

I think A is definitely what I’m hoping for, and that 1 & A would make me very optimistic. I think B is pretty fraught, again for the same reason as above—no sandbox is exactly the same as the real world, with “the likely absence of real live humans in the sandbox” being a particularly important aspect of that.

I think there’s a continuum between B and A, and the more we can move from B towards A, the better I feel.

And I think my own time is better spent on trying to move from B towards A, compared to thinking through how to make the most realistic sandboxes possible. But I’m happy for you and anyone else to be doing the latter. And I’m also strongly in favor of people building tools and culture to make it more likely that future AGI programmers will actually do sandbox testing—I have advocated for one aspect of that here.

There was some related discussion here, to the effect that we could do something to try to make the AGI as verbal a thinker as possible, IIUC. (I endorse that as plausibly a good idea worth thinking about and trying. I don’t see it as sufficient / airtight.)

The hope is that this same mechanism which seems well suited for handling imprinting also works for grounding sexual attraction (as an elaboration of imprinting) and then more complex concepts like representations of other's emotions from facial expression, vocal tone, etc proxies, and then combining that with empathic simulation to ground a model of other's values/utility for social game theory, altruism, etc.

Yes, that is my hope too! And the main thing I’m working on most days is trying to flesh out the details.

I do agree the amygdala does seem like a good fit for the location of the learned symbol circuit, although at that point it raises the question of why not also just have the proxy in the amygdala? If the amygdala has the required inputs from LGN and/or V1 it would be my guess that it could also just colocate the innate proxy circuit. (I haven't looked in the lit to see if those connections exist)

For example, I claim that all the vision-related inputs to the amygdala have at some point passed through at least one locally-random filter stage (cf. “pattern separation” in neuro literature or “compressed sensing” in DSP literature). That’s perfectly fine if the amygdala is just going to use those inputs as feedstock for a supervised learning (SL) model. SL models don’t need to know a priori which input neuron is representing which object-level pattern, because they learn the connections, so if there’s some randomness involved, it’s fine. But the randomness would be a very big problem if the amygdala needs to use those input signals to calculate a ground-truth proxy.

As another example, a ground-truth proxy requires zero adjustable parameters (because how would you adjust them?), whereas a learning algorithm does well with as many adjustable parameters as possible, more or less.

So I see these as very different algorithmic tasks—so different that I would expect them to wind up in different parts of the brain, just on general principles.
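To make the contrast concrete, here is a toy sketch of my own (not a model of any actual brain circuit). It assumes the "locally-random filter stage" can be caricatured as a fixed random scrambling of input channels, which is a drastic simplification of pattern separation or compressed sensing. The names hardcoded_proxy, make_random_filter, and train_perceptron are hypothetical, and the perceptron is just a stand-in for "some SL model". The point it illustrates: a zero-parameter detector that assumes a fixed channel meaning drops to chance after scrambling, while a learner with adjustable weights recovers the signal.

```python
import itertools
import random

N = 8  # number of input channels (arbitrary toy size)

def make_random_filter(n, rng):
    """Fixed random channel scrambling (toy stand-in for a random filter stage).
    Re-drawn until channel 0 is displaced, so the effect is always visible."""
    perm = list(range(n))
    while perm[0] == 0:
        rng.shuffle(perm)
    return perm

def scramble(x, perm):
    """Output channel i carries raw feature perm[i]."""
    return tuple(x[p] for p in perm)

def hardcoded_proxy(x):
    """Zero adjustable parameters: assumes channel 0 is the feature of interest."""
    return x[0]

def train_perceptron(samples, labels, max_epochs=100):
    """Plain perceptron: learns which channel matters from labeled examples."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                step = y - pred  # +1 or -1
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step
                mistakes += 1
        if mistakes == 0:  # a clean pass over the data: converged
            break
    return w, b

def accuracy(predict, samples, labels):
    hits = sum(predict(x) == y for x, y in zip(samples, labels))
    return hits / len(samples)

rng = random.Random(0)
perm = make_random_filter(N, rng)

raw = list(itertools.product((0, 1), repeat=N))  # every possible binary input
labels = [x[0] for x in raw]                     # ground truth: raw feature 0
scrambled = [scramble(x, perm) for x in raw]

w, b = train_perceptron(scrambled, labels)

def learned(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

acc_proxy_raw = accuracy(hardcoded_proxy, raw, labels)
acc_proxy_scrambled = accuracy(hardcoded_proxy, scrambled, labels)
acc_learned_scrambled = accuracy(learned, scrambled, labels)

print(acc_proxy_raw, acc_proxy_scrambled, acc_learned_scrambled)
```

In this toy setup the hardcoded detector is perfect on raw inputs and at chance on scrambled ones, while the perceptron fits the scrambled inputs perfectly. The sketch sidesteps the harder question raised above: the learner only works because something else supplies it with ground-truth labels.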

The amygdala is a hodgepodge grouping of nuclei, some of which are “really” (embryologically & evolutionarily) part of the cortex, and the rest of which are “really” part of the striatum (ref). So if we’re going to say that the cortex and striatum are dedicated to running within-lifetime learning algorithms (which I do say), then we should expect the amygdala to be in that same category too.

By contrast, the superior colliculus (SC) is in the brainstem, and if you go far enough back, SC is supposedly a cousin of the part of the pre-vertebrate (e.g. amphioxus) nervous system that implements a simple “escape circuit” by triggering swimming when it detects a shadow. In other words, it is a part of the brain that triggers an innate reaction based on a “hardcoded” type of pattern in visual input. So it would make sense to say that the SC is still more-or-less doing those same types of calculations.
