Nathan Helm-Burger

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility. See my prediction market here: 


This is really interesting work, and I hope it will be of value in educating people about negotiation. I agree with many of the things you say here, and with many of your design decisions, but I think there are a few minor points where I see things differently.

I think it's valuable for player motivation to give players the ability to track their own performance between games. I think this could look like having a set of pre-designed maps, with a certain difficulty score assigned to each role. Then players could track how well they did against a difficulty score. Having deliberately unbalanced maps, with some easy roles and some hard roles, could be good for uneven matches (experienced player vs new player).

I think that being able to violate a contract (at some cost, perhaps reputational) when the motivation is sufficiently high gives some spice to the dynamics of negotiation. Also, seems like players should be able to mutually consent to dissolving a contract. So I think future versions should have some thought put into non-omnipotent contract enforcement mechanisms. Not letting any single player become so powerful that they can afford to ignore all contractual obligations is a pretty important lesson.

I'm not sure that having hidden agendas makes a negotiation game not work. What if you have a hidden agenda, but you are allowed to proclaim (but not prove) it? This adds a layer of complexity which might be undesirable, especially for less advanced players, since you have to track whether another player's claimed goals match up with their behavior. I think this is pretty key if you want to have an 'evil' player, who all the other players can only 'win' against by eliminating them. Like in the social Werewolf game, the 'evil' player would be motivated to claim not to be. In order to not get ganged up against, they'd have to try to manufacture situations in which it seemed justified for them to eliminate another player. Others would have to detect them by noticing the unnecessary and unprofitable conflict. This would teach the important real-world negotiation lesson of keeping an eye out for true values conflicts versus lying troublemakers with hidden agendas who should be punished. Then you'd have to negotiate things like, 'if I pay a cost to eliminate an agent of player X, who we suspect of being evil, then I want concession A as a reward.'

Or a partial information style of game, like Stratego. Maybe if you can prove someone violated a contract, you can call them out and they must pay the forfeit, but if you only suspect it then you must just be wary of them.

It could also be interesting to have rounds of the game with and without omnipotent contract enforcement, and with/without hidden agendas, and with/without the game being a partial-information game, so people could learn how each of those elements changes things. I think it would quickly become clear that a world with trustworthy contract enforcement is a less violent, more positive-sum world to live in.

I think it's worth thinking about what makes cooperative games fun, and what makes creative games like playing Minecraft or Lego with friends fun. What if you were playing Minecraft with other people in a very space-limited Minecraft world, and had to negotiate over resources? Being able to build what you wanted to build would require negotiation. I was trying to think through a design of such a game with the intention of using it as a testing ground for AI alignment. In my hazy undeveloped thoughts were mechanics like: blocks get harder to destroy every time they are placed (so the more times a specific block is moved, the harder it gets to move in the future), and players can tag a limited number of blocks as 'theirs', making it much harder for other people to destroy them. Imagine if each player had a different goal they were trying to achieve in a limited time, such as making the coolest waterfall or the coolest castle or treehouse or cave dwelling. When the time in a round ran out, the blocks would all get frozen and players could then take their 'final result' screenshot from whatever point of view they thought best showed off their work. Then some other group of players, playing their own version of the game, would vote on the anonymous submissions. Or maybe an ML model could be made to judge the competition. In any case, it seems like something that a Minecraft mod could encompass.

An interesting thing about this is that there wouldn't be an omniscient overview judge. So players could try to get away with sneakily stealing resources they'd agreed not to, and then face consequences if they were caught. So you could more easily trust agreements made about resources that were easier to monitor.

What if you catch a player violating a contract, and then they also decide to refuse to pay the agreed upon penalty? This sort of thing is a common cause of conflict escalation between nation-states. Would the other players come to the assistance of the wronged party, or choose to gang up against them or simply refuse to intervene since they'd have to pay a cost to intervene...

I think there's a lot of room for thought and game design here. 

I am an experienced data scientist and machine learning engineer with a background in neuroscience. In my previous job I had a senior position and led a team of people, and spent ~10k weekly on hundreds of model training runs on AWS and GCP using pipelines I wrote from scratch, which profitably guided the expenditure of hundreds of thousands of dollars daily in rapid real-time bidding. I've spent many years reading LessWrong and thinking about the alignment problem. I've been working full-time on independent AI alignment research for about a year and a half now. I got a small grant from the Long Term Future Fund to help support my transition to working on AI alignment. I took the AI safety fundamentals course, and found it enjoyable and helpful even though it turned out I had already read all the assigned readings. I read a lot of research papers, in the alignment field specifically and in ML and neuroscience generally. I'm friends with and talk regularly with employed AI alignment researchers. I've gone to 3 EAGs and had many interesting one-on-ones with people to discuss AI alignment and safety evals and governance, etc.

Over the past year and a half I've applied to and been rejected by many different alignment orgs and safety teams within capabilities orgs. I'm sick of trying to work entirely on my own with no colleagues; I work much better as part of a team. I've applied for more grant funding for independent research but I'm not happy about it. I'm considering trying to find a part-time mainstream ML job just so that I can have a team and work on building stuff productively again.

I'd love to start an org to pursue my alignment agenda, and feel like I have plenty of ideas to pursue to keep a handful of employees busy, and sufficient leadership experience to manage a team. 

Here's a recording of my recent talk at the Virtual AI Safety Unconference, in which I discuss some of my research ideas. This is only a small fraction of what I've been up to in the past couple of years.

If you find my research ideas intriguing, and might be interested in forming an org with me or interviewing me as a possible fit to work at your existing org, please reach out. You can message me here on LessWrong and I'll share my resume and email.

I certainly think that developing fundamentally more interpretable models from scratch is a wise path forward for humanity. I think you make some reasonable proposals for directions that could be pursued. There are quite a few researchers and groups working on a wide variety of directions for this sort of fundamentally more interpretable and controllable AI. For example: 

The downside is that it's almost certainly a slower path to power. If you don't simultaneously slow down all the other, more direct, paths to raw AI power then the slow paths become irrelevant. Like building a very safe campfire in the woods right next to someone building a huge dangerous bonfire. So then you get into the issue of worldwide monitoring and enforcement of AI R&D, which is not an easy problem to tackle. Another way of thinking about this is saying that pursuing safer but less straightforwardly powerful approaches is paying an 'alignment tax'. 

I am very much in favor of this approach by the way. I'm just really concerned about the feasibility and success likelihood of worldwide regulatory enforcement.

I don't think it's necessary to assume 'stopping the aging process entirely'. I think you can say something like 'slow down the aging process enough that people the age of the author of this post don't die of old age before the next level of tech comes along' (e.g. gradual replacement of failing body parts with cybernetic ones, and eventually uploading to an immortal digital version).

If you think the coming of AGI is inevitable, but you think that surviving AGI is hard and you might be able to help with it, then you should do everything you can to make the transition to a safe AGI future go well. Including possibly sacrificing your own life, if you value the lives of your loved ones in aggregate more than your life alone. In a sense, working hard to make AGI go well is 'risk aversion' on a society-wide basis, but I'd call the attitude of the agentic actors in this scenario more one of 'ambition maximizing' rather than 'personal risk aversion'.

Time between surviving AGI and solving aging

I model this as an exponential distribution with a mean time of 5 years.
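
As a rough illustration of what the quoted model implies, here is a minimal Python sketch. It assumes only what the quote states (an exponential distribution with a mean of 5 years); the function names and the horizons printed are hypothetical choices made for this example, not something taken from the original post.

```python
import math

# Mean of the exponential distribution, as stated in the quoted model.
MEAN_YEARS = 5.0


def prob_not_solved_by(years: float, mean_years: float = MEAN_YEARS) -> float:
    """Return P(T > years) for T ~ Exponential with the given mean.

    For an exponential distribution the survival function is exp(-t / mean).
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return math.exp(-years / mean_years)


def median_years(mean_years: float = MEAN_YEARS) -> float:
    """Return the median of the distribution, which is mean * ln(2)."""
    return mean_years * math.log(2)


if __name__ == "__main__":
    print(f"median wait: {median_years():.2f} years")
    # Illustrative horizons only; they are not taken from the quoted post.
    for horizon in (5, 10, 20, 30):
        p = prob_not_solved_by(horizon)
        print(f"{horizon:>2} years: P(aging not yet solved) = {p:.4f}")
```

Under this assumption, long waits get very little probability mass, which is worth keeping in mind when considering scenarios where the gap is much longer than the mean.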


I just want to point out that one of the big hopes for initial survival of AGI is a worldwide regulatory delay in 'cranking it up'. Thus, AGI could exist in containment in labs but not be used at 'full power' on any practical projects like solving aging or digitizing humans. In this scenario, I'm not sure whether you would count this as AGI not invented yet, since it's not truly in use, or whether you'd call this invented but survival still unclear, since if it got loose we'd all die, or what. Basically, I want to bring up the possibility of a 'fuzzy post-invention time full of looming danger, spanning perhaps 20-30 years.'

That does seem like the right threshold under any reasonable meaning of catastrophic, so long as it is understood that once found no patch can address the issue. The next level of models may or may not reach ASL-3. My guess is a 4.5-level model mostly wouldn’t count, a 5-level model mostly would count.


My current not-strongly-held guess is that something like 'GPT-5' won't be ASL-3, but will be close and will be above a threshold where it will be significantly useful in producing the next generation. Something I'd be willing to call a weak early form of recursive self-improvement. Thus, I would expect the next generation, 'GPT-6', to be clearly into ASL-3 and able to do quite a bit of useful R&D work. Thus, I would expect a 'GPT-7' to be ASL-4. If each of these steps takes 1.5 years, that's really not long until ASL-4.

See my related prediction market here: 

(market makes a slightly stronger case than stated here, in order to be controversial enough to be interesting)

Hephaistos Fnord (reply to OP): Cassandra is a dig bc being believed is more prestigious than being right. If you’re right and no one believes you, you’re still a loser. And even if everyone dies because they didn’t believe you, you still die a loser.


I feel personally attacked by this relatable content. How dare Hephaistos just come out and tell the world the key content of my recurring nightmares?

Yes, that is the crux of the issue. The BATNA is the theoretically correct answer, but hard to discover/elicit accurately.

Possible counterpoint: people aren't as unique as we like to imagine in order to flatter ourselves. I worked on machine learning for ad targeting for several years, and my takeaway is that the majority of people (when looking at the scale of many millions of people) fall into groups. It's not hard to pattern match a person to a behavioral group, and then use a strategy tailored to that specific group in order to manipulate the individual. So you'd need not only you, but everyone sufficiently similar to you, to have intense privacy standards. If that's the case, it just isn't feasible. We need to focus more on the fact that creating the manipulative AI agent at all is a dangerous idea, and less on trying to narrowly protect ourselves. Even protecting yourself and your in-group doesn't help much if the majority of society gets powerfully manipulated and becomes a tool of the puppet-master AI.
