Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Two of my favorite memes: [images not shown] (by Rob Wiblin)

My EA Journey, depicted on the whiteboard at CLR: [image not shown]

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Comments

Thanks for this comment. I'd be generally interested to hear more about how one could get to 20% doom (or less).

The list you give above is cool but doesn't do it for me; going down the list I'd guess something like:
1. 20% likely (honesty seems like the best bet to me), because we have so little time left; but even if it happens we aren't out of the woods yet, because there are various plausible ways we could screw things up. So maybe overall this is where 1/3 of my hope comes from.
2. 5% likely? Would want to think about this more. I could imagine myself being very wrong here actually, I haven't thought about it enough. But it sure does sound like wishful thinking.
3. This is already happening to some extent, but the question is, will it happen enough? My overall "humans coordinate to not build the dangerous kinds of AI for several years, long enough to figure out how to end the acute risk period" is where most of my hope comes from. I guess it's the remaining 2/3rds basically. So, I guess I can say 20% likely.
4. What does this mean?

I would be much more optimistic if I thought timelines were longer.
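
For concreteness, here's a tiny back-of-the-envelope sketch of how sources of hope like these might combine into an overall estimate. The two 20% figures are the ones stated above; the conditional "no later screwup" number is purely an illustrative assumption.

```python
# Back-of-the-envelope combination of the sources of hope above.
# The two 0.20 values are taken from items 1 and 3; everything else
# is an illustrative assumption, not a figure from the comment.

p_honesty_path = 0.20        # item 1: honesty-based alignment works out (stated)
p_no_later_screwup = 0.50    # assumed: conditional on that, we also avoid later failure modes
p_coordination_path = 0.20   # item 3: coordination buys enough time (stated)

# Treat the two paths as roughly independent routes to a good outcome.
p_good = 1 - (1 - p_honesty_path * p_no_later_screwup) * (1 - p_coordination_path)
print(f"P(good outcome) ~ {p_good:.2f}, P(doom) ~ {1 - p_good:.2f}")
# With these made-up inputs: ~0.28 good / ~0.72 doom, with the honesty path
# contributing roughly a third of the hope (0.10 of 0.28) and coordination
# contributing the rest -- consistent with the split described above.
```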

Thanks, & thanks for putting in your own perspective here. I sympathize with that too; fwiw Vladimir_Nesov's answer would have satisfied me, because I am sufficiently familiar with what the terms mean. But for someone new to those terms, they are just unexplained jargon, with links to lots of lengthy but difficult-to-understand writing. (I agree with Richard's comment nearby.) Like, I don't think Vladimir did anything wrong by giving a jargon-heavy, links-heavy answer instead of saying something like "It may be hard to construct a utility function that supports the latter but rejects the former, but if instead of utility maximization we are doing something like utility-maximization-subject-to-deontological-constraints, it's easy: just have a constraint that you shouldn't harm sentient beings. This constraint doesn't require you to produce more sentient beings, or squeeze existing ones into optimized shapes." But I predict that this blowup wouldn't have happened if he had instead said that.

I may be misinterpreting things of course, wading in here thinking I can grok what either side was thinking. Open to being corrected!

Here's where I think the conversation went off the rails. :( I think what happened is M.Y.Zuo's bullshit/woo detector went off, and they started asking pointed questions about the credentials of Critch and his ideas. Vlad and LW more generally react allergically to arguments from authority/status, so downvoted M.Y.Zuo for making this about Critch's authority instead of about the quality of his arguments.

Personally I feel like this was all a tragic misunderstanding, but I generally side with M.Y.Zuo here -- I like Critch a lot as a person & I think he's really smart, but his ideas here are far from rigorous, clear argumentation as far as I can tell (I've read them all and still came away confused, which of course could be my fault, but still...), so I think M.Y.Zuo's bullshit/woo detector was well-functioning.

That said, I'd advise M.Y.Zuo to instead say something like "Hmm, a brief skim of those posts leaves me confused and skeptical, and a brief google makes it seem like this is just Critch's opinion rather than something I should trust on authority. Got any better arguments to show me? If not, cool, we can part ways in peace having different opinions."

I was one of the people who upvoted but disagreed -- I think it's a good point you raise, M. Y. Zuo, that So8res' qualifications blunt the blow and give people an out, a handy rationalization to justify continuing working on capabilities. However, there's still a non-zero (and I'd argue substantial) effect remaining.

Makes sense. I had basically decided by 2021 that those good futures (1) and (2) were very unlikely, so yeah.

Whereas my timelines views are extremely well thought through (relative to most people, that is), I feel much more uncertain and unstable about p(doom). That said, here's why I updated:

Hinton and Bengio have come out as worried about AGI x-risk; the FLI letter and Yudkowsky's tour of podcasts, while incompetently executed, have been better received by the general public and elites than I expected; the big labs (especially OpenAI) have reiterated that superintelligent AGI is a thing, that it might come soon, that it might kill everyone, and that regulation is needed; internally, OpenAI at least has pushed more for focus on these big issues as well. Oh and there's been some cool progress in interpretability & alignment which doesn't come close to solving the problem on its own but makes me optimistic that we aren't barking up the wrong trees / completely hitting a wall. (I'm thinking about e.g. the cheese vector and activation vector stuff and the discovering latent knowledge stuff)

As for capabilities, yes it's bad that tons of people are now experimenting with AutoGPT and making their own LLM startups, and it's bad that Google DeepMind is apparently doing some AGI mega-project, but... those things were already priced in, by me at least. I fully expected the other big corporations to 'wake up' at some point and start racing hard, and the capabilities we've seen so far are pretty much exactly on trend for my What 2026 Looks Like scenario which involved AI takeover in 2027 and singularity in 2028.

Basically, I feel like we are on track to rule out one of the possible bad futures (the one in which the big corporations circle the wagons and say AGI is Safe, there is No Evidence of Danger, and the AI x-risk people are Crazy Fanatics, and the government buys their story long enough for it to be too late). Now unfortunately the most likely bad future remains, in which the government does implement some regulation intended to fix the problem, but it fails to fix the problem & fails to buy us any significant amount of time before the dangerous sorts of AGI are built and deployed (e.g. because it gets watered down by tech companies averse to abandoning profitable products and lines of research, or because racing with China causes everyone to go 'well actually' when the time comes to slow down and change course).

Meanwhile one of the good futures (in which the regulation is good and succeeds in preventing people from building the bad kinds of AGI for years, buying us time in which to do more alignment, interpretability, and governance work, and for the world to generally get more awareness and focus on the problems) is looking somewhat more likely.

So I still think we are on a default path to doom but one of the plausible bad futures seems less likely and one of the plausible good futures seems more likely. So yeah.

I'd be curious to hear more about this "contributes significantly in expectation" bit. Like, suppose I have some plan that (if it doesn't work) burns timelines by X, but (if it does work) gets us 10% of the way towards aligned AGI (e.g. ~10 plans like this succeeding would suffice to achieve aligned AGI), and moreover there's a 20% chance that this plan actually buys time by providing legible evidence of danger to regulators, who are then more likely to regulate and more likely to make the regulation actually useful instead of harmful. So we have these three paths to impact (one negative, two positive), and I'm trying to balance the overall considerations. I suppose you'd say (a) do the math and see what it says, and (b) be vigilant against rationalization / wishful thinking biasing your math towards saying the benefits outweigh the costs. Is that right? Anything else you want to say here?

(A concrete example here might be ARC Evals' research, which may have inadvertently burned timelines a bit by inspiring the authors of AutoGPT who read the GPT-4 system card; but iiuc lots of people, like the Langchain folks, were doing stuff like that anyway, so it probably didn't make more than a few weeks' difference, and meanwhile the various beneficial effects of their evals work seem quite strong.)

(Perhaps a useful prompt would be: Do you think it's useful to distinguish between capabilities research and research-which-has-a-byproduct-of-giving-people-capabilities-ideas? Why or why not?)
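
To make the "do the math" step above concrete, here is a minimal sketch of the kind of expected-value bookkeeping I have in mind. Apart from the 20% regulatory-win chance and the 10%-of-the-way figure stated above, every number (failure probability, timeline cost, time bought) is a placeholder assumption.

```python
# Minimal expected-value sketch for the three paths to impact described above.
# The 0.2 regulatory-win chance and 0.10 alignment contribution come from the
# comment; everything else is a placeholder assumption for illustration.

p_fail = 0.5                   # assumed: chance the plan doesn't work
timeline_burn_if_fail = -0.1   # assumed: years of timeline burned if it fails ("X")
p_success = 1 - p_fail         # chance the plan works
alignment_progress = 0.10      # stated: gets us ~10% of the way to aligned AGI
p_regulatory_win = 0.2         # stated: chance it buys time via legible evidence of danger
time_bought_if_win = 1.0       # assumed: years of extra time in that branch

expected_timeline_change = p_fail * timeline_burn_if_fail + p_regulatory_win * time_bought_if_win
expected_alignment_progress = p_success * alignment_progress

print(f"Expected timeline change:    {expected_timeline_change:+.2f} years")
print(f"Expected alignment progress: {expected_alignment_progress:.2f} (fraction of the job)")
# Whether this nets out positive then hinges on how you trade alignment progress
# against timeline -- which is exactly where the anti-wishful-thinking vigilance
# from (b) has to do its work.
```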

Great post, will buy the book and take a look!

I vaguely recall reading somewhere that some California canvassing experiment to promote gay rights either didn't replicate or turned out to be outright fraud. Wish I could remember the details. Hopefully it wasn't the experiment you are talking about?

I agree that logodds space is the right way to think about how close probabilities are. However, my epistemic situation right now is basically this:

"It sure seems like Doom is more likely than Safety, for a bunch of reasons. However, I feel sufficiently uncertain about stuff, and humble, that I don't want to say e.g. 99% chance of doom, or even 90%. I can in fact imagine things being OK, in a couple different ways, even if those ways seem unlikely to me. ... OK, now if I imagine someone having the flipped perspective, and thinking that things being OK is more likely than doom, but being humble and thinking that they should assign at least 10% credence (but less than 20%) to doom... I'd be like "what are you smoking? What world are you living in, where it seems like things will be fine by default but there are a few unlikely ways things could go badly, instead of a world where it seems like things will go badly by default but there are a few unlikely ways things could go well? I mean I can see how you'd think this is you weren't aware of how short timelines to ASI are, or if you hadn't thought much about the alignment problem..."

If you think this is unreasonable, I'd be interested to hear it! 
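
To illustrate the log-odds point with the actual numbers in play (a quick sketch, nothing more):

```python
import math

def log_odds(p):
    """Natural-log odds (logits) of a probability."""
    return math.log(p / (1 - p))

for p in (0.10, 0.20, 0.30, 0.50, 0.70, 0.90):
    print(f"P = {p:.2f}  ->  log-odds = {log_odds(p):+.2f}")

# The gap between 70% and 90% is the same size as the gap between 10% and 30%
# (about 1.35 natural-log units), even though the raw probability differences
# feel different. On this scale, ~70% doom and the hypothetical 10-20% doom
# view are roughly 2 to 3 log-odds units apart -- a genuinely large disagreement,
# which is the point of the paragraph above.
```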

 

Thanks for this post! I definitely disagree with you about point I (I think AI doom is 70% likely and I think people who think it is less than, say, 20% are being very unreasonable) but I appreciate the feedback and constructive criticism, especially section III.

If you ever want to chat sometime (e.g. in a comment thread, or in a video call) I'd be happy to. If you are especially interested I can reply here to your object-level arguments in section I. I guess a lightning version would be "My arguments for doom don't depend on nanotech or anything possibly-impossible like that, only on things that seem clearly possible like ordinary persuasion, hacking, engineering, warfare, etc. As for what values ASI agents would have, indeed, they could end up just wanting to get low loss or even delete themselves or something like that. But if we are training them to complete ambitious tasks in the real world (and especially, if we are training them to have ambitious aligned goals like promoting human flourishing and avoiding long-term bad consequences), they'll probably develop ambitious goals, and even if they don't, that only buys us a little bit of time before someone creates one that does have ambitious goals. Finally, even goals that seem very unambitious can really become ambitious goals when a superintelligence has them, for galaxy-brained reasons which I can explain if you like. As for what happens after unaligned ASI takes over the world -- agreed, it's plausible they won't kill us. But I think it's safe to say that unaligned ASI taking over the world would be very bad in expectation and we should work hard to avoid it."
