Recent Discussion


This post is meant to be a linkable resource.  It contains a short list of guidelines that are meant to be fairly straightforward and uncontroversial, for the purpose of nurturing and strengthening a culture of clear thinking, clear communication, and collaborative truth-seeking.

"Alas," said Dumbledore, "we all know that what should be, and what is, are two different things.  Thank you for keeping this in mind."

There is also (for those who want to read past the simple list) substantial expansion/clarification of each specific guideline, along with justification for the overall philosophy behind the set.

On Shorthand

Once someone has a deep, rich understanding of a complex topic, they are often able to refer to that topic with short, simple sentences that correctly convey the intended meaning to other people with...

For starters, this post is long, and I have a lot of critique I'd like to write that I'd hope to see edited in. However, this post has been posted to a blogging platform, not a wiki platform; it is difficult to propose simplifying refactors for a post. I've downvoted for now, and I think I'm not the only one downvoting; I would be curious to hear reasons for downvotes from others and what would reverse them. It would be cool if LessWrong were suddenly a wiki with editing features and fediverse publishing. You mention you want to edit; I'm looking forward to those edits, and hoping to upvote once the post has been edited a bit.

Unrelated: did you know LessWrong has a "hide author until hovered" feature that for some reason isn't on by default with an explanation? :D

If you think this post would be stronger with more real-world examples of each guideline (either failures to follow it, or stellar examples of abiding by it), then please keep your radar open for the next few days or weeks, and send me memorable examples. I am not yet sure what I'll do with those, but having them crowdsourced and available is better than not having them at all, or trying to collect them all myself. Also: I anticipate substantial improvements to the expansions over time, as well as modest improvements to the wording of each of the short/relatively pithy expressions of the guidelines. I'm super interested in those. I'm less expecting to be convinced by a bid to completely nix one of these, add a brand-new one, or swap one, but that wouldn't shock me, so feel free to make those, too.

This is a lightly edited transcript of a chatroom conversation between Scott Alexander and Eliezer Yudkowsky last year, following up on the Late 2021 MIRI Conversations. Questions discussed include "How hard is it to get the right goals into AGI systems?" and "In what contexts do AI systems exhibit 'consequentialism'?".


1. Analogies to human moral development


@ScottAlexander ready when you are


Okay, how do you want to do this?


If you have an agenda of Things To Ask, you can follow it; otherwise I can start by posing a probing question or you can?

We've been very much winging it on these and that has worked... as well as you have seen it working!


Okay. I'll post from my agenda. I'm assuming we both have the right to edit logs


Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.

4 Ben Pace 39m
That makes more sense.
9 Chris Leong 1h
I wish Eliezer had been clearer on why we can’t produce an AI that internalises human morality with gradient descent. I agree gradient descent is not the same as a combination of evolutionary learning + within-lifetime learning, but it wasn’t clear to me why this meant that no combination of training schedule and/or bias could produce something similar.
There is a disconnect with this question. I think Scott is asking “Supposing an AI engineer could create something that was effectively a copy of a human brain and the same training data, then could this thing learn the ‘don’t steal’ instinct over the ‘don’t get caught’ instinct?” Eliezer is answering “Is an AI engineer able to create a copy of the human brain, provide it with the same training data a human got, and get the ‘don’t steal’ instinct?”

Truffles are pretty great: gooey fat-softened flavored chocolate inside, protected by a shell of hard chocolate so you don't get your fingers messy. Sometimes I'll make them with the kids:

This is a good amount of work: you need to prepare the filling, chill, shape little spheres, chill, and then dip them. Worth it for making presents or sharing, but not if I'm making something for myself. Instead I just make a jar of ganache:

I'll eat it with a spoon over a week or two, 1-3oz after lunch each day.

At its simplest it's chocolate and fat. I normally use Callebaut semi-sweet which I can get in bulk for $3.42/lb ($0.21/oz):

I'll usually use butter and cream, though coconut cream works well if you want to make it vegan. Often I'll mix in some...

We had some discussions of the AGI ruin arguments within the DeepMind alignment team to clarify for ourselves which of these arguments we are most concerned about and what the implications are for our work. This post summarizes the opinions of a subset of the alignment team on these arguments. Disclaimer: these are our own opinions that do not represent the views of DeepMind as a whole or its broader community of safety researchers.

This doc shows opinions and comments from 8 people on the alignment team (without attribution). For each section of the list, we show a table summarizing agreement / disagreement with the arguments in that section (the tables can be found in this sheet). Each row is sorted from Agree to Disagree, so a column does not...

1 Adrien Chauvet 3h
#4. Can't cooperate to avoid AGI
Maybe we can. This is how the Montreal Protocol came about: scientists discovered that chlorofluorocarbons were bad for the ozone. Governments believed them, then the Montreal Protocol was signed, and CFC use fell by 99.7%, leading to the stabilization of the ozone layer, perhaps the greatest example of global cooperation in history. It took around 15 years from the time scientists discovered that chlorofluorocarbons were causing a major problem to the time the Montreal Protocol was adopted. How can scientists convince the world to cooperate on AGI alignment in less time?

They haven't managed to do it so far for climate change, which has received massively more attention than AGI. I have often seen this example used to argue that we can indeed succeed at coordinating on major challenges, but I think this case is misleading: CFCs never played a major role in the economy and they were easily replaceable, so banning them was not such a significant step.

[Neurotic status: Until recently, I've felt entirely comfortable telling friends, family, and co-workers that I'm a rationalist (or an aspiring rationalist)—that it wouldn't out-group me among them, or if it did, it was probably worth being out-grouped over. I will happily continue to use Bayesian reasoning, but I can't rep the label "IRL" right now. The distal reasons for this change are (1) I believe rationalism and effective altruism as movements are losing their ability to self-correct, (2) I'm not in a position to help fix (1). I hope this is temporary or that I'm wrong. I started writing last week, and there may have been developments since then that I missed. I love you all like a milkshake.]

I avoided using easily searchable terms below to keep...

It feels like there is something interesting in this post, but I found it scattered and meandering to the point of unreadability, sadly.

I too tried to read this post and couldn't figure out what its point was most of the time.

IMO, the series of crises in the EA movement is at least partly related to the fact that the EA movement is obviously having issues. With that stated, I think EA probably can't do well for a very basic reason: the idea that you can improve or harm society's preferences is mistaken, because there are no preferences of society, since society doesn't exist. Thus I think EA is pursuing an impossible goal. This is explored in the post on Thatcher's Axiom. IMO, the rationality project/LW is handling these crises far better than EA is.

This weekend I was in San Luis Obispo for a gig, about halfway between SF and LA. It's possible to fly into SBP, but since I was traveling with two of my kids it was a lot cheaper to fly into SFO and drive down, and only slightly slower.

I'm signed up with Hertz's reward program ("Gold") and one of the benefits is that you pick out your own car. When I got to SFO, there were several Tesla Model 3s in the "Gold" area. This was somewhat surprising—I had only paid for a small sedan ("B") and the Teslas are fancy ("E7")—but it seemed like it would be interesting to try one out and I liked the idea of not paying for gas. I got us loaded up but when I got to the exit...

The attack surface differs enormously from car to car.

Let’s say I know how to build / train a human-level (more specifically, John von Neumann level) AGI. And let’s say that we (and/or the AGI itself) have already spent a few years[1] on making the algorithm work better and more efficiently.

Question: How much compute will it take to run this AGI?

(NB: I said "running" an AGI, not training / programming an AGI. I'll talk a bit about “training compute” at the very end.)

Answer: I don’t know. But that doesn’t seem to be stopping me from writing this post. ¯\_(ツ)_/¯ My current feeling—which I can easily imagine changing after discussion (which is a major reason I'm writing this!)—seems to be:

  • 75%: One current (Jan 2023) high-end retail gaming PC (with an Nvidia GeForce RTX 4090 GPU) will be enough (or
9000 books with one megabyte of information each. Note that he memorized whole telephone books that way, and tests have shown that he has >95% recall.

Still mulling this over. I may end up revising the post and/or writing a follow-up. :)

At the end of the second and final round of the Inverse Scaling Prize, we’re awarding 7 more Third Prizes. The Prize aimed to identify important tasks on which language models (LMs) perform worse the larger they are (“inverse scaling”). Inverse scaling may reveal cases where LM training actively encourages behaviors that are misaligned with human preferences. The contest started on June 27th and concluded on October 27th, 2022 – thanks to everyone who participated! Across the two rounds, we had over 80 unique submissions and gave out a total of 11 Third Prizes. 

We are also accepting updates to two previous prize-winners (quote-repetition and redefine-math). For more details on the first round winners, see the Round 1 Announcement Post.

We didn't find the kind of robust, major long-term-relevant problems that...

Maybe we need to start using prompts like "This is not a trick question; just take it step by step:"!

Incidentally, looks like understanding multi-step legal criteria might be a case of U-shaped scaling too: "Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards", Nay 2023 finds that understanding whether someone has a fiduciary legal obligation goes from 27% (Curie) → 50% (random baseline) → 73% (text-davinci-002) → 78% (text-davinci-003), so presumably there's a smaller model-size which outperforms Curie by random guessing, giving a U-curve from random smol to bad Curie to great davinci.
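To make the U-curve conjecture above concrete, here is a minimal sketch of a check for that shape. The function name `is_u_shaped` is hypothetical, and the 0.50 entry stands in for an assumed smaller model that guesses at random; only the Curie and davinci figures come from the numbers quoted above.

```python
def is_u_shaped(accuracies):
    """Return True if accuracy strictly falls to one interior minimum, then strictly rises."""
    lowest = accuracies.index(min(accuracies))
    # A U needs the minimum strictly inside the sequence, not at either end.
    if lowest == 0 or lowest == len(accuracies) - 1:
        return False
    falling = all(a > b for a, b in zip(accuracies[:lowest + 1], accuracies[1:lowest + 1]))
    rising = all(a < b for a, b in zip(accuracies[lowest:], accuracies[lowest + 1:]))
    return falling and rising


# Ordered by model size. The first entry is an ASSUMPTION: a hypothetical
# smaller model that sits at the 50% random baseline.
with_small_model = [0.50, 0.27, 0.73, 0.78]
# Only the three models actually reported in the quoted figures.
reported_only = [0.27, 0.73, 0.78]

print(is_u_shaped(with_small_model))
print(is_u_shaped(reported_only))
```

On the reported models alone the curve is simply increasing; the U only appears once the assumed random-guessing small model is included, which is exactly the "presumably" in the comment.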
3 Victor Levoso 8h
I want to note that a lot of the behaviors found in the Inverse Scaling Prize do in fact disappear by just adding "Let's think step by step". I already tested this a bit a few months ago in Apart Research's hackathon along with a few other people, and might try to do it more rigorously for all the entries now that all of the winners have been announced (plus I was procrastinating on it and this is a good point to actually get around to doing it). Another thing to note is that ChatGPT shows the same behaviour and answers in a more detailed way. Given that, I feel like if someone was going to take models failing at modus ponens as evidence of the "Clever Hans" hypothesis, they should not only undo that update but also update in the other direction by casting doubt on whatever they thought was an example of an LLM not being able to do something.
I was wondering whether to comment on how to take a demonstration of inner-monologue or alternate prompting approaches solving the problems... There's definitely a bunch of different ways you can interpret that outcome. After all, even if you solve it with a better prompt, the fact remains that they demonstrated inverse scaling on the original prompt. So what does that mean? I guess that depends on what you thought inverse scaling was. One way is to take the inverse-scaling as a sub-category of hidden scaling: it 'really' was scaling, and your 'bad prompts' just masked the hidden scaling; it had the capability and 'sampling can show the presence of knowledge but not the absence', and the Contest has been useful primarily in experimentally demonstrating that skilled ML professionals can be hoodwinked into severely underestimating the capabilities of powerful DL models, which has obvious AI safety implications.

I think it's also not obvious how it solves the problem, whether it's about the model only being capable of doing the reasoning required using multiple steps (though why the inverse scaling, then?) or something more like writing an explanation makes the model more likely to use the right kind of reasoning.

And inside of that second option there are a lot of ways that could work internally, whether it's about distributions of kinds of humans it predicts, or something more like different circuits being activated in different contexts in a way that doesn't have to ...

On January 26, 2023, NIST released the AI Risk Management Framework (AI RMF 1.0) along with a companion NIST AI RMF Playbook, AI RMF Explainer Video, an AI RMF Roadmap, AI RMF Crosswalk, and various Perspectives. Watch the event here.

In collaboration with the private and public sectors, NIST has developed a framework to better manage risks to individuals, organizations, and society associated with artificial intelligence (AI). The NIST AI Risk Management Framework (AI RMF) is intended for voluntary use and to improve the ability to incorporate trustworthiness considerations into the design, development, use, and evaluation of AI products, services, and systems.

Released on January 26, 2023, the Framework was developed through a consensus-driven, open, transparent, and collaborative process that included a Request for Information, several draft versions for public comments, multiple workshops, and


Does it say anything about the real risks from AI? (I have not clicked the links; the text above did not indicate to me one way or the other.)

1 Evan R. Murphy 11h
Been in the works for a while. Good to know it's officially out, thanks.