All of Chris_Leong's Comments + Replies

This is very tricky. On one hand, this may actually Streisand effect these results to greater prominence. On the other hand, at the point where people were specifically working around this to gain access to log-in gated LW resources, this would probably enhance our community status/prestige which might actually increase our influence.

Absolutely fascinating! Thanks for posting! This gave me a new lens onto Nietzsche.

You apparently completely misunderstood the point we were making with the white box thing.


I think you need to taboo the term white box and come up with a new term that will result in less confusion/fewer people talking past each other.

the gears to ascension · 3d
We really need to replace "taboo", it carries far too many misleading implications

Some thoughts since you ask for feedback elsewhere:

1) Part of the reason why this post is likely being downvoted is the clickbait title. That is not looked upon favorably on Less Wrong.
2) You make some pretty good points in this post, but you state them far too confidently, almost as if this were a mathematical proof. If you want your posts to do well on Less Wrong, try not to make strong claims without correspondingly strong evidence.

Gabriel makes a very good point: there is something of a tension between allowing reign of terror moderation and considering it a norm violation to request the deletion of comments for low quality.

(TBH, I was convinced that reign of terror would be a disaster, but it seems to be working out okay so far).

What do you think the 101 space should look like for Less Wrong?

I imagine an explanation that gets progressively longer. One-paragraph summary. One-screen explanation. One article containing the gist of the Sequences (with links to Read the Sequences), a brief history of the rationalist movement, frequently used concepts (with links to explanations), maybe a list of frequent myths.

Technically, all three could be at the same URL.

Short answer: I think there should be more than one of them, but the pinned Open Threads are pretty good when combined with the New Users Guide and the Sequence Highlights.

There's an interesting problem in that the people who most need the entry-level information are the people who least know how to find it, and the least likely to be used to whatever idiom the information is in. "Look in the archives for the last time we discussed this issue" is a lovely theory, but finding something new in the archives of a new place is harder than the locals think it is. Because of this issue, I think every point where someone new might show up (comments on a post, an ACX Everywhere meetup, the replies to one of Yudkowsky's tweets, etc.) would ideally be able to point the newcomer to a back and forth with someone patient and interested in helping. To be clear, that's a fond wish of my heart, not something I think is practical to pull off.

Personally, I spend more time and energy on in-person meetups than on online spaces. In that context, making these spaces means having meetups that discuss things I'm already pretty sure the regulars know. I try to come up with ways to make going over the basics again interesting and workable for multiple knowledge levels, but at the least, any topic that hasn't had a meetup in the last year is a topic I shouldn't try to build on. I make myself obvious and visible as the organizer and try to notice when someone is new or looks confused. Most of my meetups are Rationality 101 spaces, and that's a deliberate tradeoff I make knowing that there are other organizers in my city who can try to push a bit deeper. Then again, I'm also the guy who wrote Write A Thousand Roads To Rome.

If someone says they prefer youtube videos, I point them at Robert Miles and Rational Animations. If someone says they prefer spicy blog posts I point them at specific SlateStarCodex posts. If someone says they want to talk to someone in person, I try and point them at a

E/acc seems to be really fired up about this:

It is really hard to use social media to measure public opinion. Twitter/X doesn't have nearly as much security or influence-detection capability as Facebook/Instagram, so botnet accounts run by state-adjacent agencies can still game Twitter's newsfeed algorithm for human users by emulating human behavior and upvoting specific posts. Social media has never been an environment that is friendly to independent researchers; even if it were easy, foreign intelligence agencies would run circles around independent researchers in researching advanced strategies to manipulate public opinion (e.g. via their own social media botnets, or merely knowing what to say when their leaders give speeches). But yes, E/acc seems to be really fired up about this.

I read their other comments and I'm skeptical. The tone is wrong.

Ben Pace · 17d
It read like propaganda to me, whether the person works at the company or not.

AI alignment certification and peacebuilding seem like two very different and distinct projects. I'd strongly suggest picking one.

Agreed. You'll bifurcate the mission and end up doing both things worse than you would have done if you'd just picked one and focused.

I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.

I also notice that I am just afraid of what would happen if I were to e.g. write a post that's just like "an overview over the EA-ish/X-risk-ish policy landscape" that names specific people and explains various historical plans. Like I expect it would make me a lot of enemies.

This seems like a bad idea.

Transparency is important, but ideally, we would find ways to increase this without blowing up a bunch of trust within the community. I guess I'd question whether this is really the bottleneck in terms of transparency/public trust.

I'm worried that as a response to FTX we might end up turning this into a much more adversarial space.

I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source." 

The problem with this plan is that it assumes that there are easy ways to robustify the world. What if the only proper defense against bioweapons is a complete monitoring of the entire internet? Perhaps this is something that we'd li... (read more)

For the record, I updated on ChatGPT. I think that the classic example of imagining telling an AI to get a coffee and it pushes a kid out of the way isn't so much of a concern any more. So the remaining concerns seem to be inner alignment + outer alignment far outside normal human experience + value lock-in.

Thanks for highlighting the issue with the discourse here. People use the word evidence in two different ways which often results in people talking past one another.

I'm using your broader definition, where I imagine that Stella is excluding things that don't meet a more stringent standard.

And my claim is that reasoning under uncertainty sometimes means making decisions based on evidence[1] that is weaker than we'd like.

  1. ^

    Broad definition

Oh, I don't think it actually would end up being temporary, because I expect with high probability that the empirical results of more robust evaluations would confirm that open-source AI is indeed dangerous. I meant temporary in the sense that the initial restrictions might either a) have a time limit, or b) be subject to re-evaluation at a specified point.

Define evidence.

I'm not asking this just to be pedantic, but because I think it'll make the answer to your objection clearer.

Evidence for X is when you see something that's more likely in a world with X than in a world with some other condition not X. Generally substantially more likely; for good reason many people only use "evidence" to mean "reasonably strong evidence."
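To make the odds-form update behind this definition concrete, here's a minimal sketch (my own illustration with hypothetical numbers, not from the thread):

```python
def posterior_odds(prior_odds, p_obs_given_x, p_obs_given_not_x):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * (p_obs_given_x / p_obs_given_not_x)

# An observation twice as likely under X as under not-X doubles the odds:
# evidence, but weak-ish. "Reasonably strong" evidence means a much
# larger likelihood ratio.
weak = posterior_odds(1.0, 0.5, 0.25)       # odds 1:1 -> 2:1
strong = posterior_odds(1.0, 0.5, 0.03125)  # odds 1:1 -> 16:1
print(weak, strong)
```

The point of the odds form is that "evidence for X" is exactly a likelihood ratio above 1; how far above 1 it is determines whether colloquial usage would call it evidence at all.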
  1. Thanks for sharing your concerns and helping us be more calibrated on the value of this study.
  2. I agree that a control group is vital for good science. Nonetheless, I think that such an experiment is valuable and informative, even if it doesn't meet the high standards required by many professional science disciplines.
  3. I believe in the necessity of acting under uncertainty. Even with its flaws, this study is sufficient evidence for us to want to enact temporary regulation at the same time as we work to provide more robust evaluations.
  4. The biggest critique for m
... (read more)
But... this study doesn't provide evidence that LLMs increase bioweapon risk.

Even with its flaws, this study is sufficient evidence for us to want to enact temporary regulation at the same time as we work to provide more robust evaluations.

Note that if I thought regulations would be temporary, or had a chance of loosening over time after evals found that the risks from models at compute size X would not be catastrophic, I would be much less worried about all the things I'm worried about re: open source, power, and banning open source

But I just don't think that most regulations will be temporary. A large number of people... (read more)

I don't know if you need professionals, just a bunch of people who are interested in discussing the topic. It wouldn't need to use the Alignment Forum's invite-only system.

Instead, it would just be a way to allow LW to cater to both audiences at the same time.

I'm still pretty skeptical of what would happen without explicit focus. The Bletchley Park declaration was a super vague, applause-lighty declaration, which fortunately mentions issues of control, but just barely. It's not clear to me yet that this will end up receiving much dedicated focus.

Regarding biosecurity and cyber, my big worry here is open-source and it seems totally plausible that a government will pass mostly sensible regulation, then create a massive gaping hole where open-source regulation should be.

It’s also plausible that the US government will ban or seriously restrict open source frontier models. Section 4.6 of the EO requests a government report on the costs and benefits of open source frontier models. Companies are required to report on what steps they take to “secure model weights.” These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.

Any chance that you could make this more concrete by specifying such a proposal? I expect it'd be possible to make up an example if you want to avoid criticising any specific project.

I have seen several people say that EAs should focus on promoting stupid legislation that slows down AI incidentally, since that's "our best hope" to make sure things go well. In one of the footnotes, I cited an example of someone making this argument.

While this example could be dismissed as a weakman, I've also seen more serious proposals that I believe share both this theme and tone. This is how I currently perceive some of the "AI pause" proposals, especially those that fail to specify a mechanism to adjust regulatory strictness in response to new evide... (read more)

This is a good point. I neglected to address this possibility.

“If you then say we can split A into A1 and A2, you have added information to the problem. Like the Monty Hall problem, information can change the odds in unexpected ways!” - It’s not clear which is the baseline.

The point there is that there is no contradiction because the informational content is different. "Which is the baseline" is up to the person writing the problem to answer. You've asserted that the baseline is A vs B; then you've added information that A is actually A1 and A2. The issue here is entirely semantic ambiguity. Observe what happens when we remove the semantic ambiguity:

You've been observing a looping computer program for a while, and have determined that it shows three videos. The first video portrays a coin showing tails. The second video portrays two coins; the left coin shows heads, the right coin shows tails. The third video also portrays two coins; the left coin shows heads, the right coin shows heads. You haven't been paying attention to the frequency, but now, having determined there are three videos you can see, you want to figure out how frequently each video shows up. What are your prior odds for each video? 33/33/33 seems reasonable. I've specified that you're watching videos; the event is which video you are watching, not the events that unfold within the video.

Now, consider an alternative framing: You are watching somebody as they repeat a series of events. You have determined the events unfold in three distinct ways; all three begin the same way, with a coin being flipped. If the coin shows heads, it is flipped again. If the coin shows tails, it is not. What are your prior odds for each sequence of events? 25/25/50 seems reasonable.

Now, consider yet another framing: You are shown something on a looping computer screen. You have determined the visuals unfold in three distinct ways; all three begin the same way, with a coin being flipped. If the coin shows heads, it is flipped again. If the coin shows tails, it is not. What are your prior odds here? Both 25/25/50 and 33/33/33 are reasonable. Why? Because it is unclear whether or not you are watching a simulation of coin flips, or something like prerecorded videos;
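A quick simulation sketch of the coin-process framing (my own illustration, not from the comment): the process "flip; if heads, flip again" lands on tails-only half the time and on each two-flip sequence a quarter of the time, which is where the 25/25/50 answer comes from.

```python
import random

def observe(rng):
    """One run of the process: flip a coin; if heads, flip again."""
    if rng.random() < 0.5:  # first flip shows tails: sequence ends
        return "T"
    return "HT" if rng.random() < 0.5 else "HH"

rng = random.Random(0)
n = 100_000
counts = {"T": 0, "HT": 0, "HH": 0}
for _ in range(n):
    counts[observe(rng)] += 1

# Empirical frequencies approach T: 0.50, HT: 0.25, HH: 0.25 --
# unlike the video framing, where 33/33/33 is the natural prior.
print({k: round(v / n, 2) for k, v in counts.items()})
```

The simulation only settles the question for the process framing; if what loops is a playlist of three videos, the event space really is "which video", and the two framings legitimately give different priors.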

I think that there's good reasons why the discussion on Less Wrong has turned increasingly towards AI Alignment, but I am also somewhat disappointed that there's no longer a space focusing on rationality per se.

Just as the Alignment Forum exists as a separate space that automatically cross-posts to LW, I'm starting to wonder if we need a rationality forum that exists as a separate space and cross-posts to LW; if I were just interested in improving my rationality, I don't know that I'd come to Less Wrong.

(To clarify, unlike the Alignment Forum, I'd expect such a forum to be open-invite b/c the challenge would be gaining any content at all).

Alternatively, I think there is a way to hide the AI content on LW, but perhaps there should exist a very convenient and visible user interface for that. I would propose an extreme solution, like a banner on the top of the page containing a checkbox that hides all AI content. So that anyone, registered or not, could turn the AI content off in one click.
The Alignment forum works because there are a bunch of people who professionally pursue research over AI Alignment. There's no similar group of people for whom that's true for rationality. 
IIRC, you can get a post on the Alignment Forum only if you are invited or moderators crosspost it? The problem is that the Alignment Forum is deliberately for some sort of professionals, but everyone wants to write about alignment. Maybe it would be better if we had an "Alignment Forum for starters".
  • If I’d tried to assume that models are “trying” to get a low value for the training loss, I might have ended up relying on our ability to incentivize the model to make very long-term predictions. But I think that approach is basically a dead end.

Why do you believe that this is a dead-end?

I haven't listened to the whole interview, but it sounds like you might be reading more into it than is there.

Shane talked about the importance of checking the reasoning process given that reinforcement learning can lead to phenomena like deceptive alignment, but he didn't explain exactly how he hopes to deal with this other than saying that the reasoning process has to be checked very carefully.

This could potentially tie to some proposals such as approval-based agents, interpretability or externalized reasoning, but it wasn't clear to me how exactly he wa... (read more)

Seth Herd · 1mo
I think you're right that I'm reading into this. But there is probably more to his thinking, whether I'm right or wrong about what that is. Shane Legg was thinking about alignment as far back as his PhD thesis, which doesn't go into depth on it but does show he'd at least read some of the literature prior to 2008.

I agree that LLM chain of thought is not totally reliable, but I don't think it makes sense to dismiss it as too unreliable to work with for an alignment solution. There's so much that hasn't been tried, both in making LLMs more reliable, and in making agents built on top of them reliable: taking multiple paths, using new context windows and different models to force them to break problems into steps, and using the last natural-language statement as their whole context for the next step.

Whether or not this is a reliable path to alignment, it's a potential path to huge profits. So there are two questions: will this lead to alignable AGI? And will it lead to AGI at all? I think both are unanswered.

Thanks for the examples. The third example was good, the second was okay and the first and fourth didn't seem very good. Interested to see how this develops.

BTW, I was curious to see a concrete example where the same latent is applied in two different contexts.

Our inner-game policy efforts are already bearing fruit. We can't define precisely what will happen, but we can certainly push for more nuance via this route than we would be able to through the public outreach route.

I can see why you would be a lot more positive on advocacy if you thought that crippling advances was a way out of our current crisis. Unfortunately, I fear that would just result in AI being built by whichever country/actor cares the least about safety. So I think we need more nuance than this.

I agree that there is a trade-off here, however:

a) Dumbing down the message will cost us support from ML engineers and researchers.
b) If the message is dumbed down too much, then the public is unlikely to create pressure towards the kinds of actions that will actually help as opposed to pressuring politicians to engage in shallow, signaling-driven responses.

Ricardo Meneghin · 1mo
I think the idea we're going to be able to precisely steer government policy to achieve nuanced outcomes is dead on arrival - we've been failing at that forever. What's in our favor this time is that there are many more ways to cripple advance than to accelerate it, so it may be enough for the push to be simply directionally right for things to slow down (with a lot of collateral damage).

I strongly agree with Steven Byrnes' suggestion that we consider things on a case-by-case basis. I think it's clear both that some people in the alignment community have already taken actions that significantly shortened timelines and that it's possible to go too far. I suspect that the earlier someone is in their career, the less they need to worry about accidentally causing capability externalities, lest they end up paralysed; but the further people go, the more careful they have to be.

She thought it was spelling, not naming

Sorry, I'm still confused. She was pointing to objects and tapping to receive a name, so presumably she already knew that these words referred to objects.

I thought the revelation might be modularity. I don't know what this is called in linguistics.

Perhaps one can think of a sort of continuum: on one end you have a full understanding that it's a characteristic of language that "everything has a name", as in the Anne Sullivan quote; on the other end, an individual knows certain gestures are associated with getting another person to exhibit certain behaviors, like bringing desired objects to them, but has no intuition that there's a whole system of gestures they mostly haven't learned yet (as an example, a cat might know that rattling its food bowl will cause its owner to come over and refill it).

Even if Helen Keller was not all the way at the latter end of the continuum at the beginning of the story--she could already request new gestures for things she regularly wanted Anne Sullivan to bring to her or take her to--in the course of the story she might have made some significant leap in the direction of the former end. In particular, she might have realized that she could ask for names of all sorts of things even if there was no regular instrumental purpose for requesting that Sullivan bring them over to her (e.g. being thirsty and wanting water).

On the general topic of what the Helen Keller story can tell us about AI and whether complex sensory input is needed for humanlike understanding of words, a while ago I read an article that suggests some reasons for caution. It notes that she was not born blind and deaf, but "lost her sight and hearing after an illness at the age of two", so even if she had no conscious memory of what vision and hearing were like, they would have figured into her brain development until that point, as would her exposure to language up to that age. The end of the article discusses the techniques developed in Soviet institutions to help people who were actually born blind and deaf, like developing their sense of space by "gradually making the deaf/bli
Hobgoblin of Stillwater · 1mo
While I have not read Anne Sullivan's original text nor a biography of Keller, and I cannot say for sure what was happening in her head, here is one plausible theory:

For the longest time, despite learning many words for use in daily life, Keller did not actually grasp the concept of words being names of specific objects; rather, she regarded them as combinations of letters loosely associated with specific situations and sensations. For example, "mug" and "milk" and "drink", as far as she was concerned, were all just arbitrary combinations of signs that her teacher tended to utter in association with drinking milk.

In this view, when describing Helen's prior attitude, the teacher, Sullivan, is not actually speaking precisely: at that time, Keller did not actually want to know the 'name' of the object 'water'; she wanted to know 'what kind of letter combination is associated with the experience of washing'.

Once again, this is just the way in which I understand it, and I'm not saying this is actually the way Helen Keller thought.

I think this is a fascinating idea, although I have to be honest that I don’t find the examples you’ve provided very compelling. In order to be persuaded of the usefulness of these techniques, I’d want to see more concrete examples, as when the examples are abstract it is very hard (and subjective) to evaluate how well it is doing at decoding a latent representation in a new context.

In case anyone finds it helpful, the short version of this post seems to be:

  1. Train a model to encode and decode text to and from a latent space
  2. Train a model to predict the
... (read more)
It's cool that this works (at least a bit)! It reminds me of the world models in RL agents, as these have an encoder, decoder, and latent-space predictor (conditional on action). I wonder how long it will be before someone uses an LLM as an explicit world model in an agent. Given the general power of pretrained LLMs, it may help with the data efficiency of RL agents (ignoring the LLM pretraining). Making an agent won't help with alignment, but having a world model (and its associated state) to inspect might.
So it's definitely not invincible; you do not get full control over the model with this technique yet. However, I would have you notice a few things:

1. Very little optimization effort has been put into this technique, and into text VAEs in general, compared to GPT-N. Rather than think of this as the power the method has, think of it as the lower bound: the thing you can do with a modest compute budget and a few dedicated researchers.

2. I haven't yet implemented all of what I want in terms of inference techniques. A potentially big low-hanging fruit is classifier-free guidance, which is what took CLIP-conditioned diffusion from mediocre to quite good.

Ideally I'll be able to set up something like a Google Colab or Gradio or HuggingFace Space to let people try the model themselves without setting up a local environment, since in practice it seems like models don't really exist to people unless there's a convenient way to inference with them in their browser or on their phone. In the meantime here's a simple example; I'd be happy to do context + prompt pairs submitted by the audience in this thread if people want to see what the model will do but don't want to go to the trouble of setting it up themselves.

context = "A little girl was walking her dog on a clear day. Everything was bright and happy for the two. They stopped to admire a field of flowers and the dog sniffed at the dirt."

(Pdb) apply_op(vae_model, router, context, "Without warning it began to rain. I had never seen such a torrential downpour before. Every part of my clothes got soaked and the streets filled with the waters of a spring rain. It made me happy because I like the rain, I enjoyed getting soaked.")

'<s> A little girl was walking her dog on a clear day. Everything was bright and happy for the two. They stopped to admire a field of flowers and the dog sniffed at the dirt. The girl looked around and saw a purplish part of the turf. It made her so happy because every part of the turf looked the

I would really like him to do that. However I suspect that he feels that these arguments actually aren’t that strong when you look into it, so that might not make sense from his perspective.

I’m still confused by the Helen Keller example. It sounds like she already knew that she could ask for the names of objects, so I’m struggling to see what the realisation was that led her to excitedly ask about the names of a bunch of objects.

The way I read it, her teacher was trying to tell her about words, but she didn't make the connection between the words and mental objects (she thought it was spelling, not naming). Once she did, they became much more interesting.

One thing that makes me more comfortable with making statements that are less nuanced in some circumstances is Wittgenstein's idea of language games. Rationalists have a tendency to take words literally, whilst Wittgenstein views statements as moves in a language game, where there are a host of different language games for different situations and people can generally figure it out. Specifically, there seem to be some distinct language games associated with protests, where people understand that your sign or slogan doesn't cover everything in complete nuance. At the same time, I think we should be trying to raise the bar in terms of the epistemics/openness of our advocacy work, and I do see risks in people taking this reasoning too far.

Ricardo Meneghin · 1mo
There is a massive tradeoff between nuance/high epistemic integrity and reach. The general population is not going to engage in complex nuanced arguments about this, and prestigious or high-power people who are able to understand the discussion and potentially steer government policy in a meaningful way won't engage in this type of protest for many reasons, so the movement should be ready for dumbing-down or at least simplifying the message in order to increase reach, or risk remaining a niche group (I think "Pause AI" is already a good slogan in that sense). 

I suspect that MIRI was prioritising alignment research over the communication of that research when they were optimistic about their alignment directions panning out. It feels like that was a reasonable bet to make, even though I do wish they'd communicated their perspective earlier (happily, they've been publishing a lot more recently).

The POC || GTFO article was very interesting.

I do worry though that it is mixing together pragmatics and epistemics (even though it does try to distinguish the two). Like there's a distinction between when it's reasonable to believe something and when it's reasonable to act upon something.

For example, when I was working as a web developer, there's lots of potential bugs where it would have made sense to believe that there was a decent chance we were vulnerable, but pragmatically we couldn't spare the time to fix every potential security issue. It doesn't m... (read more)

I agree, which is why I have an entire section or 2 about why I think ML/AI isn't like computer security.

Regarding security mindset, I think where it really kicks in is when you have a system utilising its intelligence to work around any limitations, such that you're no longer looking at a "broad, reasonable" distribution of scenarios, but at a "very specific" scenario that a powerful optimiser has pushed you towards. In that case, doing things like doubling the size of your safety margin may not save your safety schemes if the AI now has the intelligence to get around them.

The problem here is that it shares a similar issue to optimization daemons/goal misgeneralization, etc., and a comment from Iceman sums it up perfectly: "or trying to translate that into lesswrongesse, you do not have warrant to believe in something until you have an example of the thing you're maybe worried about being a real problem because you are almost certain to be privileging the hypothesis." Or equivalently from lc: "you only start handing out status points after someone has successfully demonstrated the security failure, ideally in a deployed product or at the very least a toy program." This is to a large extent the issue I have with attempted breaks on alignment: pretty much no alignment break has been demonstrated, and in the cases where they have been, we have mixed to slightly positive results at best.

On free will: I don't endorse the claim that "we could have acted differently" as an unqualified statement.

However, I do believe that in order to talk about decisions, we do need to grant validity to a counterfactual view where we could have acted differently as a pragmatically useful fiction.

What's the difference? Well, you can't use the second to claim determinism is false.

I argue that "I can do X" means "If I want to do X, I will do X". This can be true (as an unqualified statement) even with determinism. It is different from saying that X is physically possible.
This lack of contact with the naive conception of possibility should be developed further, so that the reasons for temptation to use the word "fiction" dissolve. An object that captures a state of uncertainty doesn't necessarily come with a set of concrete possibilities that are all "really possible". The object itself is not "fictional", and its shadows in the form of sets of possibilities were never claimed either to be "real possibilities" or to sum up the object, so there is no fiction to be found.

A central example of such an object is a program equipped with theorems about its "possible behaviors". Are these behaviors "really possible"? Some of them might be, but the theorems don't pin that down. Instead there are spaces on which the remaining possibilities are painted, shadows of behavior of the program as a whole, such as a set of possible tuples for a given pair of variables in the code. A theorem might say that reality lies within the particular part of the shadow pinned down by the theorem. One of those variables might've stood for your future decision. What "fiction"? All decision-relevant possibility originates like that.

Do phase transitions actually show up? So far, the places where theoretically predicted phase transitions are easiest to confirm are simplified settings like deep linear networks and toy models of superposition. For larger models, we expect phase transitions to be common but "hidden." Among our immediate priorities are testing just how common these transitions are and whether we can detect hidden transitions.


What do you mean by 'hidden"?

Not easily detected.  As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.
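As a toy numeric illustration of the point above (hypothetical numbers, my own sketch): when the tracked loss is an average over thousands of subtasks, even a dramatic per-subtask transition barely moves the aggregate metric.

```python
# The tracked loss is an average over many subtasks.
n_subtasks = 10_000
losses = [2.0] * n_subtasks
before = sum(losses) / n_subtasks

# A hidden "phase transition": a circuit forms and one subtask's loss
# collapses from 2.0 to 0.0 in a single step.
losses[0] = 0.0
after = sum(losses) / n_subtasks

# The aggregate loss moves by only ~0.0002 -- typically far below
# step-to-step SGD noise, so the transition is invisible in the loss curve.
print(before, after)
```

This is why detecting such transitions plausibly requires looking at internal structure or per-subtask metrics rather than the overall loss.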

Thinking this through.

There are a lot of ways in which speedrunning is like paperclip maximisation: speedrunning doesn't contribute to society, and paperclips become useless once we've produced a certain amount.

I'm still confused by the analogy, though, because it seems like a lot of people speedrun for fun (maybe you see it as more about status), while paperclip production isn't fun. I think this makes a difference: even though we don't want our society to produce absurd amounts of paperclips, we probably do want lots of niche ways to have fun.

Competitive paperclip maximization in a controlled setting sounds like it might be fun. The important thing is that it's one thing that's fun out of many things, and variety is important.

How so: "There’s a lot more we hope to say about our new (and still evolving) strategy, and about our general thinking on the world’s (generally very dire) situation. But I don’t want those announcements to further delay sharing the above updates, so I’ve factored our 2023 strategy updates into multiple posts, beginning with this one."

Yep, you're totally right.

Interesting idea. Looking forward to seeing how this goes!

After skimming, I'm still confused. How do you actually want us to use these? GPT-4 is undoubtedly useful, including for people trying to save the world, but it's not clear what we should want to do with it. So too for goal-agnostic systems.

I intentionally left out the details of "what do we do with it" because it's conceptually orthogonal to goal agnosticism and is a huge topic of its own. It comes down to the class of solutions enabled by having extreme capability that you can actually use without it immediately backfiring. For example, I think this has a real shot at leading to a strong and intuitively corrigible system. I say "intuitively" here because the corrigibility doesn't arise from a concise mathematical statement that solves the original formulation. Instead, it lets us aim the system at an incredibly broad and complex indirect specification that gets us all the human messiness we want.

I don’t know, but would love to find out.

Nate Showell:
I asked on Discord and someone told me this: 

I claim that you also need a could_C ("Asher counterfactually could drive off the cliff") unless you want to be an eliminativist about counterfactuals.

I've written about this here. Eliezer seems to have found the same solution that I did for the student-and-exam problem: there's a distinction between a) being fixed independently of your action and b) being fixed given your action.

I wrote in one of my footnotes:

You may object that RLHF is mostly capabilities. I also tend to think about it as being primarily a capabilities advance, but it is an advance in alignment as well

i.e., it belongs in the reference class when figuring out the difficulty of making progress on alignment.

Regarding scalability, I wrote:

I suspect that all of these approaches are still very far away from where we need to be. I consider them substantial advances nonetheless for two key reasons: having a baseline helps people choose an appropriate level of ambition, and

…
Nate Showell:
Has anyone developed a metric for quantifying the level of linearity versus nonlinearity of a model's representations? A metric like that would let us compare the levels of linearity for models of different sizes, which would help us extrapolate whether interpretability and alignment techniques that rely on approximate linearity will scale to larger models.
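One simple candidate (my sketch, not an established metric from the thread): fit the best linear map from activations to a feature and report the R² of that fit. A feature the model represents linearly should have R² near 1; a feature encoded nonlinearly won't. The scalar case, with a hypothetical linear and a hypothetical nonlinear feature:

```python
import math
import random

random.seed(1)

def r_squared(xs, ys):
    """Fraction of variance in ys explained by the best linear fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

# Hypothetical scalar "activations" and two features they might encode.
acts = [random.gauss(0, 1) for _ in range(1000)]
linear_feature = [2 * a + random.gauss(0, 0.1) for a in acts]
nonlinear_feature = [math.tanh(3 * a) for a in acts]

print(f"linear feature R^2:    {r_squared(acts, linear_feature):.3f}")
print(f"nonlinear feature R^2: {r_squared(acts, nonlinear_feature):.3f}")
```

Averaging such probe R² scores over many features would give one crude per-model linearity score that could be compared across model sizes, though it inherits the usual caveats of linear probing.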

I'm working on an adversarial collaboration with Abram Demski atm, but after that's finished, I'd be open to having a dialog here on decision theory.

I'm skeptical because, in addition to logical reasoning, intuitive reasoning seems pretty important. And I'm not sure there's a simpler representation of intuitive reasoning than a bunch of weights from a bunch of concepts to another concept.

Check out this great paper: "From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought". It proposes probabilistic programming as a formal "Probabilistic Language of Thought" (PLoT) supporting precise Bayesian reasoning. Across four domains, the authors show how a large language model can convert an informal statement or chain of reasoning into a precise probabilistic program, perform exact Bayesian reasoning on it, and then convert the results back into informal natural language.
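To give a minimal flavor of what a probabilistic program looks like (this toy is mine, not from the paper, and it uses crude rejection sampling rather than the paper's machinery): an informal statement like "we saw 8 heads in 10 flips; how biased is the coin likely to be?" becomes a generative model plus a conditioning step.

```python
import random

random.seed(0)

def model():
    """Generative model: an unknown coin bias, then 10 flips of that coin."""
    bias = random.random()  # uniform prior over the bias
    heads = sum(random.random() < bias for _ in range(10))
    return bias, heads

# Condition on the observation (exactly 8 heads) via rejection sampling.
posterior = [bias for bias, heads in (model() for _ in range(200_000))
             if heads == 8]
mean_bias = sum(posterior) / len(posterior)
print(f"posterior mean bias: {mean_bias:.2f}")  # roughly 0.75 (Beta(9, 3) mean)
```

The answer can then be rendered back into natural language ("the coin is probably around 75% biased toward heads"), which is the round trip the paper automates with an LLM.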

Your comment focuses on GPT4 being "pretty good at extracting preferences from human data" when the stronger part of the argument seems to be that "it will also generally follow your intended directions, rather than what you literally said".

I agree with you that it was obvious in advance that a superintelligence would understand human value.

However, it sure sounded like you thought we'd have to specify each little detail of the value function. GPT4 seems to suggest that the biggest issue will be a situation where:

1) The AI has an option that would produce …

I'm leaving that open to interpretation. Feel free to add answers relevant to your interpretation.
