You apparently completely misunderstood the point we were making with the white box thing.
I think you need to taboo the term white box and come up with a new term that will result in less confusion and fewer people talking past each other.
Some thoughts, since you asked for feedback elsewhere:
1) Part of the reason why this post is likely being downvoted is the clickbait title. This is not looked upon favorably on Less Wrong.
2) You make some pretty good points in this post, but you state them far too confidently, almost like a mathematical proof. If you want your posts to do well on Less Wrong, try not to make strong claims without correspondingly strong evidence.
Gabriel makes a very good point: there is something of a tension between allowing reign of terror moderation and considering it a norm violation to request the deletion of comments for low quality.
(TBH, I was convinced that reign of terror would be a disaster, but it seems to be working out okay so far).
I imagine an explanation that gets progressively longer. One-paragraph summary. One-screen explanation. One article containing the gist of the Sequences (with links to Read the Sequences), a brief history of the rationalist movement, frequently used concepts (with links to explanations), maybe a list of frequent myths.
Technically, all three could be at the same URL.
E/acc seems to be really fired up about this:
https://twitter.com/ctjlewis/status/1725745699046948996
AI alignment certification and peacebuilding seem like two very different and distinct projects. I'd strongly suggest picking one.
I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.
I also notice that I am just afraid of what would happen if I were to e.g. write a post that's just like "an overview of the EA-ish/X-risk-ish policy landscape" that names specific people and explains various historical plans. Like I expect it would make me a lot of enemies.
This seems like a bad idea.
Transparency is important, but ideally, we would find ways to increase this without blowing up a bunch of trust within the community. I guess I'd question whether this is really the bottleneck in terms of transparency/public trust.
I'm worried that as a response to FTX we might end up turning this into a much more adversarial space.
I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source."
The problem with this plan is that it assumes that there are easy ways to robustify the world. What if the only proper defense against bioweapons is a complete monitoring of the entire internet? Perhaps this is something that we'd li...
For the record, I updated on ChatGPT. I think that the classic example of imagining telling an AI to get a coffee and it pushes a kid out of the way isn't so much of a concern any more. So the remaining concerns seem to be inner alignment + outer alignment far outside normal human experience + value lock-in.
Thanks for highlighting the issue with the discourse here. People use the word evidence in two different ways which often results in people talking past one another.
I'm using your broader definition, where I imagine that Stella is excluding things that don't meet a more stringent standard.
And my claim is that reasoning under uncertainty sometimes means making decisions based on evidence[1] that is weaker than we'd like.
[1] Broad definition.
Oh, I don't think it actually would end up being temporary, because I expect with high probability that the empirical results of more robust evaluations would confirm that open-source AI is indeed dangerous. I meant temporary in the sense that the initial restrictions might either a) have a time limit or b) be subject to re-evaluation at a specified point.
Define evidence.
I'm not asking this just to be pedantic, but because I think it'll make the answer to your objection clearer.
Even with its flaws, this study is sufficient evidence for us to want to enact temporary regulation at the same time as we work to provide more robust evaluations.
Note that if I thought regulations would be temporary, or had a chance of loosening over time after evals found that the risks from models at compute size X would not be catastrophic, I would be much less worried about all the things I'm worried about re. open source and power and banning open source.
But I just don't think that most regulations will be temporary. A large number of people...
I don't know if you need professionals, just a bunch of people who are interested in discussing the topic. It wouldn't need to use the Alignment Forum's invite-only system.
Instead, it would just be a way to allow LW to cater to both audiences at the same time.
I'm still pretty skeptical of what would happen without explicit focus. The Bletchley Park declaration was a super vague and applause-lighty declaration, which fortunately mentions issues of control, but just barely. It's not clear to me yet that this will end up receiving much dedicated focus.
Regarding biosecurity and cyber, my big worry here is open source, and it seems totally plausible that a government will pass mostly sensible regulation, then create a massive gaping hole where open-source regulation should be.
It’s also plausible that the US government will ban or seriously restrict open source frontier models. Section 4.6 of the EO requests a government report on the costs and benefits of open source frontier models. Companies are required to report on what steps they take to “secure model weights.” These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.
Any chance that you could make this more concrete by specifying such a proposal? I expect it'd be possible to make up an example if you want to avoid criticising any specific project.
I have seen several people say that EAs should focus on promoting stupid legislation that slows down AI incidentally, since that's "our best hope" to make sure things go well. In one of the footnotes, I cited an example of someone making this argument.
While this example could be dismissed as a weakman, I've also seen more serious proposals that I believe share both this theme and tone. This is how I currently perceive some of the "AI pause" proposals, especially those that fail to specify a mechanism to adjust regulatory strictness in response to new evide...
“If you then say we can split A into A1 and A2, you have added information to the problem. Like the Monty Hall problem, information can change the odds in unexpected ways!” - It’s not clear which is the baseline.
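To make that concrete (my own illustration, assuming a principle of indifference over whichever partition you take as the baseline):

With a uniform prior over {A, B}: P(A) = 1/2.
With a uniform prior over {A1, A2, B}: P(A) = P(A1) + P(A2) = 1/3 + 1/3 = 2/3.

Which answer you get depends entirely on which partition you treat as the baseline, which is exactly the ambiguity I'm pointing at.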
I think that there are good reasons why the discussion on Less Wrong has turned increasingly towards AI Alignment, but I am also somewhat disappointed that there's no longer a space focusing on rationality per se.
Just as the Alignment Forum exists as a separate space that automatically cross-posts to LW, I'm starting to wonder if we need a rationality forum that exists as a separate space and cross-posts to LW; if I were just interested in improving my rationality, I don't know whether I'd come to Less Wrong.
(To clarify, unlike the Alignment Forum, I'd expect such a forum to be open-invite b/c the challenge would be gaining any content at all).
- If I’d tried to assume that models are “trying” to get a low value for the training loss, I might have ended up relying on our ability to incentivize the model to make very long-term predictions. But I think that approach is basically a dead end.
Why do you believe that this is a dead-end?
I haven't listened to the whole interview, but it sounds like you might be reading more into it than is there.
Shane talked about the importance of checking the reasoning process given that reinforcement learning can lead to phenomena like deceptive alignment, but he didn't explain exactly how he hopes to deal with this other than saying that the reasoning process has to be checked very carefully.
This could potentially tie to some proposals such as approval-based agents, interpretability or externalized reasoning, but it wasn't clear to me how exactly he wa...
Thanks for the examples. The third example was good, the second was okay and the first and fourth didn't seem very good. Interested to see how this develops.
BTW, I was curious to see a concrete case where the same example was applied in two different contexts.
Our inner game policy efforts are already bearing fruit. We can't precisely define exactly what will happen, but we can certainly push for more nuance via this route than we would be able to through the public outreach route.
I can see why you would be a lot more positive on advocacy if you thought that crippling advances was a way out of our current crisis. Unfortunately, I fear that will just result in AI being built by whichever country/actor cares the least about safety. So I think we need more nuance than this.
I agree that there is a trade-off here, however:
a) Dumbing down the message will cost us support from ML engineers and researchers.
b) If the message is dumbed down too much, then the public is unlikely to create pressure towards the kinds of actions that will actually help as opposed to pressuring politicians to engage in shallow, signaling-driven responses.
I strongly agree with Steven Byrnes's suggestion that we consider things on a case-by-case basis. I think it's both clear that some people in the alignment community have already taken actions that have significantly shortened timelines and also that it's possible to go too far. I suspect that the earlier someone is in their career, the less they need to worry about accidentally causing capability externalities lest they end up paralysed, but the further people go, the more careful they have to be.
She thought it was spelling, not naming
Sorry, I'm still confused. She was pointing to objects and tapping to receive a name, so presumably she already knew that these words referred to objects.
I think this is a fascinating idea, although I have to be honest that I don’t find the examples you’ve provided very compelling. In order to be persuaded of the usefulness of these techniques, I’d want to see more concrete examples, as when the examples are abstract it is very hard (and subjective) to evaluate how well it is doing at decoding a latent representation in a new context.
In case anyone finds it helpful, the short version of this post seems to be:
I would really like him to do that. However, I suspect that he feels these arguments actually aren't that strong when you look into them, so that might not make sense from his perspective.
I’m still confused by the Helen Keller example. It sounds like she already knew that she could ask for the names of objects, so I’m struggling to see what the realisation was that led her to excitedly ask about the names of a bunch of objects.
One thing that makes me more comfortable with making statements that are less nuanced in some circumstances is Wittgenstein's idea of language games. Rationalists have a tendency to take words literally, whilst Wittgenstein views statements as moves in language games, where there are a host of different language games for different situations and people can generally figure it out. Specifically, there seem to be some distinct language games associated with protests, where people understand that your sign or slogan doesn't cover everything in complete nuance. At the same time, I think we should be trying to raise the bar in terms of the epistemics/openness of our advocacy work, and I do see risks in people taking this reasoning too far.
I suspect that MIRI was prioritising alignment research over the communication of that research when they were optimistic about their alignment directions panning out. It feels like that was a reasonable bet to make, even though I do wish they'd communicated their perspective earlier (happily, they've been publishing a lot more recently).
The POC || GTFO article was very interesting.
I do worry though that it is mixing together pragmatics and epistemics (even though it does try to distinguish the two). Like there's a distinction between when it's reasonable to believe something and when it's reasonable to act upon something.
For example, when I was working as a web developer, there were lots of potential bugs where it would have made sense to believe that there was a decent chance we were vulnerable, but pragmatically we couldn't spare the time to fix every potential security issue. It doesn't m...
Regarding security mindset, I think that where it really kicks in is when you have a system utilising its intelligence to work around any limitations, such that you're no longer looking at a "broad, reasonable" distribution over the space, but a "very specific" scenario that a powerful optimiser has pushed you towards. In that case, doing things like doubling the size may not save your safety schemes if the AI now has the intelligence to get around them.
On free will: I don't endorse the claim that "we could have acted differently" as an unqualified statement.
However, I do believe that in order to talk about decisions, we do need to grant validity to a counterfactual view where we could have acted differently as a pragmatically useful fiction.
What's the difference? Well, you can't use the second to claim determinism is false.
Do phase transitions actually show up? So far, the places where theoretically predicted phase transitions are easiest to confirm are simplified settings like deep linear networks and toy models of superposition. For larger models, we expect phase transitions to be common but "hidden." Among our immediate priorities are testing just how common these transitions are and whether we can detect hidden transitions.
What do you mean by "hidden"?
Not easily detected. As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.
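To make the "hidden" point concrete, here's a toy sketch of my own (hypothetical numbers, not from any real training run or from their code): if the loss is an average over thousands of subtasks, a sharp drop on one subtask barely moves the aggregate.

```python
# Toy illustration: a sudden per-subtask improvement is nearly invisible
# in the aggregate loss you would normally track.
import numpy as np

np.random.seed(0)

n_subtasks = 5000       # loss is an average over many subtasks
n_steps = 1000
transition_step = 600   # one subtask undergoes a sudden "phase transition"

# Baseline: every subtask drifts slowly from loss ~2.0 towards ~1.5, plus noise.
steps = np.arange(n_steps)
losses = 2.0 - 0.5 * (steps / n_steps)[None, :] \
         + 0.01 * np.random.randn(n_subtasks, n_steps)

# Subtask 0 has a sharp drop in loss at the transition step.
losses[0, transition_step:] -= 1.0

aggregate = losses.mean(axis=0)   # the curve you normally look at
subtask0 = losses[0]              # a per-subtask curve reveals the jump

print("Aggregate loss jump at transition:",
      aggregate[transition_step] - aggregate[transition_step - 1])
print("Subtask 0 loss jump at transition:",
      subtask0[transition_step] - subtask0[transition_step - 1])
# The aggregate moves by roughly 1/5000 of the subtask jump, so the
# transition is effectively hidden in the usual metrics.
```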
Thinking this through.
There's a lot of ways in which speedrunning is like paperclip maximisation: speedrunning doesn't contribute to society, and further paperclips become useless after we've produced a certain amount.
I'm still confused by the analogy though. Because it seems like a lot of people may do speedrunning for fun - but maybe you see it as more about status - while paperclip production isn't fun. I think this makes a difference, as even though we don't want our society to produce absurd amounts of paperclips, we probably do want lots of niche ways to have fun.
How so: "There’s a lot more we hope to say about our new (and still evolving) strategy, and about our general thinking on the world’s (generally very dire) situation. But I don’t want those announcements to further delay sharing the above updates, so I’ve factored our 2023 strategy updates into multiple posts, beginning with this one."
After skimming, I'm still confused. How do you actually want us to use these? GPT4 is undoubtedly useful, including for people trying to save the world, but it's not clear what we should want to do with it. So too for goal-agnostic systems.
I claim that you also need a could_C "Asher counterfactually could drive off the cliff" unless you want to be eliminativist about counterfactuals.
I've written about this here. Eliezer seems to have found the same solution that I did for the student and exam problem: there's a distinction between a) being fixed independently of your action b) being fixed given your action.
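To spell the distinction out a little more formally (my own notation, just a sketch): the fatalist treats the outcome O as fixed independently of the action, i.e. O(a) = o for every action a, whereas the useful-fiction view only claims that O is determined given the action actually taken, so O(study) and O(don't study) are still allowed to differ. Only the first reading lets the student argue that studying is pointless.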
I wrote in one of my footnotes:
You may object that RLHF is mostly capabilities. I also tend to think about it as being primarily a capabilities advance, but it is an advance in alignment as well, i.e. it belongs in the reference class when figuring out the difficulty of making progress on alignment.
...Regarding scalability, I wrote:
I suspect that all of these approaches are still very far away from where we need to be. I consider them substantial advances nonetheless for two key reasons: having a baseline helps people choose an appropriate level of ambition, and
I'm working on an adversarial collaboration with Abram Demski atm, but after that's finished, I'd be open to having a dialog here on decision theory.
I'm skeptical because, in addition to logical reasoning, intuitive reasoning seems pretty important. And I'm not sure if there's a simpler representation of intuitive reasoning than a bunch of weights from a bunch of concepts to another concept.
Your comment focuses on GPT4 being "pretty good at extracting preferences from human data" when the stronger part of the argument seems to be that "it will also generally follow your intended directions, rather than what you literally said".
I agree with you that it was obvious in advance that a superintelligence would understand human values.
However, it sure sounded like you thought we'd have to specify each little detail of the value function. GPT4 seems to suggest that the biggest issue will be a situation where:
1) The AI has an option that would produce ...
I'm leaving that open to interpretation. Feel free to add answers relevant to your interpretation.
This is very tricky. On one hand, this may actually Streisand effect these results to greater prominence. On the other hand, at the point where people were specifically working around this to gain access to log-in gated LW resources, this would probably enhance our community status/prestige which might actually increase our influence.