Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.

Sequences

2022 MIRI Alignment Discussion
2021 MIRI Conversations
Naturalized Induction

Comments

The verbatim statement is:

We have people in crypto who are good at breaking things, and they're the reason why anything is not on fire. And some of them might go into breaking AI systems instead, 'cause that's where you learn anything.

You know, you know, any fool can build a crypto system that they think will work. Breaking existing crypto systems -- cryptographical systems -- is how we learn who the real experts are. So maybe the people finding weird stuff to do with AIs, maybe those people will come up with some truth about these systems that makes them easier to align than I suspect.

When he says "cryptographical systems", he's clarifying what he meant by "crypto" in the previous few clauses (this is a bit clearer from the video, where you can hear his tone). He often says stuff like this about cryptography and computer security; e.g., see the article Eliezer wrote on Arbital called Show me what you've broken:

See AI safety mindset. If you want to demonstrate competence at computer security, cryptography, or AI alignment theory, you should first think in terms of exposing technically demonstrable flaws in existing solutions, rather than solving entire problems yourself. Relevant Bruce Schneier quotes: "Good engineering involves thinking about how things can be made to work; the security mindset involves thinking about how things can be made to fail" and "Anyone can invent a security system that he himself cannot break. Show me what you've broken to demonstrate that your assertion of the system's security means something."

See also So Far: Unfriendly AI Edition:

And above all, aligning superhuman AI is hard for similar reasons to why cryptography is hard. If you do everything right, the AI won’t oppose you intelligently; but if something goes wrong at any level of abstraction, there may be powerful cognitive processes seeking out flaws and loopholes in your safety measures.

When you think a goal criterion implies something you want, you may have failed to see where the real maximum lies. When you try to block one behavior mode, the next result of the search may be another very similar behavior mode that you failed to block. This means that safe practice in this field needs to obey the same kind of mindset as appears in cryptography, of “Don’t roll your own crypto” and “Don’t tell me about the safe systems you’ve designed, tell me what you’ve broken if you want me to respect you” and “Literally anyone can design a code they can’t break themselves, see if other people can break it” and “Nearly all verbal arguments for why you’ll be fine are wrong, try to put it in a sufficiently crisp form that we can talk math about it” and so on. (AI safety mindset)

And Security Mindset and Ordinary Paranoia.

I'm happy you linkposted this so people could talk about it! The transcript above is extremely error-laden, though, to the extent that I'm not sure there's much useful signal here unless you read with extreme care?

I've tried to fix the transcription errors, and posted a revised version at the bottom of this post (minus the first 15 minutes, which are meta/promotion stuff for Bankless). I vote for you copying over the Q&A transcript here so it's available both places.

Do you know of any arguments with a similar style to The Most Important Century that are as pessimistic as EY/MIRI folks (>90% probability of AGI within 15 years)?

Wait, what? Why do you think anyone at MIRI assigns >90% probability to AGI within 15 years? That sounds wildly too confident to me. I know some MIRI people who assign 50% probability to AGI by 2038 or so (similar to Ajeya Cotra's recently updated view), and I believe Eliezer is higher than 50% by 2038, but if you told me that Eliezer told you in a private conversation "90+% within 15 years" I would flatly not believe you.

I don't think timelines have that much to do with why Eliezer and Nate and I are way more pessimistic than the Open Phil crew.

Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript), so I've posted an edited version of both transcripts. I vote that you edit your own post to include the revisions I made.

Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because too many filler words and false starts to sentences were left in):

  • Predictions are hard, especially about the future. I sure hope that this is where it saturates. This is like the next generation. It goes only this far, it goes no further
    • Predictions are hard, especially about the future. I sure hope that this is where it saturates — this or the next generation, it goes only this far, it goes no further
  • the large language model technologies, basic vulnerabilities, that's not reliable.
    • the large language model technologies’ basic vulnerability is that it’s not reliable
  • So you're saying this is super intelligence, we'd have to imagine something that knows all of the chess moves in advance. But here we're not talking about chess, we're talking about everything.
    • So you're saying [if something is a] superintelligence, we'd have to imagine something that knows all of the chess moves in advance. But here we're not talking about chess, we're talking about everything.
  • Ryan: The dumb way to ask that question too is like, Eliezer, why do you think that the AI automatically hates us? Why is it going to- It doesn't hate you. Why does it want to kill us all?
    • Ryan: The dumb way to ask that question too is like, Eliezer, why do you think that the AI automatically hates us? Why is it going to—

      Eliezer:  It doesn't hate you.

       Ryan: Why does it want to kill us all?
  • That's an irreducible source of uncertainty with respect to superintelligence or anything that's smarter than you. If you could predict exactly what it would do, it'd be that smart. Yourself, it doesn't mean you can predict no facts about it.
    • That's an irreducible source of uncertainty with respect to superintelligence or anything that's smarter than you. If you could predict exactly what it would do, you'd be that smart yourself. It doesn't mean you can predict no facts about it.
  • Eliezer: I mean, I could say something like shut down all the large GPU clusters. How long do I have God mode? Do I get to like stick around?
    • Eliezer: I mean, I could say something like shut down all the large GPU clusters. How long do I have God mode? Do I get to like stick around for seventy years?
  • Ryan: And do you think that's what happens? Yeah, it doesn't help with that. We would see evidence of AIs, wouldn't we?

    Ryan:  Yeah. Yes. So why don't we?
    • Ryan: And do you think that's what happens? Yeah, it doesn't help with that. We would see evidence of AIs, wouldn't we?

      Eliezer: Yeah.

      Ryan:  Yes. So why don't we?
  • It's surprising if the thing that you're wrong about causes the rocket to go twice as high on half the fuel you thought was required and be much easier to steer than you were afraid of. The analogy I usually use for this is, very early on in the Manhattan Project, they were worried about what if the nuclear weapons can ignite fusion in the nitrogen in the atmosphere. 
    • It's surprising if the thing that you're wrong about causes the rocket to go twice as high on half the fuel you thought was required and be much easier to steer than you were afraid of.

      Ryan: So, are you...

      David: Where the alternative was, “If you’re wrong about something, the rocket blows up.”

      Eliezer: Yeah. And then the rocket ignites the atmosphere, is the problem there.

      Or rather: a bunch of rockets blow up, a bunch of rockets go places... The analogy I usually use for this is, very early on in the Manhattan Project, they were worried about “What if the nuclear weapons can ignite fusion in the nitrogen in the atmosphere?”
  • But you're saying if we do that too much, all of a sudden the system will ignite the whole entire sky, and then we will all know.

    Eliezer: You can run chatGPT any number of times without igniting the atmosphere.
    • But you're saying if we do that too much, all of a sudden the system will ignite the whole entire sky, and then we will all...

      Eliezer: Well, no. You can run ChatGPT any number of times without igniting the atmosphere.
  • I mean, we have so far not destroyed the world with nuclear weapons, and we've had them since the 1940s. Yeah, this is harder than nuclear weapons. Why is this harder?
    • I mean, we have so far not destroyed the world with nuclear weapons, and we've had them since the 1940s.

      Eliezer: Yeah, this is harder than nuclear weapons. This is a lot harder than nuclear weapons.

      Ryan: Why is this harder?
  • And there's all kinds of, like, fake security. It's got a password file. This system is secure. It only lets you in if you type a password.
    • And there's all kinds of, like, fake security. “It's got a password file! This system is secure! It only lets you in if you type a password!”
  • And if you never go up against a really smart attacker, if you never go far to distribution against a powerful optimization process looking for holes,
    • And if you never go up against a really smart attacker, if you never go far out of distribution against a powerful optimization process looking for holes,
  • Do they do, are we installing UVC lights in public, in, in public spaces or in ventilation systems to prevent the next respiratory born pandemic respiratory pandemic? It is, you know, we, we, we, we lost a million people and we sure did not learn very much as far as I can tell for next time. We could have an AI disaster that kills a hundred thousand people. How do you even do that? Robotic cars crashing into each other, have a bunch of robotic cars crashing into each other.
    • Are we installing UV-C lights in public spaces or in ventilation systems to prevent the next respiratory pandemic? You know, we lost a million people and we sure did not learn very much as far as I can tell for next time.

      We could have an AI disaster that kills a hundred thousand people—how do you even do that? Robotic cars crashing into each other? Have a bunch of robotic cars crashing into each other! It's not going to look like that was the fault of artificial general intelligence because they're not going to put AGIs in charge of cars.
  • Guern
    • Gwern
  • When I dive back into the pool, I don't know, maybe I will go off to conjecture or anthropic or one of the smaller concerns like Redwood Research, being the only ones I really trust at this point, but they're tiny, and try to figure out if I can see anything clever to do with the giant inscrutable matrices of floating point numbers.
    • When I dive back into the pool, I don't know, maybe I will go off to Conjecture or Anthropic or one of the smaller concerns like Redwood Research—Redwood Research being the only ones I really trust at this point, but they're tiny—and try to figure out if I can see anything clever to do with the giant inscrutable matrices of floating point numbers.
  • We have people in crypto who are good at breaking things, and they're the reason why anything is not on fire. Some of them might go into breaking AI systems instead because that's where you learn anything. Any fool can build a crypto system that they think will work. Breaking existing crypto systems, cryptographical systems is how we learn who the real experts are.
    • We have people in crypto[graphy] who are good at breaking things, and they're the reason why anything is not on fire. Some of them might go into breaking AI systems instead, because that's where you learn anything.

      You know: Any fool can build a crypto[graphy] system that they think will work. Breaking existing cryptographical systems is how we learn who the real experts are.
  • And who else disagrees with me? I'm sure Robin Hanson would be happy to come up. Well, I'm not sure he'd be happy to come on this podcast, but Robin Hanson disagrees with me, and I feel like the famous argument we had back in the early 2010s, late 2000s about how this would all play out. I basically feel like this was the Yudkowsky position, this is the Hanson position, and then reality was over here, well to the Yudkowsky side of the Yudkowsky position in the Yudkowsky-Hanson debate.
    • Who else disagrees with me? I'm sure Robin Hanson would be happy to come on... well, I'm not sure he'd be happy to come on this podcast, but Robin Hanson disagrees with me, and I kind of feel like the famous argument we had back in the early 2010s, late 2000s about how this would all play out—I basically feel like this was the Yudkowsky position, this is the Hanson position, and then reality was over here, well to the Yudkowsky side of the Yudkowsky position in the Yudkowsky-Hanson debate.
  • But Robin Hanson does not feel that way. I would probably be happy to expound on that at length.
    • But Robin Hanson does not feel that way, and would probably be happy to expound on that at length. 
  • Open sourcing all the demon summoning circles is not the correct solution. I'm not even using, and I'm using Elon Musk's own terminology here. And they talk about AI is summoning the demon,
    • Open sourcing all the demon summoning circles is not the correct solution. And I'm using Elon Musk's own terminology here. He talked about AI as “summoning the demon”,
  • You know, now, now the stuff that would, that was obvious back in 2015 is, you know, starting to become visible and distance to others and not just like completely invisible. 
    • You know, now the stuff that was obvious back in 2015 is, you know, starting to become visible in the distance to others and not just completely invisible.
  • I, I suspect that if there's hope at all, it comes from a technical solution because the difference between technical solution, technical problems and political problems is at least the technical problems have solutions in principle.
    • I suspect that if there's hope at all, it comes from a technical solution, because the difference between technical problems and political problems is at least the technical problems have solutions in principle.

Gratitude to Andrea_Miotti, remember, and vonk for posting more-timely transcripts of this so LW could talk about it at the time -- and for providing a v1 transcript to give me a head start.

The Q&A transcript on LW is drastically worse than the main-episode transcript, to the point that it might well reduce the net accuracy of readers' beliefs if they aren't careful? I won't try to summarize all the important fixes I made to that transcript, because there are so many. I also cut out the first 15 minutes of the Q&A, which are Eliezerless and mostly consist of Bankless ads and announcements.

But this seems to contradict the element of Non-Deception. If you're not actually on the same side as the people who disagree with you, why would you (as a very strong but defeasible default) role-play otherwise?

This is a good question!! Note that in the original footnote in my post, "on the same side" is a hyperlink going to a comment by Val:

"Some version of civility and/or friendliness and/or a spirit of camaraderie and goodwill seems like a useful ingredient in many discussions. I'm not sure how best to achieve this in ways that are emotionally honest ('pretending to be cheerful and warm when you don't feel that way' sounds like the wrong move to me), or how to achieve this without steering away from candor, openness, 'realness', etc."

I think the core thing here is same-sidedness.

That has nothing to do directly with being friendly/civil/etc., although it'll probably naturally result in friendliness/etc.

(Like you seem to, I think aiming for cheerfulness/warmth/etc. is rather a bad idea.)

If you & I are arguing but there's a common-knowledge undercurrent of same-sidedness, then even impassioned and cutting remarks are pretty easy to take in stride. "No, you're being stupid here, this is what we've got to attend to" doesn't get taken as an actual personal attack because the underlying feeling is of cooperation. Not totally unlike when affectionate friends say things like "You're such a jerk."

This is totally different from creating comfort. I think lots of folk get this one confused. Your comfort is none of my business, and vice versa. If I can keep that straight while coming from a same-sided POV, and if you do something similar, then it's easy to argue and listen both in good faith.

I think this is one piece of the puzzle. I think another piece is some version of "being on the same side in this sense doesn't entail agreeing about the relevant facts; the goal isn't to trick people into thinking your disagreements are small, it's to make typical disagreements feel less like battles between warring armies".

I don't think this grounds out in simple mathematics that transcends brain architecture, but I wouldn't be surprised if it grounds out in pretty simple and general facts about how human brains happen to work. (I do think the principle being proposed here hasn't been stated super clearly, and hasn't been argued for super clearly either, and until that changes it should be contested and argued about rather than taken fully for granted.)

But why should we err at all? Should we not, rather, use as many carrots and sticks as is optimal?

"Err on the side of X" here doesn't mean "prefer erring over optimality"; it means "prefer errors in direction X over errors in the other direction". This is still vague, since it doesn't say how much to care about this difference; but it's not trivial advice (or trivially mistaken).

so when I see the brand name being used to market a particular set of discourse norms without a clear explanation of how these norms are derived from the law, that bothers me enough to quickly write an essay or two about it

Seems great to me! I share your intuition that Goodwill seems a bit odd to include. I think it's right to push back on proposed norms like these and talk about how justified they are, and I hope my list can be the start of a conversation like that rather than the end.

I do have an intuition that Goodwill, or something similar to Goodwill, plays an important role in the vast majority of human discourse that reliably produces truth. But I'm not sure why; if I knew very crisply what was going on here, maybe I could reduce it to other rules that are simpler and more universal.

Basically, the fact that LW has far more arguments for "alignment will be hard" than for "alignment will be easy" is the selection effect I'm talking about.

That could either be 'we're selecting for good arguments, and the good arguments point toward alignment being hard', or it could be a non-epistemic selection effect.

Why do you think it's a non-epistemic selection effect? It's easier to find arguments for 'the Earth is round' than 'the Earth is flat', but that doesn't demonstrate a non-epistemic bias.

I was also worried because ML people don't really think that AGI poses an existential risk, and that's evidence, in an Aumann sense.

... By 'an Aumann sense' do you just mean 'if you know nothing about a brain, then knowing it believes P is some Bayesian evidence for the truth of P'? That seems like a very weird way to use "Aumann", but if that's what you mean then sure. It's trivial evidence to anyone who's spent much time poking at the details, but it's evidence.
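
To spell out the Bayesian bookkeeping behind that "trivial evidence" point (a minimal sketch with made-up numbers, not anything load-bearing): if a brain is even slightly more likely to believe P when P is true than when P is false, then its belief is some evidence for P, however weak.

```python
# Illustrative numbers only: a belief counts as Bayesian evidence for P
# whenever the likelihood ratio P(believes P | P) / P(believes P | not-P)
# exceeds 1 -- but the resulting update can be tiny.

def posterior(prior, p_believe_if_true, p_believe_if_false):
    """Bayes' rule: P(P is true | the brain believes P)."""
    numerator = prior * p_believe_if_true
    return numerator / (numerator + (1 - prior) * p_believe_if_false)

# Hypothetical: belief is only slightly more likely if P is actually true.
print(posterior(prior=0.5, p_believe_if_true=0.6, p_believe_if_false=0.5))
# ~0.545 -- evidence in the technical sense, but it barely moves the needle.
```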

I think a more likely thing we'd want to stick around to do in that world is 'try to accelerate humanity to AGI ASAP'. "Sufficiently advanced AGI converges to human-friendly values" is weaker than "AGI will just have human-friendly values by default".
