Comments

“well, maybe that’s not the best way to think about trained networks and their effects”

This seems fine if you're trying to understand how current or near-future ML models work and how to make them safer, but I think in the longer run it's inevitable that we end up with AIs that are more or less well-described as "goal-directed optimizers", so studying this concept probably won't be "wasted" even if it's not directly useful now.

Aside from a technical alignment perspective, it also seems strategically important to better understand how to model future goal-directed AIs: for example, whether their decision/game theories will allow unaligned AIs to asymmetrically extort aligned AIs (or give them more bargaining power because they have less to lose than aligned AIs), or whether acausal trade will be a thing. This seems like an important input into various near-term decisions, such as how much risk of unaligned AI we should tolerate.

Personally I prioritize studying metaphilosophy above topics directly related to "goal-directed optimizers" such as decision theory, as I see the former as a bit more urgent and neglected than the latter, but also find it hard to sympathize with describing the study of the latter as "weird".

Wei Dai and Robin Hanson seem to be gesturing at this point from different directions: how not doing philosophy correctly is liable to get us lost in the long term, and how getting lost in the long term is a basic fact of the human condition and AIs don’t change that.

Interesting connection you draw here, but I don't see how "AIs don’t change that" can be justified (unless interpreted loosely to mean "there is risk either way"). From my perspective, AIs can easily make this problem better (stop the complacent value drift as you suggest, although so far I'm not seeing much evidence of urgency), or worse (differentially decelerate philosophical progress by being philosophically incompetent). What's your view on Robin's position?

How many times has someone expressed “I’m worried about ‘goal-directed optimizers’, but I’m not sure what exactly they are, so I’m going to work on deconfusion.”? There’s something weird about this sentiment, don’t you think? I can’t quite put my finger on what, and I wanted to get this post out.

This community inherited the concept of "goal-directed optimizers" and attempted formalizations of it from academia (e.g., vNM decision theory, AIXI). These academic ideas also clearly describe aspects of reality (e.g., decision theory having served as the foundation of economics for several decades now).

Given this, are we not supposed to be both worried (due to the threatening implications of modeling future AIs as goal-directed optimizers) and also confused (due to existing academic theories having various open problems)? Or what is the "not weird" response or course of action here?

I wrote a post expressing similar sentiments but perhaps with a different slant. To me, apparent human morality along the lines of "heretics deserve eternal torture in hell" or what was expressed during the Chinese Cultural Revolution are themselves largely a product of status games, and there's a big chance that these apparent values do not represent people's true values and instead represent some kind of error (but I'm not sure and would not want to rely on this being true). See also Six Plausible Meta-Ethical Alternatives for some relevant background.

But you're right that the focus of my post here is on people who endorse altruistic values that seem more reasonable to me, like EAs, and maybe earlier (pre-1949) Chinese supporters of communism who were mostly just trying to build a modern nation with a good economy and good governance, but didn't take seriously enough the risk that their plan would backfire catastrophically.

"China’s first attempt at industrialization started in 1861 under the Qing monarchy. Wen wrote that China “embarked on a series of ambitious programs to modernize its backward agrarian economy, including establishing a modern navy and industrial system.”

"However, the effort failed to accomplish its mission over the next 50 years. Wen noted that the government was deep in debt and the industrial base was nowhere in sight." https://www.stlouisfed.org/on-the-economy/2016/june/chinas-previous-attempts-industrialization

Improving institutions is an extremely hard problem. The theory we have on it is of limited use (things like game theory, mechanism design, contract theory), and with AI governance/institutions specifically, we don't have much time for experimentation or room for failure.

So I think this is a fine frame, but it doesn't really suggest any useful conclusions aside from the same old "let's pause AI so we can have more time to figure out a safe path forward".

Current AIs are not able to “merge” with each other.

AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using their combined compute, algorithms, and data, plus fine-tuning.
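For concreteness, here's a minimal sketch (my own illustration, not from the post) of the simplest version of the first kind of merging: linear interpolation of the weights of two fine-tunes that share an architecture, in the spirit of "model soups". The function name and the alpha parameter are hypothetical, and this glosses over details like non-float buffers:

```python
# Minimal sketch of naive weight-space merging: linearly interpolate the
# parameters of two models that share an architecture (e.g., two fine-tunes
# of the same base model). Illustrative only; assumes float parameters.
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return a state dict interpolating between sd_a (weight alpha) and sd_b."""
    assert sd_a.keys() == sd_b.keys(), "models must share an architecture"
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Usage (model_a, model_b are torch.nn.Module instances of the same class):
#   merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict())
#   model_a.load_state_dict(merged)  # model_a now holds the merged weights
```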

As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.

How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward? In your last sentence you say "However, it’s perhaps significantly more likely in the very long-run." Well, what can we do today to reduce this long-run risk (aside from pausing AI, which you're presumably not supporting)?

That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening).

Others already questioned you on this, but the fact that you didn't think to mention whether this is 50 calendar years or 50 subjective years is also a big sticking point for me.

Who, a decade ago, thought that AI would think symbolically? I'm struggling to think of anyone. There was a debate on LW, though, around "cleanly designed" versus "heuristics based" AIs: which might come first, and which one safety efforts should be focused on. (This was my contribution to it.)

If someone had followed this discussion, there would have been no need for dramatic updates or admissions of wrongness, just a (more or less) smooth updating of one's credences as subsequent observations came in, perhaps becoming increasingly pessimistic if one's hope for AI safety mainly rested on actual AIs being "cleanly designed" (as Eliezer's did). (I guess I'm a bit peeved that you single out an example of a "dramatic update" for praise, while not mentioning people who had appropriate uncertainty all along and updated constantly.)

Ah I see, thanks for the clarification. Personally I'm uncertain about this, and have some credence on each possibility, and may have written the OP to include both possibilities without explicitly distinguishing between them. See also #3 in this EAF comment and its followup for more of how I think about this.

Thanks for the pointers. I think these proposals are unlikely to succeed (or at least are very risky) and/or liable to give people a false sense of security (that we've solved the problem when we actually haven't), absent a large amount of philosophical progress, which we're unlikely to achieve given how slow philosophical progress typically is and how few resources and efforts are devoted to it. Thus I find it hard to understand why @evhub wrote "I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain." if these are the kinds of ideas he has in mind.

Thanks, I think you make good points, but I take some issue with your metaethics.

Personally I’m a moral anti-realist

There are a variety of ways to not be a moral realist; are you sure you're an "anti-realist" and not, say, a relativist or a subjectivist? (See Six Plausible Meta-Ethical Alternatives for short descriptions of these positions.) Or do you just mean that you're not a realist?

Also, I find this kind of certainty baffling for a philosophical question that seems very much open to me. (Sorry to pick on you personally as you're far from the only person who is this certain about metaethics.) I tried to explain some object-level reasons for uncertainty in that post, but also at a meta level, it seems to me that:

  1. We've explored only a small fraction of the space of possible philosophical arguments, so there could be lots of good arguments against our favorite positions that we haven't come across yet. (Just look at how many considerations about decision theory people had missed or are still missing.)
  2. We haven't solved metaphilosophy yet so we shouldn't have much certainty that the arguments that convinced us or seem convincing to us are actually good.
  3. People who otherwise seem smart and reasonable can have very different philosophical intuitions, so we shouldn't be so sure that our own intuitions are right.

or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it whatever way we see fit, because all it wanted was to help us

What if we are not only moral realists, but moral realism is actually right and the AI has also correctly reached that conclusion? Then it might objectively have moral patienthood, and trying to convince us otherwise would be hurting us (causing us to commit a moral error), not helping us. It seems like you're not fully considering moral realism as a possibility, even in the part of your comment where you're trying to be more neutral about metaethics, i.e., before you said "Personally I’m a moral anti-realist".
