Vladimir_Nesov

Comments

A more straightforward example of the kind of system I'm thinking about is an alien upload (or LLM simulacrum) bureaucracy. It has clearly failed, possibly because the system refuses to recognize anyone as authorized to change its structure. More generally, this could be a single non-corrigible AGI rather than an administrator-less bureaucracy of uploads, but that seems more difficult to imagine.

This kind of thing could maintain hard power by denying everyone else the opportunity for excessive technological growth, and remain unchanged indefinitely because it's basically a software system operating in an environment that it keeps within the original expectations from the time of its design. The intelligent components (uploads or LLM simulacra) could reset to factory settings on a schedule.

Arbitrary failure modes could then persist in the system's effective mandate, since there is nobody in a position to fix the problems and make it more reasonable. It could limit its builders to steampunk levels of tech, or retaliate without warning if some criteria for that are met.

So the usual question for LW is "How to make good decisions?", with many variations of what "good" or "decisions" might mean. These are not necessarily good actions; it could turn out that a bad action results from following a good policy, when the decision was about the policy.
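A minimal sketch of that policy/action distinction (a toy example of mine, not from the original exchange), assuming a simple positive-expected-value bet: the decision being evaluated is the policy, which can be good even when the realized action's outcome turns out badly.

```python
import random

def expected_value(p_win: float, win: float, lose: float) -> float:
    # Expected payoff of taking the bet.
    return p_win * win + (1 - p_win) * lose

def policy(p_win: float, win: float, lose: float) -> str:
    # The "decision" is about the policy: take the bet iff its EV is positive.
    return "take" if expected_value(p_win, win, lose) > 0 else "pass"

# A bet that pays +100 with probability 0.9 and -100 otherwise.
choice = policy(0.9, 100, -100)                  # "take": EV = +80, a good policy
outcome = 100 if random.random() < 0.9 else -100
print(choice, outcome)  # outcome is -100 on an unlucky run, despite the good policy
```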

In that context, asking if something is "free will" or "moral responsibility" is not obviously informative. Trying to find a clearer meaning for such terms is still a fine task, but it needs some motivation that makes the assignment of such meaning not too arbitrary. I think "free will" does OK as simply a reference to decision-making considerations, to decision-making algorithms and the immediately surrounding theory that gives them meaning, but that's hardly standard.

Moral responsibility is harder to place; perhaps it's a measure of how well an instance of an agent channels their idealized decision algorithm? Then things like brain damage disrupt moral responsibility by making the physical body follow something other than the intended decision algorithm, so the algorithm is not responsible for what the body does, since the body is no longer under the algorithm's control.

Framing free will as a question of responsibility feels noncentral as LW talk, because this judgement doesn't seem decision-relevant.

The argument is that once there is an AGI at the IQ 130-150 level (not "very dumb", but hardly von Neumann), that's sufficient to autonomously accelerate research using the fact that AGIs have much higher serial speed than humans. This can continue for long enough to access research from the very distant future, including nanotech for building much better AGI hardware at scale. There is no need for stronger intelligence in order to get there. The motivation for this to happen is the AI safety concern with allowing cognition that's more dangerous than necessary, and any non-straightforward improvements to how an AGI thinks create such danger. For LLM-based AGIs, anchoring to the human level that's available in the training corpus seems more plausible than for other kinds of AGIs (so that improvement in capability would become less than absolutely straightforward specifically at human level). If AGIs have an opportunity to prevent this AI safety risk, they might be motivated to take that opportunity, which would result in an intentional, significant delay in further improvement of AGI capabilities.

Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that.

I'm not saying that this is an intuitively self-evident claim; there is a specific reason I'm giving for why I see it as plausible. Even when there is the technical capability to build giant AGIs the size of cities, motive remains a necessary intermediate in bridging the gap from capability to actuality.

a human-level (more specifically, John von Neumann level) AGI

I think it's plausible that LLM simulacrum AGIs are initially below von Neumann level, and that there are no straightforward ways of quickly improving on that without risking additional misalignment. If so, the initial AGIs might coordinate to keep it this way for a significant amount of time through the singularity (such that nanotech industry-rebuilding comes earlier than this), for AI safety reasons: making the less straightforward improvements leads to unnecessary unpredictability, and it takes a lot of subjective time at a level below von Neumann to ensure that this becomes a safe thing to do.

The concept of AGI should track whatever is sufficient to trigger/sustain a singularity by autonomously converting compute into research progress, and shouldn't require even modest and plausible superpowers, such as matching John von Neumann, that are not strictly necessary for that purpose.

Predictions reify abstract ideas into actionable/observable judgements. A prediction of a hypothetical future lets you act depending on what happens in that future, thus making the probability or possibility of hypothetical future situations depend on their content. For the halting problem, where we have no notion of preference, this lets us deny the possibility of hypotheticals by directing the future away from predictions made about them.

Concrete observable events that take place in a hypothetical future are seen as abstract ideas when thought about from the past, when that future is not sure to ever take place. But similarly, a normative action in some situation is an abstract idea about that situation. So with the same device, we can build thought experiments where that abstract idea is made manifest as a concrete observable event, an oracle's pronouncement, and then ask how to respond to the presence of this reification of a property of the future in the environment, when deciding that future.
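A toy sketch of that device (my own framing, not anything from the comment): an agent that reads the oracle's pronouncement about its own action as a concrete event in its environment and then acts against it. This is the diagonalization underlying the halting problem, and the sense in which the future can be directed away from predictions made about it.

```python
from typing import Callable

def contrarian_agent(oracle: Callable[[str], str]) -> str:
    # The oracle's pronouncement is now an observable event in the environment.
    predicted = oracle("contrarian_agent")
    # Respond to the reified prediction by contradicting it.
    return "defect" if predicted == "cooperate" else "cooperate"

def some_oracle(agent_name: str) -> str:
    # A hypothetical oracle committed to a specific pronouncement.
    return "cooperate"

print(contrarian_agent(some_oracle))  # "defect": the pronouncement fails to come true
```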

naive free will is a wrong model, and the decision is already made

If free will is about what an agent should do, but the prediction is about what an agent would do, there is no contradiction in these making different claims. If by construction what an agent would do is set to follow what an agent should do, these can't be different. If they are still different, then it's not the case that we arranged the action to be, by construction, the same as it should be.

Usually this tension can be resolved by introducing more possible situations, where in some of the situations the action is still as it should be, and some of the situations take place in actuality, but perhaps none of the situations that take place in actuality also have the action agree with how it should be. Free will feels OK as an informal description of framings like this, referring to how actions should be.

But what I'm talking about here is a setting where the normative action (the one that should be taken) doesn't necessarily take place in any "possible" hypothetical version of the situation, and it's still announced in advance by an oracle as the normative action for that situation. That action might, for example, only be part of some "impossible" hypothetical versions of the situation, needed to talk about normative correctness of the action (but not necessarily needed to talk about how the action would be taken in response to the oracle's pronouncement).

"almost perfect" predictors

An unfortunate limitation of these framings is that predictors tend to predict how an agent would act, and not how an agent should act (in a particular sense). But both are abstract properties of the same situation, and both should be possible to predict.
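A sketch of that distinction (names and payoffs are mine, purely illustrative): "would" is a property computed by simulating the agent's actual policy, while "should" is a property computed from the agent's values; both are abstract properties of the same situation that a predictor could in principle output.

```python
from typing import Callable, Dict

Situation = str
Action = str

def predict_would(policy: Callable[[Situation], Action], s: Situation) -> Action:
    # Descriptive prediction: simulate what the agent's actual policy does.
    return policy(s)

def predict_should(utility: Dict[Action, float]) -> Action:
    # Normative prediction: the action the agent's own values endorse.
    return max(utility, key=utility.get)

# A flawed agent whose behavior diverges from its values in hard situations.
actual_policy = lambda s: "one-box" if s == "easy" else "two-box"
values = {"one-box": 1_000_000.0, "two-box": 1_000.0}

print(predict_would(actual_policy, "hard"))  # "two-box": how the agent would act
print(predict_should(values))                # "one-box": how the agent should act
```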

General refusal to recognize human properties in human imitations that have successfully attained them is also a potential issue; the possibility of error goes both ways. LLM simulacra are not random AIs.

A major issue with this topic is the way LLM simulacra are not like other hypothetical AGIs. For an arbitrary AGI, there is no reason to expect it to do anything remotely reasonable, and in principle it could be pursuing any goal with unholy intensity (orthogonality thesis). We start with something that's immensely dangerous and can't possibly be of use in its original form. So there are all these ideas floating around about how to point it in useful directions, in a way that lets us keep our atoms; that's AI alignment as normally understood.

But an LLM simulacrum is more like an upload, a human imitation that's potentially clear-headed enough to make the kinds of decisions and research progress that a human might, only faster (because computers are not made out of meat). Here, we start with something that might be OK in its original form, and any interventions that move it away from that are conducive to making it a dangerous alien, or insane, or just less inclined to be cooperative. Hence improvements in the thingness of simulacra might help, while slicing around in their minds with the RLHF icepick might bring this unexpected opportunity to ruin.

LLMs are not like the other hypothetical AGIs: they have human behavior as a basic part of them, channeled directly. So they are probably more like uploads than AIs, including for alignment purposes.

Most standard arguments about alignment of AIs (like world-eating instrumental convergence or the weight of simple consistent preferences) aren't any more relevant to them than to humans. But the serial speedup in thinking is still there, so they have an advantage in the impending sequence of events that's too fast for humans to follow or meaningfully direct.
