Vanessa Kosoy

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

IMO it might very well be that most restrictions on data and compute are net positive. However, there are arguments in both directions.

On my model, current AI algorithms are missing some key ingredients for AGI, but they might still eventually produce AGI by *learning* those missing ingredients. This is similar to how biological evolution is a learning algorithm which is not a GI, but it produced humans who are GIs. Such an AGI would be a mesa-optimizer, and it's liable to be unaligned regardless of the details of the outer loop (assuming an outer loop made of building blocks similar to what we have today). For example, the outer loop might be aimed at human imitation, but the resulting mesa-optimizer only imitates humans when that is instrumentally beneficial for it. Moreover, as in the case of evolution, this process would probably be very costly in terms of compute and data, as it is trying to "brute force" a problem for which it doesn't have an efficient algorithm. Therefore, limiting compute or data seems like a promising way to prevent this undesirable scenario.

On the other hand, the most likely path to *aligned* AI would be through a design that's based on solid theoretical principles. Will such a design require much data or compute compared to unaligned competitors?

Reasons to think it won't:

- Solid theoretical principles should allow improving capabilities as well as alignment.
- Intuitively, if an AI is capable enough to be transformative (given access to particular amounts of compute and data), it should be capable enough to figure out human values, assuming it is motivated to do so in the first place. Or, it should at least be capable enough to act against unaligned competition while not irreversibly destroying information about human values (in which case it can catch up on learning those later). This is similar to what Christiano calls "strategy stealing".

Reasons to think it will:

- Maybe aligning AI requires installing safeguards that cause substantial overhead. This seems very plausible when looking at proposals such as Delegative Reinforcement Learning, which has worse regret asymptotics than "unaligned" alternatives (conventional RL). It also seems plausible when looking at proposals such as IDA or debate, which introduce another level of indirection (simulating humans) to the problem of optimizing the world that unaligned AI attacks directly (in Christiano's terminology, they fail to exploit inaccessible information). It's less clear about PreDCA, but even there alignment requires a loss function with a more complex type signature than the infra-Bayesian physicalism "default", which might incur a statistical or computational penalty.
- Maybe aligning AI requires restricting ourselves to using well-understood algorithmic building blocks and not heuristic (but possibly more efficient) building blocks. Optimistically, having solid theoretic principles should allow us to roughly predict the behavior even of heuristic algorithms that are effective (because such algorithms have to be doing qualitatively the same thing as the rigorous algorithms). Pessimistically, alignment might depend on nuances that are obscured in heuristics.

We can model the situation by imagining 3 frontiers in resource space:

- The mesa-resource-frontier (MRF) is how much resources are needed to create TAI with something similar to modern algorithms, i.e. while still missing key AGI ingredients (which is necessarily unaligned).
- The direct-resource-frontier (DRF) is how much resources are needed to create TAI assuming all key algorithms, but without any attempt at alignment.
- The aligned-resource-frontier (ARF) is how much resources are needed to create aligned TAI.

We have ARF > DRF and MRF > DRF, but the relation between ARF and MRF is not clear. They might even intersect (resource space is multidimensional, we at least have data vs compute and maybe finer distinctions are important). I would still guess MRF > ARF, by and large. Assuming MRF > ARF > DRF, the ideal policy would forbid resources beyond MRF but allow resources beyond ARF. A policy that is too lax might lead to doom by the mesa-optimizer pathway. A policy that is too strict might lead to doom by making alignment infeasible. If the policy is so strict that it forces us below DRF then it buys time (which is good), but if the restrictions are then lifted *gradually*, it predictably leads to the region between DRF and ARF (which is bad).
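One way to make this picture precise, as a rough sketch only: the symbols $R$, $S_M$, $S_D$, $S_A$ and $P$ below are notation introduced here for illustration. Treat resource space as $R = \mathbb{R}_{\geq 0}^k$ (e.g. $k = 2$ for data and compute), let each $S$ be the set of resource vectors that suffice for the corresponding kind of TAI (so each frontier is the boundary of its set), and let $P$ be the set of resource vectors a policy permits.

```latex
\begin{align*}
&\text{ARF} > \text{DRF}: \quad S_A \subseteq S_D,
\qquad \text{MRF} > \text{DRF}: \quad S_M \subseteq S_D \\
&\text{guess } \text{MRF} > \text{ARF}: \quad S_M \subseteq S_A \\
&\text{ideal policy}: \quad P \cap S_M = \emptyset \ \text{ and } \ P \cap S_A \neq \emptyset \\
&\text{too lax}: \quad P \cap S_M \neq \emptyset,
\qquad \text{too strict}: \quad P \cap S_A = \emptyset \\
&\text{buys time}: \quad P \cap S_D = \emptyset
\end{align*}
```

In this notation, gradually enlarging a $P$ that starts disjoint from $S_D$ would typically reach $S_D \setminus S_A$ (the region between DRF and ARF) before it reaches $S_A$. The inclusion $S_M \subseteq S_A$ is only the guess stated above; if the two frontiers intersect, neither set contains the other.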

Overall, the conclusion is uncertain.

There are two operations involved in the definition of the bundle: pullback and tensor product.

Pullback is defined for arbitrary bundles. Given a mapping $f: X \to Y$ (these $X$ and $Y$ are arbitrary manifolds, not the specific ones from before) and a bundle $B$ over $Y$ with total space $E$ and projection mapping $p: E \to Y$, the pullback of $B$ w.r.t. $f$ (denoted $f^*B$) is the bundle over $X$ with total space $X \times_Y E$ and the obvious projection mapping. I remind that $X \times_Y E$ is the fibre product, i.e. the submanifold of $X \times E$ defined by $f(x) = p(e)$. Notice that the fibre of $f^*B$ over any $x \in X$ is *canonically* isomorphic to the fibre of $B$ over $f(x)$. The word "canonical" means that there is a *particular* isomorphism that we obtain from the construction.

It is easy enough to see that the pullback of a vector bundle is a vector bundle, the pullback of a line bundle is a line bundle, and the pullback of a Hermitian vector bundle is a Hermitian vector bundle.
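To make the canonical isomorphism concrete, here is a short sketch. The symbol names are assumptions chosen for this illustration: $f: X \to Y$ is the mapping, $p: E \to Y$ is the projection of the bundle $B$, and $B_y = p^{-1}(y)$ is the fibre over $y$.

```latex
\begin{align*}
X \times_Y E &= \{ (x, e) \in X \times E \mid f(x) = p(e) \}, \\
(f^*B)_x &= \{ (x, e) \mid e \in p^{-1}(f(x)) \}, \\
(f^*B)_x &\to B_{f(x)}, \qquad (x, e) \mapsto e .
\end{align*}
```

The last map is a bijection that comes directly from the construction, with no choices involved, which is what "canonical" refers to. For example, if $f$ is the constant map with value $y_0$, then $f^*B$ is the product $X \times B_{y_0}$.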

Tensor product is an operation over vector bundles. There are different ways to define it, corresponding to the different ways to define a tensor product of vector spaces. Specifically for line bundles there is the following shortcut definition. Let $L_1$ and $L_2$ be line bundles over $X$. Then, the total space of $L_1 \otimes L_2$ is the quotient of $L_1 \times_X L_2$ by the equivalence relation given by: $(v_1, v_2) \sim (w_1, w_2)$ iff $v_1 \otimes v_2 = w_1 \otimes w_2$. Here, I regard $v_1, w_1$ as vectors in the vector space which is the corresponding fibre of $L_1$ and similarly for $v_2, w_2$ and $L_2$. The quotient of a manifold by an equivalence relation is not always a manifold, but in this case it is.

I notice that you wrote "a particular fiber is isomorphic to $\mathbb{C}$". Your error here is, it doesn't matter what it's *isomorphic* to, you should still think of it as an abstract vector space. So, if e.g. $U$ and $V$ are 1-dimensional vector spaces, then $U \otimes V$ is yet another "new" vector space. Yes, they are all isomorphic, but they are not canonically isomorphic.

Your guess is exactly what I meant. The $\psi$ is outside the product, otherwise this expression is not even a valid group action.

Now, about bundles.

As you said, a bundle over a manifold $M$ is another manifold $E$ with a projection $p: E \to M$ s.t. locally it looks like a product. Formally, every $x \in M$ should have an open neighborhood $U$ s.t. there is a diffeomorphism between $p$ restricted to $p^{-1}(U)$ and a projection $U \times F \to U$ for some manifold $F$ (the "fiber").

A *vector bundle* is a bundle equipped with additional structure that makes every fiber a vector space. Formally, we need to have a smooth addition mapping $+: E \times_M E \to E$ and a multiplication-by-scalar mapping $\cdot: \mathbb{C} \times E \to E$ which are (i) morphisms of bundles and (ii) make every fiber (i.e. the inverse $p$-image of every point in $M$) into a vector space. Here, $\times_M$ stands for the fibre product (the submanifold of $E \times E$ given by $p(e_1) = p(e_2)$). I'm using $\mathbb{C}$ here because we will need complex vector bundles.

A *line bundle* is just a vector bundle s.t. every fiber is 1-dimensional.

A *Hermitian* vector bundle is a vector bundle equipped with a smooth mapping of bundles which makes every fibre into an inner product space.

Onward to quantum mechanics. Let $X$ be physical space and $Y$ physical spacetime. In the non-relativistic setting, $Y$ is isomorphic to $\mathbb{R}^4$, so all Hermitian line bundles over $Y$ are isomorphic. So, in principle any one of them can be identified with the trivial bundle: total space $Y \times \mathbb{C}$ with $p$ being the canonical projection. However, it is often better to imagine some Hermitian line bundle without such an identification. In fact, choosing an identification precisely corresponds to choosing a gauge. This is like how all finite dimensional real vector spaces are isomorphic to $\mathbb{R}^n$ but it is often better not to fix a particular isomorphism (basis), because that obscures the underlying symmetry group of the problem. For finite dimensional vector spaces, the symmetry group is the automorphisms of the vector space (a group isomorphic to $GL(n, \mathbb{R})$), for bundles it is the automorphism group of the bundle (= the group of gauge transformations).

So, let's fix a Hermitian line bundle on . This allows constructing a Hermitian line bundle on (where is the number of particles) using the equation I gave before. That equation involves the operations of tensor product and pullback-by-mapping for bundles. I can explain, but maybe you can guess how they are defined (just imagine what it should do to every fibre, and then there is only one reasonable way to "glue" it together). If we fix an isomorphism between and the trivial bundle over (=gauge) then it induces an isomorphism between and the trivial bundle over . In this picture, saying that is a section of amounts to saying it is a mapping which is compatible with the projection. The latter condition just means it is the identity on the component of the output, so all the information is in the component on the output, reducing it to a mapping .

This way, in every particular gauge the wavefunction is just a complex function, but there is a sense in which it is better to avoid fixing a gauge and think of the wavefunction as a section of the somewhat abstract bundle . Just like a vector in a finite dimensional vector space can be thought of as a column of numbers, but often it's better to think of it as just an abstract vector.
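To spell out the analogy with column vectors, here is a sketch in assumed notation: $M$ is the base manifold, $\tau$ and $\tau'$ are two gauges (unitary trivializations of the same Hermitian line bundle), and $\psi_\tau$ is the complex function that represents the section $\psi$ in the gauge $\tau$.

```latex
\begin{align*}
\tau, \tau' &: \text{bundle} \xrightarrow{\ \cong\ } M \times \mathbb{C}, \\
(\tau' \circ \tau^{-1})(m, z) &= (m, \ g(m)\, z), \qquad g : M \to U(1), \\
\psi_{\tau'}(m) &= g(m)\, \psi_\tau(m) .
\end{align*}
```

So switching gauge changes the complex function only by a pointwise phase, much like switching basis changes the column of numbers but not the abstract vector. In particular $|\psi_\tau(m)|$ does not depend on the gauge.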

You don't need QFT here, gauge invariance is a thing even for non-relativistic quantum charged particles moving in a background electromagnetic field. The gauge transformation group consists of (sufficiently regular) functions $g: \mathbb{R}^4 \to U(1)$. The transformation law of the $n$-particle wavefunction is:

$$\psi(t, x_1, \ldots, x_n) \mapsto \left( \prod_{i=1}^n g(t, x_i)^{q_i} \right) \psi(t, x_1, \ldots, x_n)$$

Here, $q_i$ is the electric charge of the $i$-th particle, in units of positron charge.

In math-jargony terms, the wavefunction is a section of the line bundle

$$\bigotimes_{i=1}^n \pi_i^* L^{\otimes q_i}$$

Here, $\pi_i$ is the projection to the position of the $i$-th particle and $L$ is the "standard" line bundle on $\mathbb{R}^4$ on which the electromagnetic field (the 4-potential $A$, which is useful here even though the setting is non-relativistic) is a connection. This bundle has an induced connection, and the electromagnetic time-dependent Schrödinger equation is obtained from the ordinary time-dependent Schrödinger equation by replacing ordinary derivatives with covariant derivatives.
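As a sketch of what replacing ordinary derivatives with covariant ones looks like in coordinates, here is the standard minimal-coupling form. The conventions are assumptions made for this illustration: SI units, scalar potential $\varphi$ and vector potential $\mathbf{A}$ as the components of the 4-potential, particle masses $m_i$, charges $q_i e$ with $e$ the positron charge, and an interaction potential $V$.

```latex
\begin{align*}
\nabla_i \ &\to\ \nabla_i - \frac{i q_i e}{\hbar}\, \mathbf{A}(t, x_i),
\qquad
\partial_t \ \to\ \partial_t + \frac{i e}{\hbar} \sum_{i=1}^n q_i\, \varphi(t, x_i), \\
i \hbar\, \partial_t \psi &= \sum_{i=1}^n \frac{1}{2 m_i}
\left( -i \hbar \nabla_i - q_i e\, \mathbf{A}(t, x_i) \right)^2 \psi
+ \sum_{i=1}^n q_i e\, \varphi(t, x_i)\, \psi + V \psi .
\end{align*}
```

Under these conventions the equation keeps its form when $\mathbf{A} \to \mathbf{A} + \nabla \chi$, $\varphi \to \varphi - \partial_t \chi$ and $\psi \to \prod_i g(t, x_i)^{q_i}\, \psi$ with $g = e^{i e \chi / \hbar}$, for any sufficiently regular real function $\chi$.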

Other people have noted that Solomonoff log-probability differs from Kolmogorov complexity only by a constant. But there's another similar pair of objects I'm interested in, where I don't know whether the analogous claim holds. Namely, in my original definition of the AIT intelligence measure, I used Kolmogorov complexity, because I implicitly assumed it's the same as Solomonoff log-probability up to a constant. But Alex questioned this claim, which is why I switched to Solomonoff log-probability when writing about the physicalist version (see Definition 1.6 here). The crucial difference between this and the question in the OP is, we're looking at programs selected by Solomonoff-expectation of something-to-do-with-their-output, rather than directly by their output (which places us on different spots on the computability ladder). If the two are different then I'm pretty sure Solomonoff log-probability is the correct one, but are they? I would be very interested to know.
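For reference, the constant-gap claim in its usual form is Levin's coding theorem. The statement below assumes the discrete setting (finite strings $x$ and a universal prefix-free machine $U$), which may not be exactly the setting of the OP; $m$ and $K$ are the standard textbook symbols rather than notation from this discussion.

```latex
\begin{align*}
m(x) &= \sum_{p \,:\, U(p) = x} 2^{-|p|},
\qquad
K(x) = \min \{\, |p| : U(p) = x \,\}, \\
K(x) &= -\log_2 m(x) + O(1) .
\end{align*}
```

The open question above is whether an analogous identity survives when programs are weighted by an expectation of something that depends on their output, rather than by the output itself.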

Personally, I feel that I want to be pretty as a goal in itself *and* in order to be attractive to straight men (and to other gynosexual people). I suspect women[1] have an evolved intrinsic desire to look pretty because in the ancestral environment it increased your genetic fitness to look pretty. To give an analogy, we eat both because (i) we are hungry and also food tastes good, *and* because (ii) we know we need to eat to survive (e.g. if you're sick and have no appetite you sometimes force yourself to eat) and we need particular types of food to stay healthy. A single activity can be motivated by a mixture of terminal and instrumental goals.

[1] And other genders too, but for women it's more pronounced, on average.

The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven't made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven't found any new problems with the framework.

The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured "attractor submanifold" in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.

One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know at present to solving alignment, and an important (informal) demonstration of the potential of the IBP formalism.

It should be . More generally, there is the notion of support from measure theory, which sometimes comes up, although in this post we only work with finite sets so it's the same.

First, the notation makes no sense. The prior is over hypotheses, each of which is an element of . is the notation used to denote a single hypothesis.

Second, having a prior just over doesn't work since both the loss function and the counterfactuals depend on .

Third, the reason we don't just start with a prior over , is because it's important *which* prior we have. Arguably, the correct prior is the image of a simplicity prior over physicalist hypotheses by the bridge transform. But, come to think about it, it might be about the same as having a simplicity prior over , where each hypothesis is constrained to be *invariant* under the bridge transform (thanks to Proposition 2.8). So, maybe we can reformulate the framework to get rid of (but not of the bridge transform). Then again, finding the "ultimate prior" for general intelligence is a big open problem, and maybe in the end we will need to specify it with the help of .

Fourth, I wouldn't say that is supposed to solve the ontology identification problem. The way IBP solves the ontology identification problem is by asserting that is the correct ontology. And then there are tricks how to translate between other ontologies and this ontology (which is what section 3 is about).

The syntax $L^{\otimes q}$ means "$L$ to the tensor power of $q$". For $q > 0$, it just means tensoring $L$ with itself $q$ times. For $q = 0$, $L^{\otimes q}$ is just the trivial line bundle with total space $Y \times \mathbb{C}$ (and, yes, all line bundles are isomorphic to the trivial line bundle, but this one just *is* the trivial bundle... or at least, *canonically* isomorphic to it). For $q < 0$, we need the notion of a *dual* vector bundle. Any vector bundle $V$ has a dual $V^*$, and for a line bundle the dual is also the inverse, in the sense that $L \otimes L^*$ is *canonically* isomorphic to the trivial bundle. We can then define all negative powers by $L^{\otimes q} := (L^*)^{\otimes -q}$. Notice that non-negative tensor powers are defined for all vector bundles, but negative tensor powers only make sense for line bundles.

It remains to explain what $V^*$ is. But, for our purposes we can take a shortcut. The idea is, for any finite-dimensional complex vector space $U$ with an inner product, there is a canonical isomorphism between $U^*$ and $\bar{U}$, where $\bar{U}$ is the complex-conjugate space. What is the complex-conjugate space? It is a vector space that (i) has the same set of vectors (ii) has the same addition operation and (iii) has its multiplication-by-scalar operation modified, so that multiplying $u$ by $z$ in $\bar{U}$ is the same thing as multiplying $u$ by $\bar{z}$ in $U$, where $\bar{z}$ is just the complex number conjugate to $z$.

Equipped with this observation, we can define the dual of a *Hermitian* line bundle $L$ to be $\bar{L}$, where $\bar{L}$ is the bundle obtained from $L$ by changing its multiplication-by-scalar mapping in the obvious way.
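As a quick check of this shortcut, here is a sketch under one assumed convention: the inner product $\langle \cdot, \cdot \rangle$ is conjugate-linear in its first argument and linear in its second, and $\bar{v}$ denotes the vector $v$ regarded as an element of the conjugate space. The names $\Phi$, $u$ and $v$ are introduced here for illustration.

```latex
\begin{align*}
\Phi &: \bar{U} \to U^*, \qquad \Phi(\bar{v}) = \langle v, \cdot \rangle, \\
\Phi(z \cdot \bar{v}) &= \Phi\left( \overline{\bar{z} v} \right)
= \langle \bar{z} v, \cdot \rangle = z \langle v, \cdot \rangle = z\, \Phi(\bar{v}), \\
L \otimes \bar{L} &\to \text{trivial bundle}, \qquad
u \otimes \bar{v} \mapsto \langle v, u \rangle \quad \text{(fibrewise)} .
\end{align*}
```

The second line shows that $\Phi$ is complex-linear, which is exactly why the scalar multiplication had to be conjugated. The third line is the canonical isomorphism that makes the conjugate bundle an inverse: on each 1-dimensional fibre it is linear in both factors and nonzero whenever $u = v \neq 0$.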