Vanessa Kosoy

Introduction To The Infra-Bayesianism Sequence

Logical induction doesn't have interesting guarantees in reinforcement learning, and doesn't reproduce UDT in any non-trivial way. It just doesn't solve the problems infra-Bayesianism sets out to solve.

Logical induction will consider a sufficiently good pseudorandom algorithm as being random.

A pseudorandom sequence is (by definition) indistinguishable from random by any cheap algorithm, not only logical induction, including a bounded infra-Bayesian.

If it understands most of reality, but not some fundamental particle, it will assume that the particle is behaving in an adversarial manor.

No. Infra-Bayesian agents have priors over infra-hypotheses. They don't start with complete Knightian uncertainty over everything and gradually reduce it. The Knightian uncertainty might "grow" or "shrink" as a result of the updates.

Vanessa Kosoy's Shortform

Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases leading to shutdown. This serves as an additional "sanity test" precaution.

Survey Results: 10 Fun Questions for LWers

Same! (Except that I used the google random number generator)

What's a Decomposable Alignment Topic?

**Motivation**: Starting the theoretical investigation of dialogic reinforcement learning (DLRL).

**Topic**: Consider the following setting. is a set of "actions", is a set of "queries", is a set of "annotations". is the set of "worlds" defined as . Here, the semantics of the first factor is "mapping from actions to rewards", the semantics of the second factor is "mapping from queries to {good, bad, ugly}", where "good" means "query can be answered", "bad" means "query cannot be answered", "ugly" means "making this query loses the game". In addition, we are given a fixed mapping (assigning to each query its semantics). is a set of "hypotheses" which is a subset of (i.e. each hypothesis is a belief about the world).

Some hypothesis represents the user's beliefs, but the agent doesn't know which. Instead, it only has a prior . On each round, the agent is allowed to either make an annotated query or take an action from . Taking an action produces a reward and ends the game. Making a query can either (i) produce a number, which is (good), or (ii) produce nothing (bad), or (iii) end the game with zero reward (ugly).

The problem is devising algorithms for the agent, s.t., in expectation w.r.t. , the -expected reward approximates the best possible -expected reward (the latter is what we would get if the agent knew which hypothesis is correct) and the number of queries is low. Propose sets of assumptions about the ingredients of the setting that lead to non-trivial bounds. Consider proving both positive results and negative results (the latter meaning: "no algorithm can achieve a bound better than...")

**Strategy:** See the theoretical research part of my other answer. I advise to start by looking for the minimal simplification of the setting about which it is still possible to prove non-trivial results. In addition, start with bounds that scale with the sizes of the sets in question, proceed to look for more refined parameters (analogous to VC dimension in offline learning).

What's a Decomposable Alignment Topic?

**Motivation**: Improving understanding of relationship between learning theory and game theory.

**Topic**: Study the behavior of learning algorithms in mortal population games, in the limit. Specifically, consider the problem statements from the linked comment:

- Are any/all of the fixed points attractors?
- What can be said about the size of the attraction basins?
- Do all Nash equilibria correspond to fixed points?
- Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

You can approach this theoretically (proving things) or experimentally (writing simulations). Specifically, it would be easiest to start from agents that follow fictitious play. You can then go on to more general Bayesian learners, other algorithms from the literature, or (on the experimental side) to using deep learning. Compare the convergence properties you get to those known in evolutionary game theory.

Notice that, due to the grain-of-truth problem, I intended to study this using non-Bayesian learning algorithms, but due to the ergodic-ish nature of the setting, Bayesian learning algorithms might perform well. But, if they perform poorly, this is still important to know.

**Strategies**: See my other answer.

What's a Decomposable Alignment Topic?

*The idea is an elaboration of a comment I made previously*.

**Motivation**: Improving the theoretical understanding of AGI by facilitating synthesis between algorithmic information theory and statistical learning theory.

**Topic**: Fix some reasonable encoding of communicating MDPs, and use this encoding to define : the Solomonoff-type prior over communicating MDPs. That is, the probability of a communicating MDP is proportional to where is the length of the shortest program producing the encoding of .

Consider CMDP-AIXI: the Bayes-optimal agent for . Morally speaking, we would like to prove that CMDP-AIXI (or any other policy) has a frequentist (i.e. per hypothesis ) non-anytime regret bound of the form , where is the time horizon^{[1]}, is a parameter such as MDP diameter, bias span or mixing time, , (this time is just a constant, *not* time discount). However, this precise result is probably impossible, because the Solomonoff prior falls off very slowly. *Warm-up*: Prove this!

Next, we need the concept of "sophisticated core", inspired by algorithmic statistics. Given a bit string , we consider the Kolmogorov complexity of . Then we consider pairs where is a program that halts on all inputs, is a bit string, and . Finally, we minimize over . The minimal is called the *sophistication* of . For our problem, we are interested in the minimal itself: I call it the "sophisticated core" of and denote it .

To any halting program we can associate the environment . We also define the prior by . and are "equivalent" in the sense that . However, they are not equivalent for the purpose of regret bounds.

*Challenge*: Investigate the conjecture that there is a (-dependent) policy satisfying the regret bound for every , or something similar.

**Strategy**: See the theoretical research part of my other answer.

I am using unnormalized regret and step-function time discount here to make the notation more standard, even though usually I prefer normalized regret and geometric time discount. ↩︎

What's a Decomposable Alignment Topic?

*The idea is an elaboration of a comment I made previously.*

**Motivation**: Improving our understanding of superrationality.

**Topic**: Investigate the following conjecture.

Consider two agents playing iterated prisoner's dilemma (IPD) with geometric time discount. It is well known that, for sufficiently large discount parameters (), essentially all outcomes of the normal form game become Nash equilibria (the folk theorem). In particular, cooperation can be achieved via the tit-for-tat strategy. However, defection is still a Nash equilibrium (and even a subgame perfect equilibrium).

Fix . Consider the following IPD variant: the first player is forced to play a strategy that can be represented by a finite state automaton of states, and the second player is forced to play a strategy that can be represented by a finite state automaton of states. For our purpose a "finite state automaton" consists of a set of states , the transition mapping and the "action mapping" . Here, tells you how to update your state after observing the opponent's last action, and tells you which action to take. Denote the resulting (normal form) game , where is the time discount parameter.

*Conjecture:* If then there are a functions and s.t. the following conditions hold:

- Any thermodynamic equilibrium of of temperature has the payoffs of up to .

**Strategies:** You could take two approaches: theoretical research and experimental research.

For theoretical research, you would try to prove or disprove the conjecture. If the initial conjecture is too hard, you can try to find easier variants (such as , or adding more constraints on the automaton). If you succeed proving the conjecture, you can go on to studying games other than prisoner's dilemma (for example, do we always converge to Pareto efficiency?) If you succeed in disproving the conjecture, you can go on to look for variants that survive (for example, assume or that the finite state automatons must not have irreversible transitions).

To decompose the task I propose: (i) have each person in the team think of ideas how to approach this (ii) brainstorm everyone's ideas together and select a subset of promising ideas (iii) distribute the promising ideas among people and/or take each promising idea and find multiple lemmas that different people can try proving.

Don't forget to check whether the literature has adjacent results. This also helps decomposing: the literature survey can be assigned to a subset of the team, and/or different people can search for different keywords / read different papers.

For experimental research, you would code an algorithm that computes the thermodynamic equilibria, and see how the payoffs behave as a function of and . Optimally, you would also provide a derivation of the error bounds on your results. To decompose the task, use the same strategy as in the theoretical case to come up with the algorithms and the code design. Afterwards, decompose it by having each person implement a segment of the code (pair programming is also an option).

It is also possible to go for theoretical *and* experimental simultaneously, by distributing among people and cross-fertilizing along the way.

What's a Decomposable Alignment Topic?

What sort of math background can we assume the group to have?

Mesa-Search vs Mesa-Control

Well, running a Turning machine for time can be simulated by a circuit of size , so in terms of efficiency it's much closer to "doing search" than to "memorizing the output of search".

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.

Problem 1:In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).Currently I only have speculations about the solution. But, I have a few desiderata for it:

Desideratum 1a:The algorithm should guaranteesomelower bound on expected utility, compared to what the Bayes-optimal policy gets. We should also have an upper bound for all polynomial time algorithms. The two bounds should not be too far apart.Desideratum 1b:When it so happens we have no traps, the algorithm should produce asymptotic Bayes optimality with a regret bound close enough to optimal. When there are only "small" traps, the penalty should be proportional.Problem 2:: In the presence of traps, there is no "frequentist" guarantee (regret bound). We can divide it into subproblems according to different motivations for having such a guarantee in the first place.Problem 2a:We want such a guarantee as a certificate of safety.Solution: Require a subjective regret bound instead.Problem 2b:The guarantee is motivated by an "evolutionary" perspective on intelligence: intelligent agents are agents that are successful in the real world, not just in average over all possible worlds.Solution:Bootstrapping from a safe baseline policy. For an individual human, the baseline comes from knowledge learned from other people. For human civilization, some of the baseline comes from inborn instincts. For human civilization and evolution both, the baseline comes from locality and thermodynamics: doing random things is unlikely to cause global irreversible damage. For an aligned AI, the baseline comes from imitation learning and quantilization.Problem 2c:The guarantee is needed to have a notion of "sample complexity", which is such an important concept that it's hard to imagine deconfusion without it.Solution:Some notion of sample complexity might come from the guarantee we expect in Desideratum 1a. But there is another perspective, interesting even without traps:A prior consists of a space H of hypotheses and a probability measure ζ over this space. We also have a mapping ρ:H→E where E is the space of environments, which provides semantics to the hypotheses. Bayes-optimizing ζ means Bayes-optimizing the environment ζ⋆:=Eh∼ζ[ρ(h)]. Learnability of ζ means that the Bayesian regret Rg(γ):=Eh∼ζ[V(ρ(h),γ)]−V(ζ⋆,γ) must converge to 0 as γ goes to 1. Here V(μ,γ) is the (normalized to [0,1]) value (maximal expected utility) of environment μ at time discount γ. Notice that the second term depends only on ζ⋆ but the first term depends on ζ and ρ. Therefore, we can ask about the regrets for different decompositions of the same ζ⋆ into hypotheses. For some H′, ζ′∈ΔH′ and ρ′:H′→E s.t. ζ⋆=Eh∼ζ′[ρ′(h)], we can have learnability even when we don't have it for the original decomposition. I think that typically there will be many such decompositions. They live in the convex set surrounding ζ⋆ in which the value function becomes affine in the γ→1 limit. We can say that not all information is learnable, but ζ′ represents

somelearnable information. We can then study the regret bound (and thus) sample complexity for a particular ζ′ or for all possible ζ′.