Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel. See also LinkedIn.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org


First, I clicked the link in the second poll[1]. My thought process looked as follows:

  • I quickly skimmed the content of the message
  • My split-second-judgement registered that there is a RACE
  • Moreover, the race is on very small time scales: every second of indecision might cost me victory!
  • Moreover, split-second-judgement estimates that winning the race is good-in-expectation (where "expectation" should be thought of as including the "logical uncertainty" resulting from having to rely on split-second-judgement).
  • Therefore, click NOW before it's too late!

Worse, even after clicking and reading the text again, I misunderstood its content. Somehow, I thought that this year's celebration will be determined by the plurality, whereas next year's will be determined by the fastest minority. This system is strange, but is not obviously defect-y, i.e. not obviously inferior to e.g. using plurality twice in a row, from behind the veil-of-ignorance.

Only after reading the OP and starting to compose this comment in my mind did I understand the actual meaning of the text in the second poll: that only next year's celebration is decided upon, and only according to a minority (if anyone in a minority clicks). Now, this is more or less clearly defect-y and in hindsight I don't endorse clicking it.

What is my take-away lesson? The process I used to make the decision seems correct to me: if you have to make a split-second decision, then you need to use your split-second judgement because there is nothing else to go by. There might be some case for a bias towards inaction, but it's not an overwhelming case. Personally, I know that I'm usually too slow to respond in emergency scenarios, so I don't want to train myself to prefer inaction.

The right way to optimize this is to train your split-second judgement to do well in the sort of situations in which split-second judgement is likely to be required. The sort of reasoning required of us here is not likely to be tied to a split-second decision anywhere outside of Petrov Day games[2], so I think my split-second judgement did as well as expected and there's nothing to correct.

[EDIT: Actually, there is a correction to be made here, and it refers to my wrong reading of the message after clicking the link. The lesson is: if I make a split-second decision, I need to carefully reexamine it after the fact, in order to understand its true consequences, and beware of anchoring on my split-second reasoning: this anchoring is probably motivated by wanting to justify myself later.]

Second, I think that going with the majority in this case is not honoring your word. You explicitly said "the first to do so out of any minority group". If you break your word and go with the majority, I won't completely lose my trust in you: but that's mostly because this is a game. In a situation with more serious stakes, I expect you to take the precise meaning of your promises way more seriously, and I would be extremely disappointed if you don't.

Third, I think this was a cool way to celebrate Petrov's Day (modulo the issue with breaking your word, which is really bad and must not be repeated). Kudos!

  1. ^

    My choice in the first poll was "accurately reporting your epistemic state".

  2. ^

    The actual Petrov had more time to make his decision, and also if I got Petrov's job I would train my fast-judgement on Petrov-like situations in advance.

I'm going to be in London October 14-17. If anyone will be in the vicinity and wants to meet, hit me up!

Maybe the Effective Altruist movement should accept people like you because they’re a big tent and they’re friendly and welcoming, but the rationalist community should be elitist and only accept people who say tsuyoku naritai...


This is a disturbing claim, although I realize that the author's opinions don't coincide with those of the "Alice" character. Personally, I'm not a utilitarian, nor do I want to be a utilitarian or think that I "should" be a utilitarian[1]. I do consider myself a person who is empathetic, honest and cooperative[2]. I hope this doesn't disqualify me from the rationalist community?

In general, I'm in favor of promoting societal norms which incentivize making the world better: such norms are obviously in everyone's interest. In this sense, I'm very sympathetic to effective altruism. However, these norms should still regard altruism as supererogatory: i.e., it should be rewarded and encouraged, but its lack should not be severely punished. The alternative is much too vulnerable to abuse.

  1. ^

    IMO utilitarianism is not even logically coherent, due to paradoxes with infinite ethics and Pascal's mugging.

  2. ^

    In the sense of, trying to act according to superrationality.

...the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)


It seems clear to me that the prior is subjective. Like with Solomonoff induction, I expect there to exist something like the right asymptotic for the prior (i.e. an equivalence class of priors under the equivalence relation where $\zeta_1$ and $\zeta_2$ are equivalent when there exists some $C > 0$ s.t. $\zeta_1 \leq C \zeta_2$ and $\zeta_2 \leq C \zeta_1$), but not a unique correct prior, just like there is no unique correct UTM. In fact, my arguments about IBH already rely on the asymptotic of the prior to some extent.

One way to view the non-uniqueness of the prior is through an evolutionary perspective: agents with prior $\zeta_1$ are likely to evolve/flourish in universes sampled from prior $\zeta_1$, while agents with prior $\zeta_2$ are likely to evolve/flourish in universes sampled from prior $\zeta_2$. No prior is superior across all universes: there's no free lunch.
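To make the notion of "same asymptotic" concrete, here is a minimal numerical sketch. It assumes the intended equivalence is the usual one for Solomonoff-style priors (each prior bounds the other up to a multiplicative constant). The three priors over the natural numbers (`zeta1`, `zeta2`, `zeta3`) and the helper `max_ratio` are hypothetical illustrations, and checking a finite prefix only gives evidence about boundedness, not a proof.

```python
from fractions import Fraction

def zeta1(n):
    """Geometric prior: mass 2^-(n+1) on hypothesis n."""
    return Fraction(1, 2 ** (n + 1))

def zeta2(n):
    """zeta1 with the first two masses swapped (a stand-in for 'a different UTM')."""
    if n == 0:
        return zeta1(1)
    if n == 1:
        return zeta1(0)
    return zeta1(n)

def zeta3(n):
    """Heavy-tailed prior: mass 1/((n+1)(n+2)), which also sums to 1."""
    return Fraction(1, (n + 1) * (n + 2))

def max_ratio(p, q, n_terms):
    """Largest p(n)/q(n) over the first n_terms hypotheses.

    If this stays bounded as n_terms grows, p <= C*q for some constant C.
    """
    return max(p(n) / q(n) for n in range(n_terms))

if __name__ == "__main__":
    for n_terms in (10, 20, 40):
        print(
            n_terms,
            max_ratio(zeta1, zeta2, n_terms),  # stays constant: same class
            max_ratio(zeta2, zeta1, n_terms),  # stays constant: same class
            float(max_ratio(zeta3, zeta1, n_terms)),  # keeps growing: different class
        )
```

In this toy, `zeta1` and `zeta2` bound each other with a fixed constant, so they share an asymptotic, while `zeta3` is not bounded by any constant multiple of `zeta1`.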

For the purpose of AI alignment, the solution is some combination of (i) learn the user's prior and (ii) choose some intuitively appealing measure of description complexity, e.g. length of lambda-term (i is insufficient in itself because you need some ur-prior to learn the user's prior). The claim is, different reasonable choices in ii will lead to similar results.

Given all that, I'm not sure what's still unsatisfying. Is there any reason to believe something is missing in this picture?

...I'm still comfortable sticking with "most are wide open".


Allow me to rephrase. The problems are open, that's fair enough. But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely." On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA, which has implications on those problems among other things, and so far this progress seems to only reinforce the idea that UDT is "morally" correct. That is, not that any of the old attempted formalizations of UDT is correct, but that the intuition behind UDT, and its recommendation in many specific scenarios, are largely justified.

ETA: Oh, I think you're saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?

While writing this part, I realized that some of my thinking about IBH was confused, and some of my previous claims were wrong. This is what happens when I'm overeager to share something half-baked. I apologize. In the following, I try to answer the question while also setting the record straight.

An IBH agent considers different infra-Bayesian hypotheses starting from the most optimistic ones (i.e. those that allow guaranteeing the most expected utility) and working its way down, until it finds something that works[1]. Such algorithms are known as "upper confidence bound" (UCB) in learning theory. When multiple IBH agents interact, they start with each trying to achieve its best possible payoff in the game[2], and gradually relax their demands, until some coalition reaches a payoff vector which is admissible for it to guarantee. This coalition then "locks" its strategy, while other agents continue lowering their demands until there is a new coalition among them, and so on.

Notice that the pace at which agents lower their demands might depend on their priors (by affecting how many hypotheses they have to cull at each level), their time discounts and maaaybe also other parameters of the learning algorithm. Some properties this process has:

  • Every agent always achieves at least its maximin payoff in the end. In particular, a zero-sum two-player game ends in a Nash equilibrium.
  • If there is a unique strongly Pareto-efficient payoff (such as in Hunting-the-Stag), the agents will converge there.
  • In a two-player game, if the agents are similar enough that it takes them about the same time to go from optimal payoff to maximin payoff, the outcome is strongly Pareto-efficient. For example, in a Prisoner's Dilemma they will converge to player A cooperating and player B cooperating some of the time and possibly defecting some of the time, such that player A's payoff is still strictly better than DD. However, without any similarity assumption, they might instead converge to an outcome where one player is doing its maximin strategy and the other its best response to that. In a Prisoner's Dilemma, that would be DD[3].
  • In a symmetric two-player game, with very similar agents (which might still have independent random generators), they will converge to the symmetric Pareto-efficient outcome. For example, in a Prisoner's Dilemma they will play CC, whereas in Chicken [version where flipping a coin is better than both swerving] they will "flip a coin" (e.g. alternate) to decide who goes straight and who swerves.
  • The previous bullet is not true with more than two players. There can be stochastic selection among several possible points of convergence, because there are games in which different mutually exclusive coalitions can form. Moreover, the outcome can fail to be Pareto efficient, even if the game is symmetric and the agents are identical (with independent random generators).
  • Specifically in Wei Dai's 3-player Prisoner's Dilemma, IBH among identical agents always produces CCC. IBH among arbitrarily different agents might produce CCD (if one player is very slow to lower its demands, while the other two lower their demands at the same, faster pace), or even DDD (if each of the players lowers its demands on its own very different timescale).
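The demand-lowering dynamic can be caricatured in a few lines. To be clear, this is a deliberately crude toy and not the IBH algorithm: it assumes a hypothetical two-player Prisoner's Dilemma payoff table, considers only pure outcomes (no mixing), and uses linear demand schedules. All names (`PAYOFFS`, `maximin`, `best_response`, `haggle`, `rates`) are made up for illustration. It only aims to show how the relative pace of lowering demands can separate CC from DD.

```python
from itertools import product

# Hypothetical Prisoner's Dilemma payoffs: (row action, column action) -> (row, column).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")

def payoff(player, own, other):
    """Payoff to `player` (0 = row, 1 = column) for its own action and the other's."""
    profile = (own, other) if player == 0 else (other, own)
    return PAYOFFS[profile][player]

def maximin(player):
    """Pure-strategy maximin as a (value, action) pair."""
    return max((min(payoff(player, a, b) for b in ACTIONS), a) for a in ACTIONS)

def best_response(player, other_action):
    """Pure best response of `player` to a fixed action of the other player."""
    return max(ACTIONS, key=lambda a: payoff(player, a, other_action))

def haggle(rates, steps=1000):
    """Toy haggling: rates[i] is how much player i lowers its demand per step."""
    best = [max(p[i] for p in PAYOFFS.values()) for i in range(2)]
    floor = [maximin(i) for i in range(2)]
    for t in range(steps):
        # Demands start at the best conceivable payoff and never drop below maximin.
        demand = [max(best[i] - rates[i] * t, floor[i][0]) for i in range(2)]
        # Grand coalition: lock any pure outcome that meets both current demands.
        for profile in product(ACTIONS, repeat=2):
            if all(PAYOFFS[profile][i] >= demand[i] for i in range(2)):
                return profile
        # Singleton coalition: a player whose demand has reached its maximin value
        # locks its maximin action, and the other player best-responds.
        for i in range(2):
            if demand[i] <= floor[i][0]:
                locked = floor[i][1]
                reply = best_response(1 - i, locked)
                return (locked, reply) if i == 0 else (reply, locked)
    return None

if __name__ == "__main__":
    print(haggle((1, 1)))   # both players lower demands at the same pace
    print(haggle((1, 10)))  # the column player lowers its demands much faster
```

With equal rates the two demands meet at the CC payoff before either player falls back to its maximin. With very unequal rates the faster player hits its maximin first and locks D, and the other player's best response is D, which matches the qualitative picture in the bullets above.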

We can operationalize "CDT agent" as e.g. a learning algorithm satisfying an internal regret bound (see sections 4.4 and 7.4 in Cesa-Bianchi and Lugosi) and the process of self-modification as learning on two different timescales: a slow outer loop that chooses a learning algorithm for a quick inner loop (this is simplistic, but IMO still instructive). Such an agent would indeed choose IBH over CDT if playing a Prisoner's Dilemma (and would prefer an IBH variant that lowers its demands slowly enough to get more of the gains-of-trade but quickly enough to actually converge), whereas in the 3-player Prisoner's Dilemma there is at least some IBH variant that would be no worse than CDT.

If all players have metalearning in the outer loop, then we get dynamics similar to Chicken [version in which both swerving is better than flipping a coin[4]], where hard-bargaining (slower to lower demands) IBH corresponds to "straight" and soft-bargaining (quick to lower demands) IBH corresponds to "swerve". Chicken [this version] between two identical IBH agents results in both swerving. Chicken between hard-IBH and soft-IBH results in hard-IBH getting a higher probability of going straight[5]. Chicken between two CDTs results in a correlated equilibrium, which might have some probability of crashing. Chicken between IBH and CDT... I'm actually not sure what happens off the top of my head, the analysis is not that trivial.
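The two versions of Chicken differ only in whether a fair coin flip over who goes straight beats both players swerving. A small sketch of that distinction, using two hypothetical payoff tables (the numbers and the names `CHICKEN_COIN_BETTER`, `CHICKEN_SWERVE_BETTER`, `coin_flip_value`, `both_swerve_value` and `chicken_version` are assumptions for illustration only):

```python
# Hypothetical payoff tables: (row action, column action) -> (row payoff, column payoff).
CHICKEN_COIN_BETTER = {
    ("straight", "straight"): (-10, -10),
    ("straight", "swerve"): (3, -1),
    ("swerve", "straight"): (-1, 3),
    ("swerve", "swerve"): (0, 0),
}
CHICKEN_SWERVE_BETTER = {
    ("straight", "straight"): (-10, -10),
    ("straight", "swerve"): (3, 0),
    ("swerve", "straight"): (0, 3),
    ("swerve", "swerve"): (2, 2),
}

def coin_flip_value(game):
    """Expected payoff per player when a fair coin decides who goes straight."""
    win, lose = game[("straight", "swerve")]
    return (win + lose) / 2

def both_swerve_value(game):
    """Payoff per player when both swerve (the tables are symmetric)."""
    return game[("swerve", "swerve")][0]

def chicken_version(game):
    """Classify a symmetric Chicken table by its symmetric Pareto-efficient outcome."""
    if coin_flip_value(game) > both_swerve_value(game):
        return "coin flip beats both swerving"
    return "both swerving beats coin flip"

if __name__ == "__main__":
    print(chicken_version(CHICKEN_COIN_BETTER))
    print(chicken_version(CHICKEN_SWERVE_BETTER))
```

In the second table the symmetric Pareto-efficient outcome is both swerving, which is the version referred to in the paragraph above.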


  1. ^

    This is pretty similar to "modal UDT" (going from optimistic to pessimistic outcomes until you find a proof that some action can guarantee that outcome). I think that the analogy can be made stronger if the modal agent uses an increasingly strong proof system during the search, which IIRC was also considered before. The strength of the proof system then plays the role of "logical time", and the pacing of increasing the strength is analogous to the (inverse function of the) temporal pacing in which an IBH agent lowers its target payoff.

  2. ^

    Assuming that they start out already knowing the rules of the game. Otherwise, they might start from trying to achieve payoffs which are impossible even with the cooperation of other players. So, this is a good model if learning the rules is much faster than learning anything to do with the behavior of other players, which seems like a reasonable assumption in many cases.

  3. ^

    It is not that surprising that two sufficiently dissimilar agents can defect. After all, the original argument for superrational cooperation was: "if the other agent is similar to you, then it cooperates iff you cooperate".

  4. ^

    I wish we had good names for the two versions of Chicken.

  5. ^

    This seems nicely reflectively consistent: soft/hard-IBH in the outer loop produces soft/hard-IBH respectively in the inner loop. However, two hard-IBH agents in the outer loop produce two soft-IBH agents in the inner loop. On the other hand, comparing absolute hardness between outer and inner loop seems not very meaningful, whereas comparing relative-between-players hardness between outer and inner loop is meaningful.

I'll start with Problem 4 because that's the one where I feel closest to the solution. In your 3-player Prisoner's Dilemma, infra-Bayesian hagglers[1] (IBH agents) don't necessarily play CCC. Depending on their priors, they might converge to CCC or CCD or some other Pareto-efficient outcome[2]. Naturally, if the first two agents have identical priors then e.g. DCC is impossible, but CCD is still possible. Whereas, if all 3 have the same prior they will necessarily converge to CCC. Moreover, there is no "best choice of prior": different choices do better in different situations.

You might think this non-uniqueness is evidence of some deficiency of the theory. However, I argue that it's unavoidable. For example, it's obvious that any sane decision theory will play "swerve" in a chicken game against a rock that says "straight". If there was an ideal decision theory X that led to a unique outcome in every game, the outcome of X playing chicken against X would be symmetric (e.g. flipping a shared coin to decide who goes straight and who swerves, which is indeed what happens for symmetric IBH[3]). This leads to the paradox that the rock is better than X in this case. Moreover, it should really be no surprise that different priors are incomparable, since this is the case even when considering a single learning agent: the higher a particular environment is in your prior, the better you will do on it.
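The rock argument can be spelled out as a short worked inequality. The symbols are assumptions introduced here for illustration: $a$ is the payoff for going straight against a swerver, $b$ is the payoff for swerving against someone going straight, and $c$ is the payoff for a crash, with $c < b < a$. The coin-flip outcome for X against X is the one mentioned in the text.

```latex
\[
\begin{aligned}
&\text{X against the rock: } b > c \;\Rightarrow\; \text{X swerves, so the rock gets } a, \\
&\text{X against X (shared fair coin): each copy gets } \tfrac{1}{2}(a + b), \\
&b < a \;\Rightarrow\; \tfrac{1}{2}(a + b) < a .
\end{aligned}
\]
```

So, against X, the rock obtains strictly more than X obtains against itself, which is the sense in which the rock "is better than X" in this game.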

Problems 1,3,6 are all related to infra-Bayesian physicalism (IBP).

For Problem 1, notice that IBP agents are already allowed some sort of "indexical" values. Indeed, in section 3 of the original article we describe agents that only care about their own observations. However, these agents are not truly purely indexical, because when multiple copies co-exist, they all value each other symmetrically. In itself, I don't think this implies the model doesn't describe human values. Indeed, it is always sensible to precommit to care about your copies, so to the extent you don't do it, it's a failure of rationality. The situation seems comparable with hyperbolic time discount: both are value disagreements between copies of you (in the time discount case, these are copies at different times, in the anthropic case, these are copies that co-exist in space). Such a value disagreement might be a true description of human psychology, but rational agents should be able to resolve it via internal negotiations, converging to a fully coherent agent.

However, IBP also seems to imply the monotonicity problem, which is a much more serious problem if we want the model to be applicable to humans. The main possible solutions I see are:

  1. Find some alternative bridge transform which is not downwards closed but still well-behaved and therefore doesn't imply a monotonicity principle. That wouldn't be terribly surprising, because we don't have an axiomatic derivation of the bridge transform yet: it's just the only natural object we found so far which satisfies all desiderata.
  2. Just admit humans are not IBP agents. Instead, we might model them e.g. as cartesian IBRL agents. Maybe there is a richer taxonomy of intermediate possibilities between pure cartesianism and pure physicalism. Notice that this doesn't mean UDT is completely inapplicable to humans: cartesian IBRL already shows UDT-ish behavior in learnable pseudocausal Newcombian problems and arguably multi-agent scenarios as well (IBH). Cartesian IBRL might depart from UDT in scenarios such as fully acausal trade (i.e. trading with worlds where the agent never existed).
    1. This possibility is not necessarily free of bizarre implications. I suspect that cartesian agents always end up believing in some sort of simulation hypothesis (due to reasons such as Christiano 2016). Arguably, they should ultimately converge to IBP-like behavior via trade with their simulators. What this looks like in humans, I dare not speculate.
  3. Swallow some bizarre philosophical bullet to reconcile human values with the monotonicity principle. The main example is to accept that worse-than-death qualia don't matter, or maybe don't exist (e.g. people that apparently experience them are temporarily zombies), and that among several copies of you, only the best-off copies matter. I don't like this solution at all, but I still feel compelled to keep a (very skeptical) eye on it for now.

For Problem 3, IBP agents have perfectly well-defined behavior in anthropic situations. The only "small" issue is that this behavior is quite bizarre. The implications depend, again, on how you deal with the monotonicity principle.

If we accept Solution 1 above, we might end up with a situation where anthropics devolves to preferences again. Indeed, that would be the case if we allowed arbitrary non-monotonic loss functions. However, it's possible that the alternative bridge transform would impose a different effective constraint on the loss function, which would solve anthropics in some well-defined way which is more palatable than monotonicity.

If we accept Solution 2, then anthropics seems at first glance "epiphenomenal": you can learn the correct anthropic theory empirically, by observing which copy you are, but the laws of physics don't necessarily dictate it. However, under 2a anthropics is dictated by the simulators, or by some process of bargaining with the simulators.

If we accept Solution 3... Well, then we just have to accept how IBP does anthropics off-the-bat.

For Problem 6, it again depends on the solution to monotonicity.

Under Solutions 1 & 3, we might posit that humans do have something like "access to source code" on the unconscious level. Indeed, it seems plausible that you have some intuitive notion of what kind of mind should be considered "you". Alternatively (or in addition), it's possible that there is a version of the IBP formalism which allows uncertainty over your own source code.

Under Solution 2 there is no problem: cartesian IBRL doesn't require access to your own source code.

  1. ^

    I'm saying "infra-Bayesian hagglers" rather than "infra-Bayesian agents" because I haven't yet nailed the natural conditions a learning-algorithm needs to satisfy to enable IBH. I know some examples that do, but e.g. just satisfying an IB regret bound is insufficient. But, this should be thought of as a placeholder for some (hopefully) naturalized agent desiderata.

  2. ^

    It's not always Pareto efficient, see child comment for more details.

  3. ^

    What if there is no shared coin? I claim that, effectively, there always is. In a repeated game, you can e.g. use the parity of time as the "coin". In a one-shot game, you can use the parity of logical time (which can be formalized using metacognitive IB agents).

The way I see it, all of these problems are reducible to (i) understanding what's up with the monotonicity principle in infra-Bayesian physicalism and (ii) completing a new and as yet unpublished research direction (working title: "infra-Bayesian haggling") which shows that IB agents converge to Pareto-efficient outcomes[1]. So, I wouldn't call them "wide open".

  1. ^

    Sometimes, but there are assumptions, see child comment for more details.

If a particular code doesn't talk about e.g. non-disparagement agreements, or talks about them in some confusing, unclear way, then people will notice. The point of having a template is precisely drawing attention to what is expected to be there (in particular when it's not there).

Also, I think we should really strive to be better than for-profit companies (see non-disparagement agreements again).

A thought inspired by this thread. Maybe we should have a standard template for a code of conduct for organizations, that we will encourage all rational-sphere and EA orgs to endorse. This template would include, never making people sign non-disparagement agreements (and maybe also forbidding other questionable practices that surfaced in recent scandals). Organizations would be encouraged to create their own codes based on the template and commit to them publicly (and maybe even in some legally binding manner). This flexibility means we don't need a 100% consensus about what has to be in the code, but also if e.g. a particular org decides to remove a particular clause, that will be publicly visible and salient.

First, I think that the theory of agents is a more useful starting point than metaphilosophy. Once we have a theory of agents, we can build models, within that theory, of agents reasoning about philosophical questions. Such models would be answers to special cases of metaphilosophy. I'm not sure we're going to have a coherent theory of "metaphilosophy" in general, distinct from the theory of agents, because I'm not sure that "philosophy" is an especially natural category[1].

Some examples of what that might look like:

  • An agent inventing a theory of agents in order to improve its own cognition is a special case of recursive metalearning (see my recent talk on metacognitive agents).
  • There might be theorems about convergence of learning systems to agents of a particular type (e.g. IBP agents), formalized using some brand of ADAM, in the spirit of John's Selection Theorems programme. This can be another model of agents discovering a theory of agents and becoming more coherent as a result (broader in terms of its notions of "agent" and "discovering" and narrower in terms of what the agent discovers).
  • An agent learning how to formalize some of its intuitive knowledge (e.g. about its own values) can be described in terms of metacognition, or more generally, the learning of some formal symbolic language. Indeed, understanding is translation, and formalizing intuitive knowledge means translating it from some internal opaque language to an external observable language.

Second, obviously in order to solve philosophical problems (such as the theory of agents), we need to implement a particular metaphilosophy. But I don't think it has to be extremely rigorous. (After all, if we tried to solve metaphilosophy instead, we would have the same problem.) My informal theory of metaphilosophy is something like: an answer to a philosophical question is good when it seems intuitive, logically consistent and parsimonious[2] after sufficient reflection (where "reflection" involves, among other things, considering special cases and other consequences of the answer, and also connecting the answer to empirical data).


  1. ^

    I think that philosophy just consists of all domains where we don't have consensus about some clear criteria of success. Once such a consensus forms, this domain is no longer considered philosophy. But the reasons some domains have this property at this point of time might be partly coincidental and not especially parsimonious.

  2. ^

    Circling back to the first point, what would a formalization of this within a theory of agents look like? "Parsimony" refers to a simplicity prior, "intuition" refers to opaque reasoning in the core of a metacognitive agent, and "logical consistency" is arguably some learned method of testing hypotheses (but maybe we will have a more elaborate theory of the latter).
