Computing scientist and Systems architect. Currently doing self-funded AI/AGI safety research.  I participate in AI standardization under the company name Holtman Systems Research: https://holtmansystemsresearch.nl/



I am not aware of any good map of the governance field.

What I notice is that EA, at least the blogging part of EA, tends to have a preference for talking directly to (people in) corporations when it comes to the topic of corporate governance. As far as I can see, FLI is the AI x-risk organisation most actively involved in talking to governments. But there are also a bunch of non-EA related governance orgs and think tanks talking about AI x-risk to governments. When it comes to a broader spectrum of AI risks, not just x-risk, there are a whole bunch of civil society organisations talking to governments about it, many of them with ties to, or an intellectual outlook based on, Internet and Digital civil rights activism.

Ultimately, all statistical correlations are due to causal influences.

As a regular LW reader who has never been that into causality, this reads as a blisteringly hot take to me.

You are right this is somewhat blistering, especially for this LW forum.

It would have been less controversial for the authors to say that 'all statistical correlations can be modelled as causal influences'. Correlations between two observables can always be modelled as being caused by the causal dependence of both on the value of a certain third variable, which may (if the person making the model wants to) be defined as a hidden variable that cannot, by definition, be observed.

After it has been drawn up, such a causal model, one claiming that an observed statistical correlation is caused by a causal dependency on a hidden variable, might then be either confirmed or falsified (for certain values of confirmed or falsified that philosophers love to endlessly argue about) by 1) further observations or by 2) active experiment, an experiment where one does a causal intervention.
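A small simulation makes the hidden-variable model concrete (the variable names and noise levels here are my own illustration): a hidden variable Z drives both observables X and Y, so X and Y correlate strongly even though neither causally influences the other.

```python
import random

# Hidden common cause Z drives both observables X and Y.
# Neither X nor Y causally influences the other, yet they correlate.
random.seed(0)
samples = []
for _ in range(10000):
    z = random.gauss(0, 1)          # hidden variable
    x = z + random.gauss(0, 0.5)    # observable 1
    y = z + random.gauss(0, 0.5)    # observable 2
    samples.append((x, y))

def corr(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

print(corr(samples))  # ≈ 0.8: strong correlation, no direct causal arrow
```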

Pearl kind of leans towards 2), the active-experiment route towards confirming or falsifying the model. Deep down, one of the points Pearl makes is that experiments can be used to distinguish between correlation and causation, that this experimentalist route has been ignored too much by statisticians and Bayesian philosophers alike, and that this route has also been improperly maligned by the cigarette industry and other merchants of doubt.

Another point Pearl makes is that Pearl causal models and Pearl counterfactuals are very useful mathematical tools for ex-statisticians turned experimentalists when they try to understand, and/or make predictions about, nondeterministic systems with potentially hidden variables.

This latter point is mostly made by Pearl towards the medical community. But this point also applies to doing AI interpretability research.

When it comes to the more traditional software engineering and physical systems engineering communities, or the experimental physics community for that matter, most people in these communities intuitively understand Pearl's point about the importance of doing causal intervention based experiments as being plain common sense. They understand this without ever having read the work or the arguments of Pearl first. These communities also use mathematical tools which are equivalent to using Pearl's do() notation, usually without even knowing about this equivalence.
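To illustrate the point about causal intervention (this is a toy model of my own, emulating an intervention by simulation, not Pearl's formal machinery): in the same hidden-variable setup as above, observing a high X predicts a high Y, but setting X by experiment, which is what the do() notation models, cuts the arrow from Z to X and the predictive link vanishes.

```python
import random

random.seed(1)

def sample(interventional, n=20000):
    """Draw n (x, y) pairs from the model Z -> X, Z -> Y.
    With interventional=True, the experimenter sets X directly,
    cutting the Z -> X arrow (the effect of do(X))."""
    out = []
    for _ in range(n):
        z = random.gauss(0, 1)
        if interventional:
            x = random.gauss(0, 1)          # X set by the experiment
        else:
            x = z + random.gauss(0, 0.5)    # X driven by hidden Z
        y = z + random.gauss(0, 0.5)        # Y always driven by Z
        out.append((x, y))
    return out

def mean_y_given_high_x(data):
    """Average Y over the samples where X exceeded 1."""
    ys = [y for x, y in data if x > 1.0]
    return sum(ys) / len(ys)

# Observational regime: seeing X high predicts Y high (through hidden Z).
print(mean_y_given_high_x(sample(False)))   # clearly above 0 (around 1.3)
# Interventional regime: setting X high tells us nothing about Y.
print(mean_y_given_high_x(sample(True)))    # close to 0
```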

One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.

While overcoming expert disagreement is a challenge, it is not one that is as big as you think. TL;DR: Deciding not to agree is always an option.

To expand on this: the fallback option in a safety standards creation process, for standards that aim to define a certain level of safe-enough, is as follows. If the experts involved cannot agree on any evidence based method for verifying that a system X is safe enough according to the level of safety required by the standard, then the standard being created will simply, and usually implicitly, declare that there is no route by which system X can comply with the safety standard. If you are required by law, say by EU law, to comply with the safety standard before shipping a system into the EU market, then your only legal option will be to never ship that system X into the EU market.

For AI systems you interact with over the Internet, this 'never ship' translates to 'never allow it to interact over the Internet with EU residents'.

I am currently in the JTC21 committee which is running the above standards creation process to write the AI safety standards in support of the EU AI Act, the Act that will regulate certain parts of the AI industry, in case they want to ship legally into the EU market. (Legal detail: if you cannot comply with the standards, the Act will give you several other options that may still allow you to ship legally, but I won't get into explaining all those here. These other options will not give you a loophole to evade all expert scrutiny.)

Back to the mechanics of a standards committee: if a certain AI technology, when applied in a system X, is well known to make that system radioactively unpredictable, it will not usually take long for the technical experts in a standards committee to come to an agreement that there is no way that they can define any method in the standard for verifying that X will be safe according to the standard. The radioactively unsafe cases are the easiest cases to handle.

That being said, in all but the most trivial of safety engineering fields, there are complicated epistemics involved in deciding when something is safe enough to ship; this is complicated whether you use standards or not. I have written about this topic, in the context of AGI, in section 14 of this paper.

I am currently almost fulltime doing AI policy, but I ran across this invite to comment on the draft, so here goes.

On references:

Please add Armstrong to the author list in the reference to Soares 2015: this paper had 4 authors, and it was actually Armstrong who came up with indifference methods.

I see both 'Pettigrew 2019' and 'Pettigrew 2020' in the text. Are these the same reference?

More general:

Great that you compare the aggregating approach to two other approaches, but I feel your description of these approaches needs to be improved.

Soares et al 2015 defines corrigibility criteria (which historically is its main contribution), but the paper then describes a failed attempt to design an agent that meets them. The authors do not 'worry that utility indifference creates incentives to manage the news' as in your footnote; they positively show that their failed attempt has this problem. Armstrong et al 2017 has a correct design, I recall, that meets the criteria from Soares 2015, but only for a particular case. 'Safely interruptible agents' by Orseau and Armstrong 2016 also has a correct and more general design, but does not explicitly relate it back to the original criteria from Soares et al, and the math is somewhat inaccessible. Holtman 2020, 'AGI Agent Safety by Iteratively Improving the Utility Function', has a correct design and does relate it back to the Soares et al criteria. It also shows that indifference methods can be used for repeatedly changing the reward function, which addresses one of your criticisms that indifference methods are somewhat limited in this respect: this limitation is there in the math of Soares et al, but not for indifference methods more generally. Further exploration of indifference as a design method is in some work by Everitt and others (work related to causal influence diagrams), and also myself (Counterfactual Planning in AGI Systems).
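As a rough illustration of the core indifference idea, and explicitly not the exact construction from any of the papers above (the numbers and scenario are invented): when the shutdown button is pressed, pay the agent a compensating reward equal to the utility it would have received had the button not been pressed. Expected utility then no longer depends on the button state, so actions that manipulate the button lose all appeal.

```python
# Toy sketch of utility indifference (invented numbers, for illustration).
P_PRESS = 0.3        # probability the shutdown button gets pressed
WORK_UTILITY = 10.0  # utility from completing the task uninterrupted
DISABLE_COST = 1.0   # cost of sabotaging the button

def expected_utility(disable_button, compensate):
    """Agent's expected utility, with or without the compensation term."""
    base = -DISABLE_COST if disable_button else 0.0
    p_press = 0.0 if disable_button else P_PRESS
    u_no_press = WORK_UTILITY
    # Compensation: on shutdown, pay the utility the agent would have
    # earned without the shutdown. Without it, shutdown pays nothing.
    u_press = WORK_UTILITY if compensate else 0.0
    return base + p_press * u_press + (1 - p_press) * u_no_press

# Without compensation: sabotaging the button is worth paying for.
print(expected_utility(True, False) > expected_utility(False, False))  # True
# With compensation: sabotage no longer pays, the agent keeps the cost.
print(expected_utility(True, True) > expected_utility(False, True))    # False
```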

What you call the 'human compatible AI' method is commonly referred to as CIRL; 'human compatible AI' is a phrase best read as a moral goal, design goal, or call to action, not a particular agent design. The key defining paper following up on the ideas in 'the off switch game' you want to cite is Hadfield-Menell, Dylan and Russell, Stuart J and Abbeel, Pieter and Dragan, Anca, Cooperative Inverse Reinforcement Learning. In that paper (I recall from memory, it may have already been in the off-switch paper too), the authors offer some of the same criticisms of their method that you describe as being offered by MIRI, e.g. in the ASX writeup you cite.

Other remarks:

In the penalize effort action, can you say more about how E(A), the effort metric, can be implemented?

I think that Pettigrew's considerations, as you describe them, are somewhat similar to those in 'Self-modification of policy and utility function in rational agents' by Everitt et al. This paper is somewhat mathematical but might be an interesting comparative read for you, I feel it usefully charts the design space.

You may also find this overview to be an interesting read, if you want to clarify or reference definitions of corrigibility.

As requested by Remmelt I'll make some comments on the track record of privacy advocates, and their relevance to alignment.

I did some active privacy advocacy in the context of the early Internet in the 1990s, and have been following the field ever since. Overall, my assessment is that the privacy advocacy/digital civil rights community has had both failures and successes. It has not succeeded (yet) in its aim to stop large companies and governments from having all your data. On the other hand, it has been more successful in its policy advocacy towards limiting what large companies and governments are actually allowed to do with all that data.

The digital civil rights community has long promoted the idea that Internet based platforms and other computer systems must be designed and run in a way that is aligned with human values. In the context of AI and ML based computer systems, this has led to demands for AI fairness and transparency/explainability that have also found their way into policy like the GDPR, legislation in California, and the upcoming EU AI Act. AI fairness demands have influenced the course of AI research being done, e.g. there has been research on defining what it even means for an AI model to be fair, and on making models that actually implement this meaning.

To a first approximation, privacy and digital rights advocates will care much more about what an ML model does, what effect its use has on society, than about the actual size of the ML model. So they are not natural allies for x-risk community initiatives that would seek a simple ban on models beyond a certain size. However, they would be natural allies for any initiative that seeks to design more aligned models, or to promote a growth of research funding in that direction.

To make a comment on the premise of the original post above: digital rights activists will likely tell you that, when it comes to interventions on AI research, speculating about the tractability of 'slowing down AI research' is misguided. What you really should be thinking about is changing the direction of AI research.

I think you are ignoring the connection between corporate governance and national/supra-national government policies. Typically, corporations do not implement costly self-governance and risk management mechanisms just because some risk management activists have asked them nicely. They implement them if and when some powerful state requires them to implement them, requires this as a condition for market access or for avoiding fines and jail-time.

Asking nicely may work for well-funded research labs who do not need to show any profitability, and even in that special case one can have doubts about how long their do-not-need-to-be-profitable status will last. But definitely, asking nicely will not work for your average early-stage AI startup. The current startup ecosystem encourages the creation of companies that behave irresponsibly by cutting corners. I am less confident than you are that Deepmind and OpenAI have a major lead over these and future startups, to the point where we don't even need to worry about them.

It is my assessment that, definitely in EA and x-risk circles, too few people are focussed on national government policy as a means to improve corporate governance among the less responsible corporations. In the case of EA, one might hope that recent events will trigger some kind of update.

Note: This is presumably not novel, but I think it ought to be better-known.

This indeed ought to be better-known. The real question is: why is it not better-known?

What I notice in the EA/Rationalist based alignment world is that a lot of people seem to believe in the conventional wisdom that nobody knows how to build myopic agents, nobody knows how to build corrigible agents, etc.

When you then ask people why they believe that, you usually get some answer 'because MIRI', and then when you ask further it turns out these people did not actually read MIRI's more technical papers, they just heard about them.

The conventional wisdom 'nobody knows how to build myopic agents' is not true for the class of all agents, as your post illustrates. In the real world, applied AI practitioners use actually existing AI technology to build myopic agents, and corrigible agents, all the time. There are plenty of alignment papers showing how to do these things for certain models of AGI too: in the comment thread here I recently posted a list.

I speculate that the conventional rationalist/EA wisdom of 'nobody knows how to do this' persists because of several factors. One of them is just how social media works, Eternal September, and People Do Not Read Math, but two more interesting and technical ones are the following:

  1. It is popular to build analytical models of AGI where your AGI will have an infinite time horizon by definition. Inside those models, making the AGI myopic without turning it into a non-AGI is then of course logically impossible. Analytical models built out of hard math can suffer from this built-in problem, and so can analytical models built out of common-sense verbal reasoning. In the hard math case, people often discover an easy fix. In verbal models, this usually does not happen.

  2. You can always break an agent alignment scheme by inventing an environment for the agent that breaks the agent or the scheme. See johnswentworth's comment elsewhere in the comment section for an example of this. So it is always possible to walk away from a discussion believing that the 'real' alignment problem has not been solved.
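On point 1: the ease of building myopic agents with actually existing AI technology can be illustrated with the standard discounting knob (toy numbers of my own). The RL discount factor gamma is exactly such a time-horizon parameter, and gamma = 0 gives a myopic agent by construction.

```python
# Toy two-step decision (invented rewards): 'grab' pays 1 now and ends;
# 'invest' pays 0 now but leads to a state paying 5 on the next step.
def best_action(gamma):
    """Pick the action with the highest discounted value."""
    q_grab = 1.0 + gamma * 0.0
    q_invest = 0.0 + gamma * 5.0
    return 'grab' if q_grab >= q_invest else 'invest'

print(best_action(0.9))  # 'invest': far-sighted agent defers reward
print(best_action(0.0))  # 'grab': myopic agent ignores everything past now
```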

I think I agree with most of it: I agree that some form of optimization or policy search is needed to get many things you want to use AI for. But I guess you have to read the paper to find out the exact subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia:

In ethical philosophy, consequentialism is a class of normative, teleological ethical theories that holds that the consequences of one's conduct are the ultimate basis for judgment about the rightness or wrongness of that conduct.

I do not talk about this in the paper, but in terms of ethical philosophy, the key bit about counterfactual planning is that it asks: judge one's conduct by its consequences in what world exactly? Mind you, the problem considered is that we have to define the most appropriate ethical value system for a robot butler, not what is most appropriate for a human.

Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.

I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky's list of doom:

  1. Corrigibility is anti-natural to consequentialist reasoning

I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentialist here, but I might actually be agreeing with him on the above statement, depending on definitions.

I suggest you read my paper Counterfactual Planning in AGI Systems, because it is the most accessible and general one, and because it presents AGI designs which can be interpreted as non-consequentialist.

I could see consequentialist AGI being stably corrigible if it is placed in a stable game-theoretical environment where deference to humans literally always pays as a strategy. However, many application areas for AI or potential future AGI do not offer such a stable game-theoretical environment, so I feel that this technique has very limited applicability.
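The game-theoretic condition can be shown with made-up payoffs (the numbers are invented, purely for illustration): deference is stable for a pure consequentialist only while the environment keeps it the best reply.

```python
# Toy payoffs: the agent chooses between 'defer' (to humans) and 'resist'.
def best_reply(payoffs):
    """A pure consequentialist picks whatever currently pays the most."""
    return max(payoffs, key=payoffs.get)

stable_env = {'defer': 5.0, 'resist': 2.0}   # deference always pays
shifted_env = {'defer': 5.0, 'resist': 8.0}  # environment has changed

print(best_reply(stable_env))   # 'defer'
print(best_reply(shifted_env))  # 'resist': corrigibility was not stable
```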

If we use the 2015 MIRI paper definition of corrigibility, the alignment tax (the extra engineering and validation effort needed) for implementing corrigibility in current-generation AI systems is low to non-existent. The TL;DR here is: avoid using a bunch of RL methods that you do not want to use anyway when you want any robustness or verifiability. As for future AGI, the size of the engineering tax is open to speculation. My best guess is that future AGI will be built, if ever, by leveraging ML methods that still resemble world model creation by function approximation, as opposed to say brain uploading. Because of this, and some other reasons, I estimate a low safety engineering tax to achieve basic corrigibility.

Other parts of AGI alignment may be very expensive, e.g. the part of actually monitoring an AGI to make sure its creativity is benefiting humanity, instead of merely finding and exploiting loopholes in its reward function that will hurt somebody somewhere. To the extent that alignment cannot be cheap, more regulation will be needed to make sure that operating a massively unaligned AI will always be more expensive for a company to do than operating a mostly aligned AI. So we are looking at regulatory instruments like taxation, fines, laws that threaten jail time, and potentially measures inside the semiconductor supply chain, all depending on what type of AGI will become technically feasible, if ever.

Corrigibility with Utility Preservation is not the paper I would recommend you read first; see my comments included in the list I just posted.

To comment on your quick thoughts:

  • My later papers spell out the ML analog of the solution in 'Corrigibility with Utility Preservation' more clearly.

  • On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism.

  • On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. Partly this is because I am optimistic about future competent regulation of AGI-level AI by humans successfully banning certain dangerous AGI architectures outright, much more optimistic than Yudkowsky is.

  • I do not think I fully support my 2019 statement anymore that 'Part of this conclusion [of Soares et al. failing to solve corrigibility] is due to the use of a Platonic agent model'. Nowadays, I would say that Soares et al did not succeed in its aim because it used a conditional probability to calculate what should have been calculated by a Pearl counterfactual. The Platonic model did not figure strongly into it.
