PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


Forecasting Thread: AI Timelines

Planned summary for the Alignment Newsletter:

This post collects forecasts of timelines until human-level AGI, and (at the time of this writing) has twelve such forecasts.

Model splintering: moving from one imperfect model to another

Planned summary for the Alignment Newsletter:

This post introduces the concept of _model splintering_, which seems to be an overarching problem underlying many other problems in AI safety. This is one way of more formally looking at the out-of-distribution problem in machine learning: instead of simply saying that we are out of distribution, we look at the model that the AI previously had, and see what model it transitions to in the new distribution, and analyze this transition.

Model splintering in particular refers to the phenomenon where a coarse-grained model is “splintered” into a more fine-grained model, with a one-to-many mapping between the environments that the coarse-grained model can distinguish between and the environments that the fine-grained model can distinguish between (this is what it means to be more fine-grained). For example, we may initially model all gases as ideal gases, defined by their pressure, volume and temperature. However, as we learn more, we may transition to the van der Waals equation, whose parameters differ from gas to gas, and so an environment like “1 liter of gas at standard temperature and pressure (STP)” now splinters into “1 liter of nitrogen at STP”, “1 liter of oxygen at STP”, etc.
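To make the gas example concrete, here is a minimal sketch (mine, not from the post) of how one coarse-grained state maps to several fine-grained ones. It assumes approximate textbook van der Waals constants for nitrogen and oxygen, and it uses 1 mol in 1 liter at 273.15 K rather than STP so that the difference between the gases is easy to see. The function and variable names are hypothetical.

```python
# Gas constant in L·bar/(mol·K).
R = 0.083145

# Approximate textbook van der Waals constants (assumed, illustrative values):
# a in L^2·bar/mol^2, b in L/mol.
VDW_CONSTANTS = {
    "nitrogen": (1.370, 0.0387),
    "oxygen": (1.382, 0.0319),
}


def ideal_pressure(n, volume, temperature):
    """Coarse-grained model: every gas obeys PV = nRT."""
    return n * R * temperature / volume


def vdw_pressure(n, volume, temperature, gas):
    """Fine-grained model: (P + a n^2 / V^2)(V - n b) = nRT, solved for P."""
    a, b = VDW_CONSTANTS[gas]
    return n * R * temperature / (volume - n * b) - a * n**2 / volume**2


# One state in the coarse-grained model...
coarse_state = dict(n=1.0, volume=1.0, temperature=273.15)
coarse = ideal_pressure(**coarse_state)

# ...splinters into one state per gas in the fine-grained model.
fine = {gas: vdw_pressure(gas=gas, **coarse_state) for gas in VDW_CONSTANTS}

print(f"ideal gas: {coarse:.3f} bar")
for gas, pressure in fine.items():
    print(f"{gas}: {pressure:.3f} bar")
```

The ideal gas model assigns a single pressure to the state, while the van der Waals model assigns a different pressure to each gas, which is the one-to-many mapping described above.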

Model splintering can also apply to reward functions: for example, in the past people might have had a reward function with a term for “honor”, but at this point the “honor” concept has splintered into several more specific ideas, and it is not clear how a reward for “honor” should generalize to these new concepts.

The hope is that by analyzing splintering and detecting when it happens, we can solve a whole host of problems. For example, we can use this as a way to detect if we are out of distribution. The full post lists several other examples.

Planned opinion:

I think that the problems of generalization and ambiguity out of distribution are extremely important and fundamental to AI alignment, so I’m glad to see work on them. It seems like model splintering could be a fruitful approach for those looking to take a more formal approach to these problems.

Forecasting Thread: AI Timelines

My snapshot:

Idk what we mean by "AGI", so I'm predicting when transformative AI will be developed instead. This is still a pretty fuzzy target: at what point do we say it's "transformative"? Does it have to be fully deployed and we already see the huge economic impact? Or is it just the point at which the model training is complete? I'm erring more on the side of "when the model training is complete", but also there may be lots of models contributing to TAI, in which case it's not clear which particular model we mean. Nonetheless, this feels a lot more concrete and specific than AGI.

Methodology: use a quantitative model, and then slightly change the prediction to account for important unmodeled factors. I expect to write about this model in a future newsletter.

How good is humanity at coordination?

Relevant evidence: survey about the impact of COVID on biorisk. I found the qualitative discussion far more useful than the summary table. I think overall the experts are a bit more pessimistic than would be predicted by my model, which is some evidence against my position (though I also think they are more optimistic than would be predicted by Buck's model). Note I'm primarily looking at what they said about natural biorisks, because I see COVID as a warning shot for natural pandemics but not necessarily deliberate ones.

(Similarly, on my model, warning shots of outer alignment failures don't help very much to guard against inner alignment failures.)

Mesa-Search vs Mesa-Control

Planned summary for the Alignment Newsletter:

This post discusses several topics related to mesa optimization, and the ideas in it led the author to update towards thinking inner alignment problems are quite likely to occur in practice. I’m not summarizing it in detail here because it’s written from a perspective on mesa optimization that I find difficult to inhabit. However, this perspective seems to be common, so I expect the typical reader would find the post useful.

Happy for others to propose a different summary for me to include. However, the summary will need to make sense to me; this may be a hard challenge for this post in particular.

Mesa-Search vs Mesa-Control

> I lean toward there being a meaningful distinction here: a system can learn a general-purpose learning algorithm, or it can 'merely' learn a very good conditional model.

Does human reasoning count as a general-purpose learning algorithm? I've heard it claimed that when we apply neural nets to tasks humans haven't been trained on (like understanding DNA or materials science) the neural nets can rocket past human understanding, with way less computation and tools (and maybe even data) than humans have had access to (depending on how you measure). Tbc, I find this claim believable but haven't checked it myself. Maybe SGD is the real general-purpose learning algorithm? Human reasoning could certainly be viewed formally as "a very good conditional model".

So overall I lean towards thinking this is a continuous spectrum with no discontinuous changes (except ones like "better than humans or not", which use a fixed reference point to get a discontinuity). So there could be a meaningful distinction, but it's like the meaningful distinction between "warm water" and "hot water", rather than the meaningful distinction between "water" and "ice".

Mesa-Search vs Mesa-Control

Random question: does this also update you towards "alignment problems will manifest in real systems well before they are powerful enough to take over the world"?

Context: I see this as a key claim for the (relative to MIRI) alignment-by-default perspective, and I expect many people at MIRI disagree with this claim (though I don't know why they disagree).

Universality Unwrapped

Planned summary for the Alignment Newsletter:

This post explains the ideas behind universality and ascription universality, in a more accessible way than the original posts and with more detail than my summary.

Matt Botvinick on the spontaneous emergence of learning algorithms

> Which is exactly why I asked you for recommendations.

Yes, I never said you shouldn't ask me for recommendations. I'm saying that I don't have any good recommendations to give, and you should probably ask other people for recommendations.

> showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.

In practice I find that anything I say tends to lose its nuance as it spreads, so I've moved towards saying fewer things that require nuance. If I said "X might be a good resource to learn from but I don't really know", I would only be a little surprised to hear a complaint in the future of the form "I deeply read X for two months because Rohin recommended it, but I still can't understand this deep RL paper".

If I actually were confident in some resource, I agree it would be more effective to mention it.

> I'm just confused because it seems low effort for you, net positive, and the kind of "ask people for recommendation" that you preach in the previous comment.

I'm not convinced the low effort version is net positive, for the reasons mentioned above. Note that I've already very weakly endorsed your mention of Sutton and Barto, and very weakly mentioned Spinning Up in Deep RL. (EDIT: TurnTrout doesn't endorse Sutton and Barto much, so now neither do I.)
