I'm currently transitioning to full-time alignment research. DMs open :)
However, if the difference between humans and monkeys is mostly due to a one-shot discrete change (i.e. language), then that change cannot necessarily be repeated to get a similar gain in intelligence a second time.
Perhaps language is a zero-one property, i.e. language renders a mind "cognitively complete" in the sense that the mind can represent anything about the external world and make any inferences using those representations. But intelligence is not thereby zero-one, because intelligence also depends on continuous variables like computational speed, memory, etc.
More concretely, I am sceptical that "we end up with AI geniuses, but not AI gods", because running a genius at 10,000x speed, parallelised over 10,000x cores, with instantaneous access to the internet does (I think) make an AI god. A difference in quantity is a difference in kind.
That said, there might exist plausible threat models which require an AI that doesn't spatiotemporally decompose into less smart AIs. Could you sketch one out?
Thanks for the summary.
Sorry for any confusion. Meta tested only LIMA on their 30 safety prompts; the other LLMs weren't evaluated on them.
Figure 1 does not show the results from the 30 safety prompts, but instead the results of human evaluations on the 300 test prompts.
The way the authors phrase the Superficial Alignment Hypothesis is a bit vague, but they do offer a more concrete corollary:
If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.
Regardless of what exactly the authors mean by the Hypothesis, it would be falsified if the Corollary were false. And I'm arguing that the Corollary is false.
(1-6) The LIMA results are evidence against the Corollary, because the results Meta has released (post-filtering) are unusually bare (e.g. no benchmark tests) and not promising.
(7*) Here's a theoretical argument against the Corollary:
The authors don't clarify what "sufficiently" means in the Corollary, so perhaps they have much lower standards, e.g. it's sufficient if the model responds safely 80% of the time.
Nope, no mention of xrisk — which is fine because "alignment" means "the system does what the user/developer wanted", which is more general than xrisk mitigation.
But the paper's results suggest that finetuning is much worse than RLHF or Constitutional AI at this more general sense of "alignment", despite the claims in their conclusion.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;
I'm not sure how well this metric tracks what people care about, i.e. performance on particular downstream tasks (e.g. passing a law exam, writing bug-free code, automating alignment research, etc.).
Yep.
Specifically, it's named for the papers "HiPPO: Recurrent Memory with Optimal Polynomial Projections" and "How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections", which kicked off the whole state-space models (as an alternative to transformers) thing for sequence modelling.
"HiPPO" abbreviates higher-order polynomial projection operators.