I don't get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT paper reports that overfitting led to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).
But IIRC, doesn't OA also mention that to get better results they had to add in continual training of the model on the original raw data? That's much like a KL penalty. (I don't recall the RL-training part of the Anthropic HH paper, just the prompted agent parts.)
You suggest that td-003 mode collapses where td-002 is perfectly capable. So you believe that both td-002 and td-003 mode collapse, in disjoint cases (given the examples from the original mode collapse post)?
The original Janus post only covers -002, not -003. So the claim that 002 was RLHF-trained came as a bit of a surprise to me, because at least working with samples, 003 seemed much more mode-collapsed than 002, and 002 much more mode-collapsed than davinci (although that might not be a fair comparison; I haven't worked with the code-davinci-002 model enough to have an opinion about it, so this is just my impression: in terms of diversity, davinci > 002 > 003). I didn't expect 003 to be much worse than 002 if they were trained about the same, just with updated datasets or something. I figured that perhaps 003 was just much 'more so', and this was the effect of training more, or something uninteresting like that. The later discussion of 002 being closer to instruction-tuning than RLHF seemed to resolve that anomaly for me: if 003 is both RLHF'd & instruction-tuned, then of course it might be much more collapsey than 002.
But 003 being more collapsey doesn't mean 002 isn't collapsed at all. It's just... less 'stubborn' about it? Yeah, it'll give you the mediocrity by default, but it's much easier to prompt or few-shot it into desired behavior than 003, so I barely even notice it.
My initial impression of the GPT-4 chat mode (which I have been using via the API/Playground rather than bothering with the ChatGPT interface*, and which is apparently RLHFed or something) is similar: I'm much happier with my poetry samples from GPT-4 because, while the default poetry output is again repugnantly reminiscent of ChatGPT's poetry (albeit much better, of course), I can easily prompt it to get what I want. It still seems to subtly keep falling into rhyming couplets if it can, but it's not incorrigible the way ChatGPT-3.5 is.
* I don't think this should make a difference but I mention it anyway.
This is an interesting attempt, but doesn't convince me that Janus is wrong about the phenomenon. There is a long history of RL having mode-collapse for intuitive reasons and needing hacks to stop it, the GPT-4 paper confirms that the RL version of the model can act drastically differently in terms of calibration, and the narrowness of ChatGPT/003 is incredibly striking: every single person who generates more than a few poems seems to remark on the obstinate refusal to generate the kinds of poems that 002 or davinci generate with the greatest of ease (complete with extreme hallucinations/lying about how it is not writing rhyming poetry in accordance with your explicit instructions, while continuing to do just that). The other night I spent a few hours playing with Japanese waka verse and the difference between 002 & 003 was night & day - I didn't get out a single usable poem from 003 because they were all hopelessly Hallmarkesque (although at least they didn't rhyme), but 002 had no issue varying topics or even switching forms entirely. It is difficult for me to not believe my lying eyes.
My takeaway is that we are probably seeing some issue with how it's being measured here, plus Janus might be 'right for the wrong reasons' in identifying the phenomenon (which is so obvious to anyone who uses the models) but the specific examples being coincidence or unreplicable. (The graphs could use improvement. As Jenner points out, the davinci comparison may not be relevant because that's from long ago. It would also be helpful to keep the colors consistent. I was confused until I realized the graphs were different.)
There may be something misleading about examining only total entropy, rather than something more meaningful: most of the effects of the RL policy will show up in the 'long run', in terms of the rare tokens that the model does or doesn't use. Sets of samples can be highly repetitive and redundant in a way which doesn't necessarily show up as the myopic next-token BPE entropy decreasing. When a mediocre chess player plays a chess grandmaster, he may have many equally good options, which nevertheless all wind up in the same place: victory. Consider Paul Christiano's wedding party example: it seems entirely possible to me that, token by token, the entropy of the model might be fairly similar to many other prompt/model combinations, and often exceed them (wacky hijinks happening along the way to the wedding party), and one would be unable to pick it out of graphs like these; but there is clearly some sort of 'lower entropy' going on when every story ends in a wedding party!
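The wedding-party point can be made concrete with a toy example: two processes with identical per-step ('next-token') branching, where one nevertheless funnels every trajectory into the same semantic ending. Entropy measured over surface strings is identical; entropy measured over a coarser 'topic' view is not. All the states and strings below are made up purely for illustration, not drawn from any actual model output:

```python
# Two toy story 'chains'. Each state offers 2 distinct continuations, so the
# myopic per-step entropy is 1 bit everywhere in both chains - but 'funnel'
# sends every trajectory to a wedding, with only the wording varying.
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of an empirical distribution of counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

funnel = {"start": ["hijinks", "chase"],
          "hijinks": ["joyful wedding", "tearful wedding"],
          "chase": ["sunny wedding", "rainy wedding"]}
diverse = {"start": ["hijinks", "chase"],
           "hijinks": ["joyful wedding", "grim funeral"],
           "chase": ["tense duel", "quiet picnic"]}

def endings(chain):
    """Enumerate all length-2 trajectories' final strings."""
    return [end for mid in chain["start"] for end in chain[mid]]

for name, chain in [("funnel", funnel), ("diverse", diverse)]:
    ends = endings(chain)
    surface = Counter(ends)                       # distinct wordings
    topics = Counter(e.split()[-1] for e in ends)  # crude 'semantic' bucket
    print(name, entropy(surface), entropy(topics))
# funnel:  surface entropy 2.0 bits, but topic entropy 0.0 (all weddings)
# diverse: surface entropy 2.0 bits, topic entropy 2.0
```

Token-level measurements see 2 bits of ending-diversity in both cases; only the coarser view exposes that every 'funnel' story ends at the wedding.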
Perhaps the entropy needs to be measured over longer sequences? Or measured by distance of embeddings in more of a 'topic modeling' view? (The embeddings might capture better what people mean by the samples all 'sound the same' or 'it won't do other kinds of poetry' or 'it only writes cheerful Hallmark-level pablum': they may be completely different at a token level, and entropy of tokens similar because the model has many alternative wording choices which nevertheless all end up in the same place, but they will be very close in embedding-space because of their lack of variety in semantics, presumably. So you could just feed the various samples into the OA embedding API, or whatever document-level embedding you prefer, and see how they differentially cluster.)
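A minimal sketch of that embedding-clustering check: embed each sample, then compare mean pairwise cosine similarity across sample sets - a mode-collapsed model's samples should sit closer together in embedding space even when token-level entropy looks normal. Here a toy bag-of-words vector stands in for a real document embedder (in practice you would swap in the OA embedding API or similar); the `embed` function and sample texts are illustrative assumptions, not anything from the post:

```python
# Diversity diagnostic: mean pairwise cosine similarity in embedding space.
# Higher similarity = samples semantically closer = less diverse.
from collections import Counter
import math

def embed(text):
    """Toy stand-in for a document embedder: L2-normalized bag-of-words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def cosine(u, v):
    return sum(u[w] * v.get(w, 0.0) for w in u)

def mean_pairwise_similarity(samples):
    vecs = [embed(s) for s in samples]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

# Samples that all 'end up in the same place' score higher than varied ones:
collapsed = ["the wedding party was joyful and bright",
             "a joyful bright wedding party began",
             "the bright joyful party was a wedding"]
varied = ["the wedding party was joyful and bright",
          "rain hammered the empty harbor at dusk",
          "the theorem follows from a counting argument"]
print(mean_pairwise_similarity(collapsed) > mean_pairwise_similarity(varied))
```

With a real embedding model the same comparison would capture 'sounds the same' at the semantic level rather than the vocabulary level, which is the interesting quantity here.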
This is also relevant to understanding why the genre of off-the-cuff tossoffs like "what if corporations are the real superintelligence" or "why can't we solve AGI alignment the same way we solved 'aligning corporations'?" are so wrong. Corporations are not superintelligences. They are, in fact, extremely stupid, much stupider than the sum of their parts, and subject to only the weakest forms of natural selection due to their inability to replicate themselves reliably despite the permanent existence of very large dispersion in efficiency/quality between corporations. They are so stupid that they cannot walk and chew bubblegum at the same time, and must choose, because they can only have 1 top priority at a time - and CEOs exist mostly to repeat the top priority that "we do X". They are simply very large and parallel, and the least-bad organizations human minds can form reliably at present given the blackbox of humans and inability to copy them.
Microtargeting groups has already been observed to cause preference drift over time leading to large consequences (e.g. Cambridge Analytica).
Reminder: no, it didn't, as you can see by scrolling the relevant WP section about impact (ie. consequences), not the irrelevant scandal section. We know from actual political advertising experiments their data was far too crude to make any impact (not that anything they said wasn't a tissue of advertising & lies), and they weren't even used by the Trump campaign!
Interesting tweet from Marcus 2 days ago:
There can be zero doubt that GPT-4 is better than GPT-3–but also I cannot imagine how we are supposed to achieve ethical and safety “alignment” with a system that cannot understand the word “third” even with billions of training examples.
There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days - as I said, this currently looks like 'DALL-E 1 but bigger' (VQVAE tokens -> token sequence -> autoregressive modeling of text/image tokens). What we have seen so far doesn't look like 3 years of progress by the best DL researchers.
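The 'DALL-E 1 but bigger' recipe reduces to a simple data layout: quantize the image into discrete VQ-VAE codebook indices, shift them into a vocabulary range above the text tokens, and concatenate, so a single autoregressive decoder models text-then-image as one token sequence. A sketch of just that layout (the vocabulary sizes and token values below are made-up placeholders, not details from any GPT-4 disclosure):

```python
# Combined text+image token sequence for a single autoregressive decoder.
TEXT_VOCAB = 50_000   # e.g. a BPE text vocabulary (placeholder size)
IMAGE_VOCAB = 8_192   # e.g. a VQ-VAE codebook (placeholder size)

def build_sequence(text_tokens, image_codes):
    """Offset image codes past the text vocabulary so the two ranges don't
    collide; the model's total vocabulary is TEXT_VOCAB + IMAGE_VOCAB."""
    assert all(0 <= t < TEXT_VOCAB for t in text_tokens)
    assert all(0 <= c < IMAGE_VOCAB for c in image_codes)
    return text_tokens + [TEXT_VOCAB + c for c in image_codes]

seq = build_sequence([17, 256, 9], [0, 4095, 8191])
print(seq)  # [17, 256, 9, 50000, 54095, 58191]
```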
The lack of GPT-4 in 2020-mid-2021 wasn't too surprising to me. They were busy productizing, optimizing, launching, and had no genuine competition. Everyone with a plausibly competitive model was not releasing it, and the ones which were available were not convincingly better. Why invest or release? Jurassic-1 in July 2021 was the first public API, but I never heard anyone call it noticeably better than davinci. Tick-tock...
What I find a little more curious is the absence of a successor in 2021-2022, and that it wasn't until August 2022 that GPT-4 finished training, with what sounds like about 6 months of training, so it hadn't even started until like February 2022. This is a bit odd. The vibe I had been getting, particularly from the Altman ACX meetup, was that the GPT-4 plans were considerably firmer than 'we'll start training GPT-4 for real in, idk, a year or whenever we get around to it, it's nbd'. Particularly with so much going on in scaling in general.
One working hypothesis I had was that they were planning on something much more impressive than GPT-4 is (or at least, appears to be), but that doesn't seem to be the case. GPT-4, as described, looks more or less like, 'what if the scaling hypothesis was true and then DALL-E 1 but way bigger and mostly just text?' Or to put it another way, what we see looks an awful lot like what you might've predicted in May 2020 as the immediate obvious followup, not what you might've predicted in February 2022 as the followup. That is, GPT-3 would've been finalized around like June 2019 & halted around Dec 2019 IIRC, and GPT-4 around February 2022 - but it just doesn't look like 3 years of advancement by that many researchers I respect that much, especially when the known published results elsewhere are so amazing. (Yes, Whisper, CLIP, DALL-E 2, GPT-f etc, but I'm not sure that's enough.) So I definitely feel like I am missing something in my understanding, but I'm unsure if it's some major advance hiding inside their pointed omission of all details or if there was some sort of major R&D mishap where they invested a lot of effort into a failed MoE approach, or what.
Is this a trick question? Obviously the blog posts. The em book (based heavily on blog drafts) had zero impact and is based on a vision of the future that recedes every day (if it's lucky, it might get revived in a few decades as period science fiction), and the co-authored Elephant book was just a popularization of the blog posts. The academic publications may look prestigious but let's be real: not 1 person in 100 majorly affected by Robin Hanson could tell you the name or date of his first prediction market article, what market-maker mechanism he invented (not that the cryptocurrency guys would've cared where he wrote it up), anything about combinatorial auctions (which still largely don't exist), what his theory of medical economics is, what his formal model of singularity economics says (you can tell because no one is bringing it up now despite being quite relevant to DL macro), and... I have no idea what the rest of those 100 papers might cover even though I really ought to because I've been reading all his stuff for almost 20 years now.