A general problem with 'interpretability' work like this, focused on unusual errors, and with old-fashioned Marcus-style criticisms like the 'horse-riding astronaut', is that they are generally vulnerable to a modus ponens/tollens reversal, which in the case of AI/statistics/ML we might call the Approximator's Counter:
> Any claim of a flaw in an approximator as compared to an idealized standard, which is not also accompanied by important real-world/decision-relevant performance degradation, may simply disprove the value of that idealized standard.
An illustration from Wittgenstein:
> If a contradiction were now actually found in arithmetic—that would only prove that an arithmetic with such a contradiction in it could render very good service; and it would be better for us to modify our concept of the certainty required, than to say it would really not yet have been a proper arithmetic.
In the case of reversal, why do we care?
Because 'it should be logically equivalent'? Except logic sucks. If logic were so great, we wouldn't be using LLMs in the first place, we'd be using GOFAI systems like Cyc. (Which, incidentally, turns out to be essentially fraudulent: there's nothing 'general' about it, and it has degenerated into nothing but thousands of extremely specialized hand-engineered problem-solvers and no longer even does general logical inference at all.) Or we would at least be getting more mileage out of 'hybrid' systems than we do... Logic systems are that guy in the stands yelling that he could've made the shot, while he's not even on the field. Logic systems are unscalable, their asymptotics typically so bad no one even writes them down, and they founder on the ambiguity and statistical relationships of the real world. There are no relationships in the real world which can be purely mathematically reversed, because there is always some prior or context or uncertainty which means that one formulation is not the same as the other. This is true even in natural language: if any logical relationship could be strictly true, equivalent in every way, and the statements indiscernible, it ought to be 'A is B'; and yet it isn't, because 'A is B' can often connote something completely different to a listener than the supposedly logically equivalent 'B is A'. A LLM which collapsed 'A is B' and 'B is A' into exactly the same internal representation would be lossy, not lossless, and wrong, not right.
Because it affects performance? Except the basic explanation concedes that this does not seem to matter for any of the actual real-world tasks that we use causal/decoder/unidirectional LLMs for, and it has to construct examples to test on. No one cares about Tom Cruise's mother in her own right and would ask 'who is her son?', and so the LLMs do not learn the reversal. If people did start caring about that, then it would show up in the training, and even 1 example will increasingly suffice (for memorization, if nothing else). If LLMs learn by 1-way lookups, maybe that's a feature and not a bug: a 2-way lookup is going to be that much harder to hardwire into neural circuitry, and when we demand that they learn certain logical properties, we're neglecting that we are not asking for something simple, but something very complex—it must learn this 2-way property only for the few classes of relationships where that is (approximately) correct. For every relationship 'A is B' where it's (approximately) true that 'B is A', there is another relationship 'A mothered B' where 'B mothered A' is (very likely but still not guaranteed to be) false.
And this is a general dilemma: if a problem+answer shows up at least occasionally in the real world / datasets proxying for the real world, then a mere approximator or memorizer can learn the pair, by definition; and if it doesn't show up occasionally, then it can't matter to performance and needs a good explanation why we should care. (If they cannot provide either real-world performance or a reason to care beyond a mere 'i liek logic', then they have merely refuted their idealized standard.)
An explanation might be: while they only show up once as individual datapoints, they show up as a 'class' which can be solved once and this class is common enough to be important as it harshly upper bounds how good our approximator can ever be. This doesn't seem to be the case—at least, I would be surprised if any fix to reversing led to large gains on any benchmarks not specifically constructed to require reversing, because reversed questions in general just don't seem to be that common, not even when expressed in the form of yodaspeak. (Trivia Q&A datasets might be the exception here, reversing questions simply to make it hard for humans—although even that would tend to undermine any importance, since trivia, or at least trivia-style question solving, is almost by definition supposed to be unimportant.)
Another possible response would be to invoke scaling 'hitting the wall': "sure, reversed questions aren't that common and haven't been important enough for LLMs to need to learn before this, as they had so much to learn for regular questions, and that's why it doesn't show up on benchmarks; but they've solved the easy questions now, and now the flaw of reversing is going to start showing up—soon you'll see the scaling exponents change, and the LLMs will flat-line, hobbled by their inability to handle the rare truly new problem requiring logical properties." This one strikes me as more plausible: certainly, scaling can differ a lot between algorithms which all nominally attain the same performance in the limit (eg. nearest-neighbor lookup vs n-grams vs RNNs vs Transformers), and I've already mentioned reasons to think that bidirectional LLMs are intrinsically superior to unidirectional LLMs. Of course, LLMs have been claimed to be about to 'hit the wall' any time now for the past 6 years, so a large gap here is unlikely... Pretraining including reversed data and running scaling law sweeps would test this.
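To make concrete what such a sweep would compare, here is a minimal sketch, with entirely made-up loss curves and FLOP counts standing in for real training runs (nothing here is measured data), of fitting scaling exponents for a standard pretraining run versus a reversal-augmented one; a persistent difference in the fitted exponent, rather than just the intercept, is the kind of signal this test would look for.

```python
import numpy as np

def fit_exponent(compute, loss):
    """Fit loss ~ a * compute^(-b) in log-log space and return the exponent b."""
    slope, _intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return -slope

# Hypothetical sweep: training FLOPs and measured losses for two model families.
# These numbers are placeholders for illustration, not real measurements.
compute = np.logspace(18, 22, 9)
loss_standard = 20.0 * compute ** -0.050   # standard causal-decoder pretraining
loss_reversed = 20.0 * compute ** -0.055   # half the documents reversed before training

print("standard exponent:          ", round(fit_exponent(compute, loss_standard), 3))
print("reversal-augmented exponent:", round(fit_exponent(compute, loss_reversed), 3))
```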
The complexity has been quite minimal. You npm install one executable, which you run on a HTML file in place, and it's done. After the npm install, it's fairly hassle-free; you don't even need to host the webfonts if you don't want to. We chose to for some additional speed. (It's not the size, but the latency: an equation here or there will pull in a few fonts which aren't that big, but loading a new domain and reflowing take time.) IIRC, over the, I dunno, 6 years that I've been using it, there has been only 1 actual bug due to mathjax-node-page: it broke a link in the navbox at the end of pages because the link had no anchor text (AFAICT), which I solved by just sticking in a ZERO WIDTH SPACE. All my other work related to it has been minor optimizations like rehosting the fonts, stripping a bit of unnecessary CSS, adding an optimization setting, etc. Considering how complicated this feature is, that's quite impressive reliability. Many much simpler features, which deliver far less value, screw up far more regularly than the static MathJax compilation feature does.
Anna: 60 plus 60 makes a hundred
- Anna: Papa, are onions fruit or dairy?
- Me: onions are vegetables
- Anna: Papa, I have a problem: [explains problem in detail]
- Me: Is it a problem you can fix?
- Anna: Yes! [Fixes problem]
- Anna: welcome to my restaurant! We have donuts, and we have jelly.
- Me: can I have a jelly donut?
- Anna: at this restaurant we do not combine foods!
Lily: I did kind of follow the recipe, I just didn't read it
What adorable little LLMs.
Another little LLM says:
if you had a hundred random people, and 50 of them went swimming in a tank with a large number of hungry sharks, and 50 of them went into a cage with a large number of hungry wolves, which group would do worse?
The group swimming with sharks would likely fare worse. Sharks, as aquatic predators, are more efficient at hunting in their environment than humans are at evading in water. Humans in water are more disadvantaged due to limited mobility and lack of natural underwater survival instincts. Wolves, while dangerous, are terrestrial creatures and humans have more options for evasion and defense on land.
I agree.
OK, I think I see what the argument here actually is. You have 2 implicit arguments. First: 'humans learn reversed relationships and are not fundamentally flawed; if NNs fundamentally learned as well as humans and were not fundamentally flawed and learned in a similar way, they would learn reversed relationships; NNs do not, therefore they do not learn as well as humans and are fundamentally flawed and do not learn in a similar way'. So a decoder LLM not handling reversed relationships implies a fundamental flaw. Then the second argument is, 'human brains do not learn using reversing; a NN learning as well as humans using reversing is still not learning like a human brain; therefore, it is fundamentally flawed', and the conjunction of the two is that either a LLM does worse than humans (and is flawed) or 'cheats' by using reversing (and is flawed), so it's flawed.
Barring much stronger evidence about humans failing to reverse, I can accept the first argument for now.
But if reversing text during pretraining, or the near-strict equivalent of simply switching between mask-all-but-last and mask-all-but-first prediction targets, fixes reversed relationships, then that second implicit argument does not seem to follow, because its premise is unsupported & doubtful.
We don't know how humans learn, and so for all we know, human brains doing self-supervised learning could be reversing. If human brains maintain any sort of equivalent of 'context', then they can be doing any arbitrary masking & predictive loss over that context (and probably are). It's unclear that this would even come at any additional compute cost. (I could see mixing in reversal targets as strictly superior to standard causal-decoder last-token-only prediction: there are diminishing returns to yet another last-token-only datapoint, it provides implicit data augmentation, and there is the empirical success of bidirectional models like UL2 which use more than just last-token-only objectives, as well as the recent argument that strict causal decoding is intrinsically inferior to bidirectional losses because the former is datapoint-by-datapoint online learning while the latter allows full SGD minibatch learning.)
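As a toy illustration of that 'near-strict equivalent', here is a sketch in plain Python (word lists standing in for real token tensors; the 50/50 mixing ratio and the example sentence are arbitrary assumptions) of switching between mask-all-but-last and mask-all-but-first prediction targets over the same context:

```python
import random

def make_training_pair(tokens, p_reverse=0.5):
    """Build one (inputs, target) pair for a causal decoder from a single context,
    randomly choosing between the standard 'mask-all-but-last' target (predict the
    final token from the prefix) and the reversed 'mask-all-but-first' target
    (predict the first token from the rest of the context, presented right-to-left)."""
    if random.random() < p_reverse:
        return list(reversed(tokens[1:])), tokens[0]   # backward prediction
    return tokens[:-1], tokens[-1]                     # forward prediction

context = ["Mary", "Lee", "Pfeiffer", "is", "Tom", "Cruise's", "mother"]
for _ in range(3):
    inputs, target = make_training_pair(context)
    print(inputs, "->", target)
```

Either target is just another differentiable next-token-style loss to the decoder; the only difference is which end of the context gets masked.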
So if reversal training really does fix the reversal problems, then all the reversal observations seem to show is that bidirectional models are smarter than unidirectional models (bracketing out concerns like training compute-efficiency) and are more brain-like, neither of which seems too controversial to me; and they suggest a minor tweak to LLM training (say, preprocessing half the data beforehand to reverse it), which makes this more an obscurely-framed capabilities result than anything particularly safety/interpretability-relevant.
I can't replicate this with my Ubuntu Linux/MATE/Firefox/Emacs setup. I get the whole equation no matter how I copy it.
(Note that there is one catch to the JS copy-paste listener: confusingly to contemporary users, X.org has multiple copy-paste buffers ('primary', 'secondary', and 'clipboard' selections), of which browsers apparently only allow web page JS to affect the 'clipboard' one. Since the browser doesn't cooperate, this cannot be fixed by the webpage. So if you copy-paste in X.org, depending on how you do it, you may get the intended $P(x_i) = \langle x_i, v \rangle$, or you may get that newline-after-every-character version that jefftk quotes. If you are unsure what is going on, you can investigate using the `xclip` utility, like `xclip -o -selection clipboard` vs `xclip -o -selection primary`.)
This seems like such an obvious question that I'm worried I'm missing something but... you phrase it as 'A to B doesn't cause B to A', and people are using examples like 'you can't recite the alphabet backwards as easily as you can forwards', and when I look at the list of 'different training setups', I see the very most obvious one not mentioned:
> It’s possible that a different training setup would avoid the Reversal Curse. We try different setups in an effort to help the model generalize. Nothing helps. Specifically, we try:
Why wouldn't simply 'reversing the text during pretraining' fix this for a causal decoder LLM? They only have a one-way flow because you set it up that way; there's certainly nothing intrinsic about the 'predict a token' objective which constrains you to causal decoding - you can mask and predict any darn pattern of any darn data you please, it's all differentiable and backpropable and a loss to minimize. Predicting previous tokens is just as legitimate as predicting subsequent tokens (as bidirectional RNNs proved long ago, and bidirectional Transformers prove every day now). If the problem is that the dataset is chock-full of statements like “Who won the Fields Medal in 1990? Ed Witten” and not “For what work did Ed Witten win a Fields Medal in 1990?”, then reversing the text would seem to reverse most of them and create the B->A versions. I mean, if I had spent as much time as a child singing the alphabet song backwards as I did singing it forwards, I expect that I would have little trouble going backwards in the alphabet as fluently as I do forwards!
(It's unclear to me that this would even come at much of an expense in pretraining if you reversed half the inputs at random, because it's still a powerful training signal.)
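To illustrate, here is a minimal sketch of that preprocessing step, assuming a toy word-level split in place of a real tokenizer and a tiny in-memory corpus (a real pipeline would reverse token IDs streamed from disk):

```python
import random

def reverse_half(documents, p=0.5, seed=0):
    """With probability p, reverse a document's token order before it is fed to an
    ordinary left-to-right causal decoder, so that 'A is B' statements also show up
    as (effectively) 'B is A' training signal."""
    rng = random.Random(seed)
    augmented = []
    for doc in documents:
        tokens = doc.split()          # word-level stand-in for a real tokenizer
        if rng.random() < p:
            tokens = tokens[::-1]
        augmented.append(" ".join(tokens))
    return augmented

corpus = ["Who won the Fields Medal in 1990? Ed Witten",
          "Mary Lee Pfeiffer is Tom Cruise's mother"]
print(reverse_half(corpus))
```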
> I don't disagree, but none of the things you pointed out are actually breakage as far as I can tell:
I didn't say they were. If you are 'skating on thin ice', you have by definition not fallen through and started to drown, because you can't skate and drown simultaneously. (At least, I can't.) My point is that you are engaged in sloppy coding practices, and so it's unsurprising that you are making mistakes like casually assuming that MathML can be copied around or would be compatible with random web applications, when you should know that the default assumption is that MathML will be broken everywhere and must be proven supported. That Internet math support is parlous is nothing new.
> because it only ever appears immediately to the right of something that has `margin-right: 1em`.
Until, of course, it doesn't, because you refactored or something, and hit a spot of particularly thin ice.
> but it's not really wrong to include it.
Not at all. (My site has a few instances of unnecessary `type` declarations not worth ripping out.) I merely quoted that for the nonce part, which did concern me. CSP is one of the most arcane and frustrating areas of web dev, and the less one has to do with it, the better. Leaving in anything to do with CSRF or CSP or framejacking is indeed tempting fate.
> As long as you verify that you're coding to a standard that's supported by the versions of the browsers you're trying to support, what sort of breakage are you thinking about?
Web dev is crack & AIDS. We run into problems all the time where we code to a standard and then it breaks in Chrome or Firefox.
The day before yesterday I discovered that when I added dropcaps to my essay on why cats knock things over, it looked fine in Chrome... and bad in Firefox, because they define 'first letter' differently for the opening word 'Q-tips'. (Firefox includes the hyphen in the "first letter", so the hyphen was getting blown up to the size of the drop cap!) My solution was to put a space and write it 'Q -tips'. Because we live in a world without a just and loving god and where standards exist to be honored in the breach.
Especially in Safari, which was created by a fallen demiurge in a twisted mockery of real browsers. Yesterday, Said had to fix a Safari-specific bug where the toggle bar breaks & vanishes on Safari. Worked fine everywhere else, coded against the standard... He also had to polyfill the standardized `crypto.randomUUID` (2021) for iOS.
And today Said removed the CSS-standardized-and-deployed-since-at-least-2015 property `box-decoration-break` and `-webkit-box-decoration-break` from Gwern.net because it breaks in Safari. ('webkit' = 'Safari', for the non-web-devs reading this. Yes, that's right, the Safari version breaks in Safari, on top of the standardized version breaking in Safari, for which the Safari version was supposed to be the fix. Good job, Apple! Maybe you can fix that after you get around to fixing your Gill Sans, which renders everything written in it full of random typos? And then make your browser hyphenation not suck?) He also had to remove `hanging-punctuation` due to its interaction with the link text-shadows on Safari, but arguably link text-shadows are a hack which `hanging-punctuation` shouldn't try to play well with, so that might be our fault.
I look forward to tomorrow. (That was sarcasm. If every day were like this, I would instead look forward to the sweet release of death.)
You can also use GPT to convert LaTeX to HTML/Unicode, incidentally. For simple inline expressions, this is very good. Like, there is not actually a need to use LaTeX or MathML to render <em>e</em><sup><em>i</em>π</sup>. That works fine in HTML+Unicode, and winds up looking better than an obtrusive MathML/LaTeX block, where even something as simple as $1$ winds up looking visibly alien and inserted.
Daring Fireball is a site one has primarily heard of for being an Apple/Mac shill, so perhaps not the best example of a website relying on OS-supplied fonts...
He means 'µ-Parametrization' ('µP') and is just being lazy, not spelling it out as 'mu-parametrization' in the standard ASCII form.