gwern

Comments

Zibbaldone With It All

Why can't any individual 'item' be shown with context like a dozen lines before/after (eg fading out)?

From GPT to AGI

I would expect that with increased model size it will be possible to increase the attention field by a lot without much need for additional AI insight.

It's not model size/parameters, it's the cost of the self-attention at runtime. The number of parameters needed to expand self-attention grows only linearly, but the runtime memory consumption goes up quadratically with window length. Even a GPT-2-117M can use up to ~300GB of RAM if you increase the window to 30k. You need more efficient attention or alternative architectures.
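To make the quadratic blow-up concrete, here's a rough back-of-the-envelope sketch (my own illustration; the 12-layer/12-head figures are GPT-2-117M's published config, and the totals are only estimates since real memory use depends on implementation details like checkpointing and precision):

```python
# Back-of-the-envelope: dense self-attention activation memory vs. context length.
# Rough estimates only; actual usage depends on the implementation.

def attention_matrix_bytes(context_len, n_layers=12, n_heads=12, bytes_per_float=4):
    # Each head materializes a (context_len x context_len) attention matrix per layer.
    return n_layers * n_heads * context_len ** 2 * bytes_per_float

for n in (1_024, 2_048, 30_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"context {n:>6,}: ~{gb:,.1f} GB just for the attention matrices")

# context  1,024: ~0.6 GB
# context  2,048: ~2.4 GB
# context 30,000: ~518.4 GB  (same ballpark as the ~300GB figure above; the
#                             quadratic term dominates while parameters barely change)
```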

interpreting GPT: the logit lens

I think this might suggest there is some fundamentally better way to do sampling from GPT models? I'm having trouble writing out the intuition clearly, so I'll leave it for later posts.

Unroll the sampling process: hook up all the individual GPT instances into a single long model, bypass the discretizing/embedding layers to make it differentiable end-to-end, and do gradient ascent to find the sequence which maximizes likelihood conditional on the fixed input.
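A minimal sketch of what that could look like with a HuggingFace GPT-2, assuming a softmax relaxation of the discrete tokens (the relaxation, objective, and optimizer settings here are my assumptions, not a tested recipe):

```python
# Sketch of "unrolled" differentiable sampling: relax the sampled tokens into soft
# distributions, feed the resulting soft embeddings through GPT-2, and do gradient
# ascent on the model's own likelihood of the continuation given a fixed prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():              # only the candidate sequence is optimized
    p.requires_grad_(False)

emb = model.transformer.wte.weight        # (vocab, dim) token embedding table
prompt_ids = tok("Once upon a time", return_tensors="pt").input_ids
prompt_emb = emb[prompt_ids[0]]           # fixed conditioning, not optimized
n_prompt, n_new, vocab = prompt_emb.shape[0], 20, emb.shape[0]

free_logits = torch.zeros(n_new, vocab, requires_grad=True)   # the "sequence" being optimized
opt = torch.optim.Adam([free_logits], lr=0.1)

for step in range(200):
    soft = torch.softmax(free_logits, dim=-1)      # soft one-hot tokens
    soft_emb = soft @ emb                          # differentiable "embedding lookup"
    inputs = torch.cat([prompt_emb, soft_emb], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    # Likelihood of each soft token under the model's prediction at the previous position.
    log_probs = torch.log_softmax(logits[n_prompt - 1:-1], dim=-1)
    loss = -(soft * log_probs).sum()               # minimize negative log-likelihood
    opt.zero_grad(); loss.backward(); opt.step()

print(tok.decode(free_logits.argmax(dim=-1)))      # discretize only at the end
```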

Zibbaldone With It All

I call this "anti-spaced repetition": the benefit is from surfacing connections for material you've forgotten (as opposed to reviewing material you still remember so as to strengthen retention). You can optimize time spent reviewing older material by using the spacing effect to estimate which things have been forgotten for the longest - same equation, just optimizing for something else.
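A minimal sketch of the scheduling idea, assuming a simple half-life forgetting model (both the model and the numbers are illustrative, not a claim about the right forgetting curve):

```python
# "Anti-spaced repetition": use the same forgetting-curve machinery as spaced
# repetition, but surface the items most likely *already* forgotten, rather than
# the ones about to cross the forgetting threshold. Half-lives here are made up.
import time

DAY = 86_400

def recall_probability(item, now=None):
    """Estimated P(recall) under a simple half-life forgetting curve."""
    elapsed_days = ((now or time.time()) - item["last_seen"]) / DAY
    return 0.5 ** (elapsed_days / item["half_life_days"])

notes = [
    {"title": "zibaldone entry on memory", "last_seen": time.time() - 400 * DAY, "half_life_days": 90},
    {"title": "draft essay outline",       "last_seen": time.time() -  30 * DAY, "half_life_days": 60},
    {"title": "old quote collection",      "last_seen": time.time() - 900 * DAY, "half_life_days": 120},
]

# Spaced repetition reviews items hovering near some recall threshold; this instead
# resurfaces the most thoroughly forgotten items first.
for note in sorted(notes, key=recall_probability)[:2]:
    print(f"resurface: {note['title']} (est. recall {recall_probability(note):.0%})")
```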

How hard would it be to change GPT-3 in a way that allows audio?
Answer by gwern, Aug 28, 2020

The stock GPT model isn't suitable for any kind of raw audio, because it uses dense attention, which works best at lengths of hundreds to thousands of tokens, while raw audio involves extremely long sequences of millions of tokens at the millisecond level. (A WAV may be scores of megabytes long; even a highly optimized lossy encoding like MP3 or Vorbis is still megabytes for most music.) If you tried, it'd be a failure, because 1024 or 2048 tokens would encode only a few milliseconds of audio at best, and it's impossible to meaningfully predict based on a few milliseconds; most sounds or phonemes or musical notes last far longer than that! You can use it for very high-level encodings like ABC notation or, if you brute-force it a bit, you can generate MIDI via ABC. See https://www.gwern.net/GPT-2-music. This will let you generate folk or instrumental-style music with a few instruments in a simple style. (Note the hack that iGPT resorts to, with pixel-encoding, to make even tiny 64px images workable with enormous compute: a 64×64 RGB image is a 'sequence' of length 64×64×3 = 12,288, which is well into the painful territory for dense GPT.)
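The length arithmetic behind this, written out (the 44.1kHz, one-token-per-sample framing is just one illustrative assumption for 'raw audio'; byte-level encodings of 16-bit samples would make the window cover even less time):

```python
# Naive raw-audio tokenizations are millions of tokens long, versus a 1k-2k
# dense-attention context window; iGPT's tiny images are already near the limit.
SAMPLE_RATE = 44_100                       # CD-quality samples per second

three_minute_song = 3 * 60 * SAMPLE_RATE   # ~7.9 million tokens
context_window = 2_048                     # dense GPT context
print(f"song: {three_minute_song:,} tokens; "
      f"window covers ~{context_window / SAMPLE_RATE * 1000:.0f} ms of audio")

igpt_tokens = 64 * 64 * 3                  # 64x64 RGB image -> 12,288 tokens
print(f"64x64 RGB image: {igpt_tokens:,} tokens")
```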

MuseNet goes one level below ABC by operating on a MIDI encoding of music. This requires shifting from dense attention to a more scalable attention, in its case, Sparse Transformers, which can handle lengths of tens of thousands with acceptable compute requirements & quality. MuseNet was better but still fairly limited. (Not raw audio, a few instruments, definitely no voices etc.)
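For intuition on why the sparse pattern scales, here's a toy count of attended positions for a strided sparse pattern versus dense attention (a rough illustration of the idea, not OpenAI's actual implementation or masks):

```python
# Sparse Transformers let each position attend to O(sqrt(n)) others (a local window
# plus a strided pattern) instead of all n, so cost grows ~n*sqrt(n) rather than n^2.
import math

def sparse_attention_entries(n):
    stride = int(math.sqrt(n))
    total = 0
    for i in range(n):
        local = set(range(max(0, i - stride), i + 1))      # recent positions
        strided = set(range(i % stride, i + 1, stride))    # every stride-th position
        total += len(local | strided)
    return total

for n in (1_024, 4_096, 16_384):
    print(f"n={n:>6,}: dense {n * n:>13,} vs sparse {sparse_attention_entries(n):>11,} entries")
```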

Jukebox operates at the raw audio level, and it does this by using much larger models (<10b parameters), conditioned on lyrics/artist metadata (from n~1m songs, IIRC), and a hybrid architecture: not just Sparse Transformers, but VQ-VAE-style codebooks providing discrete embeddings of the music style for more global consistency compared to a pure autoregressive token-by-token approach like GPT/MuseNet. Jukebox is extremely impressive: it generates raw audio, for most genres of music, in the style of specific artists, and it even learns to synthesize singing voices (!). It doesn't quite have the global coherency that GPT or MuseNet samples can achieve, like choruses, because I think its attention window is still de facto limited to something like 20 seconds, which limits learning & long-range coherency; but I think fixing that is just a matter of adding another layer to the hierarchy and maybe another order of magnitude of parameters, and that would fix much of the remaining quality gap.

Jukebox suggests that if you created a large enough model, you could probably dispense with the VAE part and just use pure Transformers.
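For the codebook ingredient specifically, here's a toy sketch of the vector-quantization step (shapes and sizes are illustrative, not Jukebox's actual configuration):

```python
# VQ-VAE-style quantization: snap continuous latents to their nearest codebook
# vectors, so the autoregressive prior models short sequences of discrete codes
# rather than raw audio samples.
import torch

codebook = torch.randn(512, 64)           # 512 discrete codes, 64-dim each
latents = torch.randn(2_048, 64)          # encoder output for a chunk of audio

dists = torch.cdist(latents, codebook)    # (2048, 512) pairwise distances
codes = dists.argmin(dim=-1)              # (2048,) integer code indices
quantized = codebook[codes]               # discrete embeddings fed downstream

print(codes[:10])                         # the token sequence a GPT-style prior would model
```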

Are We Right about How Effective Mockery Is?

Almost certainly. And of course, Scott doesn't mention "Richard Dawkins" (the example was /r/atheism) so no wonder my keyword search didn't turn it up... Oh well.

Are We Right about How Effective Mockery Is?

I vaguely remember, somewhere on LW or SSC probably, a discussion of Richard Dawkins and the New Atheists - as absurd and risible as their screeds and juvenile mockery of Judeochristian religion were, one could point out that it seemed to work on a lot of people. Apparently there were a lot of people who saw the mockery, saw that there were even semi-plausible arguments against religion, saw that you could be a perfectly successful normal person like Dawkins without being smitten by lightning (or ostracism) for saying that God doesn't exist, and that was enough to trigger deconversion; theism was taken so for granted in their local environment that they hadn't really thought about it meaningfully, or they had had growing doubts but no awareness of the debate, or thought they were alone, etc., and seeing the New Atheists be a visible thing mocking religion was enough.

On Suddenly Not Being Able to Work

Looks like a ceiling effect: a large fraction of turns just have easy or obvious moves for a player, which everyone is more than capable of solving near-perfectly (they hit the ceiling), so bad conditions don't much affect blunders (because the conditions aren't bad enough to pull an appreciable number of moves/players down below the ceiling and into making huge blunders), but the bad conditions do still affect the hard moves, and increase the errors in those.

(Imagine drawing a curve of move difficulty with a vertical line at the minimum skill necessary to compete in these. Everything to the left of it is an 'easy' move which all players solve, while to the right are the 'hard' moves where players are increasingly likely to make increasingly expensive mistakes. Bad conditions move the curve diagonally up-right: the vertical line stays put since the players don't change, and the number of moves which flip from 'easy' to 'hard' changes by a relatively small %, as only a few moves cross the line, but all the moves to the right of it become harder and the mistakes become increasingly expensive.)
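A toy simulation of that story (all distributions and numbers invented purely for illustration):

```python
# Ceiling-effect toy model: moves below the skill "ceiling" are solved perfectly;
# a handicap (bad conditions) shifts every move's effective difficulty up, which
# changes the count of imperfect moves only modestly but makes hard moves costlier.
import random

random.seed(0)
SKILL = 2.0                                      # the vertical "minimum skill" line

def expected_error(difficulty, handicap=0.0):
    d = difficulty + handicap                    # bad conditions raise effective difficulty
    return 0.0 if d <= SKILL else (d - SKILL) ** 2

moves = [random.uniform(0, 4) for _ in range(100_000)]   # a spread of move difficulties

for handicap in (0.0, 0.3):
    errs = [expected_error(m, handicap) for m in moves]
    blundered = sum(e > 0 for e in errs) / len(errs)
    print(f"handicap {handicap}: {blundered:.1%} of moves imperfect, "
          f"mean error {sum(errs) / len(errs):.2f}")
```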

"Assessing Human Error Against a Benchmark of Perfection", Anderson et al 2016, indicates that human GMs match the chess engine's predicted best move about half the time. This suggests that a lot of moves are 'solved' in the sense that either the move is very obvious (the opening book, forced moves), or the baseline of competency at GM level easily handles them - leaving only the other half of hard moves as critical moves which contribute to victory or defeat. Table A.1 seems to imply that ~55% of moves are classified as errors (15k/27k), so seems similar.

Forecasting Thread: AI Timelines

I was looking at the NIPS growth numbers last June and I made a joke:

AI researcher anthropics: 'researchers [should] tend to think AI is ~20 years away because given exponential growth of researchers & careers of ~30 years, the final generation of researchers will make up a majority of all researchers, hence, by SSA+Outside View, one must assume 20 years.'

(Of course, I'm making a rather carbon-chauvinistic assumption here that it's only human researchers/researcher-years which matter.)
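The back-of-the-envelope behind the joke, with an assumed doubling time (the 5-year figure and 30-year careers are illustrative, not the actual NIPS numbers):

```python
# If the researcher population doubles every few years, the cohort active in the
# final career-length window before AGI is nearly all researchers ever, so a
# "typical" researcher should, SSA-style, expect to be in it, part-way through.
DOUBLING_YEARS = 5       # assumed doubling time of the researcher population
CAREER_YEARS = 30        # assumed career length

# Cumulative researchers-ever grows ~2^(t/doubling); the final cohort's share:
final_cohort_share = 1 - 2 ** (-CAREER_YEARS / DOUBLING_YEARS)
print(f"final {CAREER_YEARS}-year cohort: {final_cohort_share:.1%} of all researchers ever")
# ~98%: a random researcher is almost surely in that cohort and on average sits
# roughly halfway through it, i.e. ~15-20 years before the end.
```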

SDM's Shortform

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

Entirely possible. Other people have mentioned that using any prompt (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:

Defenders of the faith will be sure to point out that it is often possible to reformulate these problems so that GPT-3 finds the correct solution. For instance, you can get GPT-3 to give the correct answer to the cranberry/grape juice problem if you give it the following long-winded frame as a prompt:
