Neel Nanda

Comments

Neel Nanda · 2d

I'll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come in, the water is fine, don't worry, you won't end up with people criticizing you for maybe ending civilization or self-deceiving along the way, or calling you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in my interactions, and a route must be found where communicating such things (insofar as that's what someone believes) isn't going to destroy or end the coordination/trade agreement. Speaking the truth is not something to be traded away, however costly it may be.

I can't comment on Conjecture's coordination efforts specifically, but I fairly strongly disagree with this as a philosophy of coordination. There exist a lot of people in the world who have massive empirical or ethical disagreements with me that lead to them taking actions I think range from misguided to actively harmful to extremely dangerous. But I think that this often is either logical or understandable from their perspective. I think that being able to communicate productively with these people, see things from their point of view, and work towards common ground is a valuable skill, and an important part of the spirit of cooperation. For example, I think that Leah Garces's work cooperating with chicken farmers to reduce factory farming is admirable and worthwhile, and I imagine she isn't always frank and honest with people.

In particular, I think that being frank and honest in this context can basically kill possible cooperation. And good cooperation can lead to things being better by everyone's lights, so this is a large and important cost not worth taking lightly. Not everyone has to strive for cooperation, but I think it's very important that at least some people do! I do think that being so cooperative that you lose track of what you personally believe can be misguided and corrosive, but that there's a big difference between having clear internal beliefs and needing to express all of those beliefs.

Neel Nanda · 3d

Another reason for publishing more quickly is that conversations with many interpretability researchers have led us to believe that there is a wealth of knowledge in short experiments and unpublished research that really should be shared. We’d encourage other organizations who think similarly to post frequently, and share results even if they’re not completely polished.

Extremely strong +1! There is so much low-hanging fruit in mechanistic interpretability (of the flavour "I could probably spend a full-time day working on this and find something mildly cool worth writing up into a rough blog post"). I would love a wealth of these posts to exist that I could point people to and read myself! I've tried to set myself a much lower bar for this kind of write-up, and have still mostly procrastinated on it. I would love to see more of this.

EDIT: This is also a comparative advantage of being an org outside academia whose employees mostly aren't aiming for a future career in academia. I gather that under standard academic incentives, being scooped on your research makes the work much less impressive and publishable and can be bad for your career, disincentivising discussing partial results, especially in public. This seems pretty crippling to healthy and collaborative discourse, but it's also hard to fault people for following their incentives!

More generally, I really appreciate the reflective tone and candour of this post! I broadly agree with the main themes, including that I don't think Conjecture has really taken actions that cut at the hard core of alignment, and the reflections on both concrete-but-fixable mistakes and deeper, more difficult problems seem plausible to me. I look forward to seeing what you do next!

Nope. Though that one is pretty fair - arXiv must get a lot of submissions.

I'd recommend editing in a link to Ethan's comment at the top of the post - I think people could easily come away with a misleading impression otherwise.

Oh that's sketchy af lol. Thanks!

I do not think that that link is a helpful resource for figuring out the implications of the news right now. I would be very surprised if Bloomberg were that on the ball!

See my other comment - it turns out to be the boring fact that there's a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embeddings, it looks normal. (Though the fact that this offset exists is interesting! I wonder what that direction is used for?)

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

I mean that, as far as I can tell (medium confidence), attn0 in GPT-2 isn't used for much, and MLP0 contains most of the information about the value of the token at each position. E.g., ablating MLP0 completely kills performance, while ablating other MLPs doesn't. And generally the kinds of tasks that I'd expect to depend on tokens depend substantially on MLP0.
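
A rough sketch of the kind of ablation experiment I have in mind, using TransformerLens hook names (the prompt and the choice of zero-ablation are just illustrative, not exactly what I ran):

```python
# Sketch: zero-ablate each MLP layer's output in GPT-2 and compare losses.
# Hook names follow TransformerLens conventions; the prompt is an arbitrary example.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")

def zero_ablate(value, hook):
    # value has shape [batch, position, d_model]; replace the MLP output with zeros
    return torch.zeros_like(value)

clean_loss = model(tokens, return_type="loss")
for layer in range(model.cfg.n_layers):
    ablated_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(f"blocks.{layer}.hook_mlp_out", zero_ablate)],
    )
    print(f"MLP{layer}: clean loss {clean_loss:.3f}, ablated loss {ablated_loss:.3f}")
```

If MLP0 really is acting as an extended embedding, you'd expect the loss increase from ablating layer 0 to dwarf the increase from ablating any later MLP.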

Thanks for clarifying your position, that all makes sense.

I'd argue that most of the updating should have been done already, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.

Huh, can you say more about this? I'm not familiar with that example (though I have a fairly strong prior on there being at best a weak association between specific neuroscience results and specific AI interp results)

Just dug into it more - the GPT-Neo embeddings just have a large constant offset. The average norm is 11.4, and the norm of the mean is 11. The average cosine sim is 0.93 before subtracting the mean; after subtracting it's 0.0024 (average absolute value of cosine sim is 0.1831).
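
A minimal sketch of how one could check this (the specific GPT-Neo checkpoint and the number of sampled pairs here are assumptions for illustration, not exactly what I ran):

```python
# Sketch: look for a constant offset in the GPT-Neo token embeddings.
# Checkpoint name and pair-sample size are assumptions, not from the comment above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
W_E = model.get_input_embeddings().weight.detach()  # [d_vocab, d_model]

mean_vec = W_E.mean(dim=0)
print(f"avg embedding norm: {W_E.norm(dim=-1).mean().item():.2f}")
print(f"norm of mean embedding: {mean_vec.norm().item():.2f}")

def avg_cosine_sim(emb, n_pairs=100_000):
    # Estimate average pairwise cosine similarity by sampling random token pairs.
    i = torch.randint(0, emb.shape[0], (n_pairs,))
    j = torch.randint(0, emb.shape[0], (n_pairs,))
    sims = torch.nn.functional.cosine_similarity(emb[i], emb[j], dim=-1)
    return sims.mean().item(), sims.abs().mean().item()

raw_mean, raw_abs = avg_cosine_sim(W_E)
cen_mean, cen_abs = avg_cosine_sim(W_E - mean_vec)
print(f"avg cosine sim (raw): {raw_mean:.4f}, avg |cos sim|: {raw_abs:.4f}")
print(f"avg cosine sim (mean-subtracted): {cen_mean:.4f}, avg |cos sim|: {cen_abs:.4f}")
```

The tell-tale sign of a constant offset is that the norm of the mean embedding is comparable to the average embedding norm, and that the high average cosine sim collapses towards zero once the mean is subtracted.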
