Ruby

Team Lead for LessWrong

Sequences

LW Team Updates & Announcements
Novum Organum

Comments

The monthly Open & Welcome thread is a good general place for questions, otherwise you could make a top level question post about it.

Ruby · 9d · Ω24

The Alignment Forum is supposed to be a very high signal-to-noise place for Alignment content, where researchers can trust that all content they read will be material they're interested in seeing (even at the expense of some false negatives).

Answer by Ruby · Mar 13, 2023 · 30

Hi,

I'm afraid there isn't one at this time, unfortunately.

Curated. A parable explaining a probability lesson that many would benefit from – what's not to love? I like the format, I found the dialog/parable amusing rather than dry, and I think the point is valuable (and, due to the format, memorable). I'll confess that I think this post will have me looking at blends of different forecasts more carefully, especially as regards actual decision-making (particularly AI forecasts, which feel increasingly relevant to decision-making these days).

Ugh, is this your homework? I'll approve this in case someone feels like answering, but maybe try GPT/Bing.

Ruby · 15d · Ω712

The hard part now seems to me to be crafting some kind of useful standard, rather than one that in hindsight makes us go "well, that sure gave everyone a false sense of security".

I do like this post a lot; unfortunately, it doesn't seem to have resonated with many people.

I want to push back on anyone downvoting this because it's sexist, dehumanizing, and othering (rather than just being a bad model). I am sad if a model/analogy has those negative effects, but supposing the model/analogy in fact held and was informative, I think we should be able to discuss it. And even the mere possibility that something in the realm of gender relations has relevant lessons for Alignment seems like something we should be able to discuss.

Or alternatively stated, I want to push for Decoupling norms here.

Curated. This post feels timely and warranted given the current climate. I think we, in our community, were already at some risk of throwing out our minds a decade ago, but the risk was smaller when it was easy to think timelines were 30-60 years; that allowed more time for play. Now that there's so much evidence of imminence and there are more people doing more things, AI x-risk isn't a side interest for many but a full-time occupation, and I think we're almost colluding in creating a culture that doesn't allow time for play. I like that this post makes the case for pushing back.

Further, this post points at something I want to reclaim for the spirit of LessWrong, something I feel used to be more palpable than it is now. Random posts like There’s no such thing as a tree (phylogenetically) or random voting theory posts felt rooted in this kind of play – the raw interest and curiosity of the author rather than some urgent importance of the topic. The concerns that make me want to boost the default prominence of rationality and world modeling posts (see LW Filter Tags (Rationality/World Modeling now promoted in Latest Posts)) are not that I don't like the AI posts, but in large part that I want to see more of the playful posts of yore.

People should be thinking about:

  • If you truly get to choose your own work, is your judgment about what will help with alignment good? (This might be true for senior hires like evhub; I'm unsure about others getting to choose.)
  • If you are joining existing alignment teams, is their work actually good for reducing AI x-risk vs the opposite? For example, both OpenAI and Anthropic do some variant of RLHF, which is pretty controversial – as a prospective hire, have you formed a solid opinion on this question vs relying on the convenient answer that at least some people regard it as alignment?
  • What is the likelihood that you are asked/pressured to do different work that is net negative, or that your work is coopted in that direction? Perhaps RLHF is useful alignment research, but it also pushes on commercialization bottlenecks and fuels arms races between Google and Microsoft. That's a "second order" effect that you don't want to ignore. It takes a lot of courage to ignore pressure from a company providing you with your job once you've taken the role.
    • More generally, I don't think there's a hard line between alignment and capabilities. I expect (not that I'm that knowledgeable) that much alignment work (particularly interpretability) will fuel capabilities gains before it helps with alignment. I think anyone doing this work ought to think about it.
  • I have seen an abuser befriend people who are trusted and whom the abuser is nice to. This gives them credibility to harm others and then have rumors/accusations doubted, because hey, they're friends with such upstanding people. I worry about a kind of "safetywashing" where a company that is overall doing harm makes itself look better by putting out some genuinely good alignment work. The appearance of good alignment work maintains a good reputation, which allows for recruiting capable and valuable talent, getting more investment, etc.
    • I think this is a way in which one's work can locally be good but via pretty significant second order effects be very negative.

Personally, I think that if you're working with cutting-edge LLMs, you need to pass a high burden of proof/reasoning that this is good. Incentives like prestige, salary, and "meaning" mean one ought to question oneself pretty hard when doing the equivalent of entering the nuclear weapons or conventional arms manufacturing industries (especially during wartime).
