Feedback welcomed: www.admonymous.co/zeshen
Thanks for the explanation!
In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.
Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem?
In case anyone comes across this post trying to understand the field, Scott Aaronson did a better job than me of describing the "seed AI" and "prosaic AI" paradigms here, which he calls "Orthodox" vs "Reform".
I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?
This post makes sense to me, though it feels almost trivial. I'm puzzled by the backlash against consequentialism; it just feels like people are overreacting. Or maybe the 'backlash' isn't actually as strong as I'm reading it to be.
I'd think of virtue ethics as some sort of equilibrium that we as a society have landed ourselves in after all these years of being a species capable of thinking about ethics. It's not the best, but you'd need more than naive utilitarianism to beat it (this EA forum post feels like common sense to me too), and that something more is what you describe as reflective consequentialism. It seems like it all boils down to: be a consequentialist, as long as you 1) account for second-order and higher effects, and 2) account for bad calculation due to corrupted hardware.
Thanks - this helps.
Thanks for the reply!
But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for its search algorithm.
I'd be interested to see actual examples of this, if there are any. But also, how would this not be an objective robustness failure if we frame the objective as "maximize reward"?
if you perform Inverse Optimal Control on the behavior of the robot and derive a revealed reward function, you'll find that its
Do you mean to say that its reward function will be indistinguishable from its policy?
there doesn't seem to be a super principled way of dividing up capabilities and preferences in the first place.
Interesting paper, thanks! If a policy cannot be decomposed into a planning algorithm and a reward function anyway, it's unclear to me why 2D-robustness would be a better framing of robustness than just 1D-robustness.
Thanks for the example, but why is this a capabilities robustness problem and not an objective robustness problem, if we think of the objective as 'classify pandas accurately'?
I don't know how I even got here after so long but I really like this post. Looking forward to next year's post.
I'd love to see a post with your reasoning.