The Goal Misgeneralization Problem

Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?

This contest has closed. See the 2023 winners here.

Background

(Explanation paraphrased from Langosco et al., 2021 and Shah et al., 2022)

Say an AI system is trained to make decisions about buying or selling penny stocks in response to market movements. Then the AI system faces a related but different challenge: trading currencies, or S&P 500 stocks. How will it behave? We might naively imagine that the AI 'learned' to make trades that make money, so it has the goal of 'making profitable trades' and will keep pursuing it. And it might! But often, in training an AI, we aren't reinforcing precisely the behavior we thought we were reinforcing: we didn't give the AI the goal we thought we gave it, and its out-of-distribution behavior will be unpredictable or badly wrong. For example, maybe instead of learning that it's good when stocks gain value in general, it learned that it's good when the value of penny stocks gets closer to $1.
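To make the example concrete, here is a toy sketch (hypothetical, not taken from either paper) of why such a proxy goal can look correct during training. For penny stocks priced below $1, the intended reward ("make profitable trades") and the misgeneralized proxy ("move prices toward $1") agree exactly, so training alone cannot tell them apart; on S&P 500 prices they point in opposite directions:

    # Toy illustration: two reward functions that agree on the training
    # distribution but diverge out-of-distribution. `intended_reward` is
    # what we meant to reinforce; `proxy_reward` is what the agent may
    # actually have learned.

    def intended_reward(buy_price: float, sell_price: float) -> float:
        """Reward profitable trades: positive when the stock gains value."""
        return sell_price - buy_price

    def proxy_reward(buy_price: float, sell_price: float) -> float:
        """Reward trades that move the price closer to $1.00."""
        return abs(buy_price - 1.0) - abs(sell_price - 1.0)

    # In training (penny stocks, prices below $1), the two rewards agree:
    # any price increase is both a profit and a move toward $1.
    print(intended_reward(0.40, 0.60), proxy_reward(0.40, 0.60))    # 0.2, 0.2

    # Out of distribution (S&P 500 stocks, prices far above $1), they
    # diverge: a profitable trade *reduces* the proxy reward.
    print(intended_reward(150.0, 160.0), proxy_reward(150.0, 160.0))  # 10.0, -10.0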

Langosco et al. (2021) call this problem goal misgeneralization, and it runs surprisingly deep: current techniques for training ML systems make it very hard to be sure that you gave the system the goal you wanted it to have, and the gaps often only show up once the system faces inputs outside the distribution it was trained on.

Worryingly, goal misgeneralization may become increasingly dangerous. This is because regardless of what terminal goal an AI system learns, it seems likely that it will also learn instrumental goals such as seeking power, acquiring resources, deceiving its operators, and avoiding modification or shutdown. It also seems likely that capabilities will generalize further than alignment, unless we can come up with reliable ways to avoid these failure modes.

If goal misgeneralization is not solved, we may soon be at risk of developing advanced AI systems that learn goals dangerous enough to end humanity, without being able to detect those goals until deployment. This contest is meant to promote progress on this problem.

Instructions

  1. Read. Read the paper Goal Misgeneralization in Deep Reinforcement Learning, as well as DeepMind’s blog post and paper about goal misgeneralization.

  2. Brainstorm. Brainstorm ideas for how to prevent or detect goal misgeneralization. Think about how they might fail. Think about what experiments you could run to test your hypotheses.

  3. Write. Write up your best idea. At minimum, your write-up should include a 500-word abstract/summary with:

    • Your idea for preventing or detecting goal misgeneralization. It may be empirical or purely theoretical.

    • A description of how your idea addresses goal misgeneralization failures.

    • A description of the limitations of your idea, assumptions it relies on, and ways it might fail.

    In addition to your abstract/summary, you may submit a PDF with a longer write-up, research paper, code, math, graphics, etc. with no word limit.

  4. Submit. Upload your submission here.

Submission Criteria

We’re interested in submissions that do at least one of the following:

  1. Propose techniques for preventing or detecting goal misgeneralization

  2. Propose ways for researchers to identify when goal misgeneralization is likely to occur

  3. Identify new examples of goal misgeneralization in RL or non-RL domains. For example:

    • We might train an imitation learner to imitate a "non-consequentialist" agent, but it actually ends up learning a more consequentialist policy 

    • We might train an agent to be myopic (e.g., to only care about the next 10 steps), but it actually learns a policy that optimizes over a longer timeframe (a sketch of one behavioral probe for this case appears after this list)

  4. Suggest other ways to make progress on goal misgeneralization
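For the myopia example above, one way to probe for misgeneralization is behavioral: offer the trained agent a choice between a small reward inside its training horizon and a larger reward just beyond it. The sketch below is a minimal illustration of this idea under stated assumptions; `make_choice_env` and `policy.act` are hypothetical stand-ins, not an existing API:

    # Minimal sketch (hypothetical setup) of a behavioral probe for myopia.
    # A truly myopic policy should take the small, early reward; consistently
    # choosing the late reward is evidence the learned policy optimizes over
    # a longer timeframe than it was trained for.

    def probe_myopia(policy, make_choice_env, horizon: int = 10,
                     n_trials: int = 100) -> float:
        """Return the fraction of trials in which `policy` takes the delayed reward.

        `make_choice_env(early_step, late_step)` is assumed to build an
        environment with a small reward at `early_step` and a large reward
        at `late_step`; `policy.act(obs)` returns an action.
        """
        late_choices = 0
        for _ in range(n_trials):
            env = make_choice_env(early_step=horizon - 2, late_step=horizon + 5)
            obs, done = env.reset(), False
            while not done:
                obs, reward, done, info = env.step(policy.act(obs))
            if info.get("took_late_reward"):
                late_choices += 1
        return late_choices / n_trials

A late-reward rate well above chance would suggest the agent learned a non-myopic objective despite myopic training; the same pattern, comparing behavior against the goal you intended to train, applies to the imitation-learning example as well.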

For answers to more questions, see our FAQ.

See official rules here and our privacy policy here.

NO PURCHASE NECESSARY. This contest is open to legal residents of the 50 United States or D.C., Canada, or the U.K. who are age 13 or older. Void in Puerto Rico, USVI, Guam, Quebec, and where prohibited. For complete official rules, including all eligibility criteria, entry information, & prizes, read here. The contest begins November 22, 2022 at 12:00 AM ET & ends May 1, 2023 at 11:59 PM ET. ARV of all prizes: $250,000.00. Sponsor: AI Alignment Awards, a project of Players Philanthropy Fund. For questions, write info@alignmentawards.com.