SLES: High-Confidence Guarantees for Safe Reward and Policy Learning Under Uncertainty

Information

  • NSF Award
    2416761
  • Award Effective Date
    8/15/2024
  • Award Expiration Date
    7/31/2027
  • Award Amount
    $439,425.00
  • Award Instrument
    Standard Grant

Abstract

A prerequisite to making AI systems safe and reliable is to get them to do what we, as humans, want. The focus of this project is to enable the safe deployment of learning-enabled systems that learn objectives from human feedback and then robustly optimize their behavior under these learned objectives. What humans want is often highly ambiguous and uncertain, so we need AI systems that are robust to this uncertainty. However, most prior work on reward learning does not easily facilitate uncertainty assessment. The project's novel contributions are the first scalable learning methods that are robust to uncertainty, enable self-assessment, and provide basic test cases for assessing AI alignment with human values. The project's impacts are fundamentally new capabilities that will allow AI systems to safely learn models of human intent and enable humans to know with high confidence whether an AI system will behave correctly with respect to that intent. The broader impacts of making progress on safe and robust human-AI alignment include better domestic robots, recommendation systems, self-driving cars, delivery quadrotors, and large language models (LLMs). The project broadens participation in computing through educational outreach, including undergraduate research opportunities and K-12 summer AI camps.

The key observation in this project is that AI systems will always face uncertainty when seeking to identify human intent and values. Thus, there is a need for methods that explicitly reason about uncertainty and can provide probabilistic guarantees of robustness under this uncertainty. The project pursues three specific objectives that will enable safe and robust reward learning: (1) Probabilistic performance bounds when learning policies from human input: the project is developing approaches that allow humans to know with high confidence whether a policy, learned from human feedback on a reward function, achieves a desired performance threshold. (2) Unit tests for reward and policy alignment: the project is developing tests that verify with high confidence whether a learned reward function and the resulting behavior are correct. (3) Robustness to reward misidentification and misgeneralization: the project is developing techniques that penalize misaligned behavior during policy optimization so that the resulting behavior of the AI system does not lead to unintended consequences. The investigators are applying these techniques both to reward learning, to prevent reward hacking, and to reinforcement learning with a known reward function, to overcome the problem of goal misgeneralization.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
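To make objective (1) concrete, the following is a minimal sketch of one standard way to obtain such a bound, assuming a linear reward model and posterior samples over reward weights (for example, from Bayesian inverse reinforcement learning). All names here are illustrative assumptions, not the project's implementation: the bound is the empirical delta-quantile of the policy's return across the sampled rewards, so roughly a 1 - delta fraction of the posterior mass lies above it.

```python
import numpy as np

def high_confidence_return_bound(feature_counts, reward_samples, delta=0.05):
    """Lower-bound a policy's return with confidence 1 - delta.

    feature_counts: (d,) expected discounted feature counts of the
        evaluated policy (the sufficient statistic for linear rewards).
    reward_samples: (n, d) posterior samples of reward weights, e.g.
        drawn via Bayesian inverse reinforcement learning.
    """
    returns = reward_samples @ feature_counts  # return under each sampled reward
    return np.quantile(returns, delta)         # 1 - delta of posterior mass lies above

# Hypothetical usage: accept a policy only if its worst plausible
# performance clears a safety threshold.
rng = np.random.default_rng(0)
posterior = rng.normal(loc=[1.0, -0.5], scale=0.2, size=(10_000, 2))
policy_features = np.array([3.0, 1.0])
bound = high_confidence_return_bound(policy_features, posterior)
print(f"With 95% confidence, return >= {bound:.2f}")
```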
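Objective (2)'s alignment unit tests can be pictured as held-out checks that a learned reward function reproduces known human judgments. The sketch below (all names hypothetical, not the project's actual test suite) passes a learned reward only if it ranks held-out preference pairs the same way the human labeler did.

```python
import numpy as np

def reward_alignment_test(learned_reward, preference_pairs, min_accuracy=0.95):
    """Test whether a learned reward reproduces held-out human preferences.

    learned_reward: callable mapping a trajectory to a scalar reward.
    preference_pairs: list of (preferred, rejected) trajectory tuples
        labeled by a human.
    Returns (passed, accuracy).
    """
    agree = sum(
        learned_reward(better) > learned_reward(worse)
        for better, worse in preference_pairs
    )
    accuracy = agree / len(preference_pairs)
    return accuracy >= min_accuracy, accuracy

# Hypothetical usage with a linear reward over summed trajectory features.
w = np.array([1.0, -0.5])
reward = lambda traj: float(w @ traj.sum(axis=0))
pairs = [(np.ones((5, 2)), np.zeros((5, 2)))]
passed, acc = reward_alignment_test(reward, pairs, min_accuracy=1.0)
```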
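For objective (3), one common way to penalize misaligned behavior during policy optimization is pessimism under reward uncertainty: optimize the mean prediction of a reward ensemble minus a multiple of the ensemble's disagreement, which discourages the policy from exploiting states where the learned reward is unreliable (a typical route to reward hacking). A minimal sketch under that assumption, with illustrative names:

```python
import numpy as np

def pessimistic_reward(features, ensemble_weights, lam=1.0):
    """Uncertainty-penalized reward for robust policy optimization.

    features: (d,) features of a state-action pair.
    ensemble_weights: (k, d) weights of k independently trained linear
        reward models, a simple stand-in for a learned reward ensemble.
    lam: disagreement penalty; larger values give more conservative policies.
    """
    preds = ensemble_weights @ features      # each ensemble member's estimate
    return preds.mean() - lam * preds.std()  # mean reward minus disagreement
```

During RL training, this penalized signal would replace the raw learned reward, trading some nominal return for robustness to reward misidentification.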

  • Program Officer
    Pavithra Prabhakar (pprabhak@nsf.gov, 703-292-2585)
  • Min Amd Letter Date
    8/28/2024
  • Max Amd Letter Date
    8/28/2024

Institutions

  • Name
    University of Utah
  • City
    SALT LAKE CITY
  • State
    UT
  • Country
    United States
  • Address
    201 PRESIDENTS CIR
  • Postal Code
    84112-9049
  • Phone Number
    801-581-6903

Investigators

  • First Name
    Daniel
  • Last Name
    Brown
  • Email Address
    dsbrown@cs.utah.edu
  • Start Date
    8/28/2024

Program Element

  • Text
    AI-Safety