A prerequisite to making AI systems safe and reliable is getting them to do what we, as humans, want. The focus of this project is to enable the safe deployment of learning-enabled systems that learn objectives from human feedback and then robustly optimize their behavior under these learned objectives. What humans want is often highly ambiguous and uncertain, so AI systems must be robust to this uncertainty; however, most prior work on reward learning does not easily facilitate uncertainty assessment. The project's novelties are to develop the first scalable learning methods that are robust to uncertainty, enable self-assessment, and provide basic test cases for assessing AI alignment with human values. The project's impacts are fundamentally new capabilities that will allow AI systems to safely learn models of human intent and enable humans to know with high confidence whether an AI system will behave correctly with respect to that intent. The broader impacts of making progress on safe and robust human-AI alignment include better domestic robots, recommendation systems, self-driving cars, delivery quadrotors, and large language models (LLMs). The project broadens participation in computing through undergraduate research opportunities and K-12 summer AI camps.

The key observation in this project is that AI systems will always face uncertainty when seeking to identify human intent and values. Thus, there is a need for methods that explicitly reason about uncertainty and can provide probabilistic guarantees of robustness under this uncertainty. The project is pursuing the following three specific objectives that will enable safe and robust reward learning: (1) Probabilistic performance bounds when learning policies from human input: the project is developing approaches that allow humans to know with high confidence whether a policy learned under a reward function inferred from human feedback achieves a desired performance threshold. (2) Unit tests for reward and policy alignment: the project is developing tests that verify with high confidence whether a learned reward function and the resulting behavior are correct. (3) Robustness to reward misidentification and misgeneralization: the project is developing techniques that penalize misaligned behavior during policy optimization so that the resulting behavior of the AI system does not lead to unintended consequences. The investigators are applying these techniques both to reward learning, to prevent reward hacking, and to reinforcement learning with a known reward function, to overcome the problem of goal misgeneralization.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
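
To make objective (1) concrete, here is a minimal illustrative sketch, not the project's actual method: it assumes a set of reward functions sampled from a posterior over rewards given human feedback (e.g., from a Bayesian reward-learning procedure) and rollouts of the learned policy, and it estimates a high-confidence lower bound on the policy's performance by taking a low quantile of the policy's expected return across the posterior samples. All names and interfaces below are hypothetical.

```python
import numpy as np

def high_confidence_return_bound(reward_samples, trajectories, delta=0.05):
    """Estimate a (1 - delta)-confidence lower bound on a policy's expected return.

    reward_samples: reward functions drawn from a posterior over rewards given
                    human feedback; each maps a (state, action) pair to a scalar.
    trajectories:   rollouts of the learned policy; each is a list of
                    (state, action) pairs.
    delta:          allowed failure probability (e.g., 0.05 for a 95% bound).
    """
    expected_returns = []
    for reward_fn in reward_samples:
        # Expected return of the policy under this posterior sample of the reward.
        returns = [sum(reward_fn(s, a) for (s, a) in traj) for traj in trajectories]
        expected_returns.append(np.mean(returns))
    # The delta-quantile of the returns over the reward posterior serves as a
    # high-confidence lower bound on the policy's true performance.
    return np.quantile(expected_returns, delta)

# Hypothetical usage:
# bound = high_confidence_return_bound(reward_samples, trajectories, delta=0.05)
# if bound >= desired_performance_threshold:
#     print("Policy meets the threshold with high confidence.")
```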
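
Similarly, a minimal sketch of the idea behind objective (3), under the assumption that misalignment risk is approximated by disagreement across an ensemble of learned reward models (the interface and names are hypothetical, not the project's specific technique):

```python
import numpy as np

def penalized_reward(state, action, reward_models, penalty_weight=1.0):
    """Pessimistic reward signal for policy optimization.

    reward_models:  an ensemble of learned reward functions, each mapping a
                    (state, action) pair to a scalar reward.
    penalty_weight: how strongly disagreement among the models is penalized.

    Returns the mean predicted reward minus a penalty proportional to the
    models' disagreement, discouraging the policy from exploiting states and
    actions where the learned reward is uncertain (one guard against reward hacking).
    """
    predictions = np.array([r(state, action) for r in reward_models])
    return predictions.mean() - penalty_weight * predictions.std()
```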