Collaborative Research: III: Medium: Advancing Large Language Model Unlearning: Foundations and Applications
Large language models (LLMs) are increasingly integrated into daily life, powering applications in education, healthcare, code generation, and more. However, these models can inadvertently memorize and reproduce sensitive or harmful content, including private user data, copyrighted material, and unsafe instructions. Retraining LLMs from scratch to remove such content is often impractical due to high cost and complexity. This project charts a new course toward more controllable, debuggable, and secure artificial intelligence (AI) through LLM unlearning, a paradigm that enables the targeted removal of harmful data influences and behaviors from pretrained models without compromising their overall performance. The research advances national priorities by promoting trustworthy AI, strengthening data privacy, ensuring safe deployment across sectors such as cybersecurity, healthcare, and education, and enabling contextually adaptive systems aligned with a wide range of social norms. The project also offers strong educational and outreach opportunities, including curriculum development, research dissemination through workshops, tutorials, publications, and open-source software, and the creation of inclusive mentoring programs.
This project aims to establish a comprehensive foundation for LLM unlearning by addressing challenges across four interconnected areas: optimization, model, data, and application. On the optimization front, it develops new algorithmic frameworks to enhance the effectiveness, robustness, and efficiency of LLM unlearning. At the model level, it investigates how internal components of LLMs contribute to memorization, introducing interpretability-driven approaches to identify and adjust influential weights without compromising essential capabilities such as LLMs' "emergent" abilities. On the data side, the project examines the role of watermarking and coreset selection in shaping unlearning outcomes, advancing methods for handling imperfect or proxy forget sets. These innovations are applied to privacy-sensitive scenarios such as conversational risk assessment in online dating, enabling LLMs to evaluate behavioral risks while erasing personal information. Conducted by a multidisciplinary team with a strong track record in trustworthy machine learning, the project is expected to deliver principled algorithms, practical tools, and rigorous benchmarks that advance responsible, adaptable, and secure AI systems.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Award amount: up to $266K
Keywords: machine learning, education, social science