Today, there is an increasing need to run powerful Artificial Intelligence (AI) models on mobile phones. Many of the latest generation of AI models (including ChatGPT and Gemini) follow what is known as the transformer architecture. As has been observed when optimizing a variety of workloads on different computing devices, a class of optimizations related to the memory hierarchy is extremely important for the efficient execution of transformer-based models on modern mobile devices. This project is based on the premise that the features of these workloads and the characteristics of mobile devices require not only the application of existing techniques from the compiler literature but also the development of new methods. The project's novelties lie in considering this combination of workload and architecture and in proposing techniques for choosing new data layouts, removing redundant layout changes that slow down execution, performing memory allocation judiciously to improve performance, and dealing with the newest accelerators. The project's impacts are in helping bring the latest advances in AI to mobile and edge devices, letting these advances reach more individuals, and contributing to the compiler and runtime-support literature through the development of new methods.

In targeting memory hierarchy-related optimizations for transformers, we observe that, compared to the previous generation of deep learning-based models, transformers have more data-flow splits, shuffles, merges, and transpose/reshape(-like) operations. Thus, the various compilation systems targeting deep learning developed over the past decade fall short with respect to memory-related transformations, especially those requiring a global view of the problem. This project builds on the investigators' recent work developing a comprehensive framework for removing relayout operations and delivering significantly better performance for transformer models. Starting from this work, the following agenda is being undertaken: Performance (Cost) Models: a detailed performance model for execution on mobile GPUs is being developed, which will be especially novel in capturing the locality behavior of a 2.5D cache; Formal Approaches to Transformations: more formal approaches to the same set of optimizations (e.g., replacing a relayout operator) are being pursued, including both polyhedral formulations and computation-data graph-based approaches; Layout Transformation in View of New Instructions: as newer processors increasingly offer matrix (or tensor)-based instructions with their own specific data-layout requirements, memory-performance problems arising from these requirements are being addressed; and Memory Management for Dynamic Models: focusing on emerging dynamic models, problems of computation ordering, memory allocation, and memory fragmentation are being investigated. The investigators are working towards creating more synergy between the compiler research community (especially in memory/cache modeling and tuning) and the ML-model development community. The research on large-scale Machine Learning and Deep Learning transformation/implementation techniques will be incorporated into courses taught by the investigators.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
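
To give a concrete sense of the "removing redundant layout changes" idea mentioned above, the following is a minimal, self-contained Python sketch of a graph-rewrite pass that folds chains of transpose (relayout) operators and drops those that compose to the identity, the kind of pattern transformer graphs accumulate around attention and reshape operations. It is an illustration only, not the investigators' actual framework; all class and function names here are hypothetical.

```python
# Toy relayout-elimination sketch (hypothetical names; not the project's framework).
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Op:
    kind: str                                  # e.g., "input", "transpose", "matmul"
    perm: Optional[Tuple[int, ...]] = None     # permutation for transpose ops
    inputs: List["Op"] = field(default_factory=list)


def compose(p: Tuple[int, ...], q: Tuple[int, ...]) -> Tuple[int, ...]:
    """Permutation equivalent to applying transpose(q) first, then transpose(p)."""
    return tuple(q[i] for i in p)


def eliminate_redundant_transposes(op: Op) -> Op:
    """Fold adjacent transposes; remove pairs that compose to the identity."""
    op.inputs = [eliminate_redundant_transposes(i) for i in op.inputs]
    if op.kind == "transpose" and op.inputs and op.inputs[0].kind == "transpose":
        inner = op.inputs[0]
        fused = compose(op.perm, inner.perm)
        if fused == tuple(range(len(fused))):
            return inner.inputs[0]             # the two relayouts cancel entirely
        return Op("transpose", perm=fused, inputs=inner.inputs)
    return op


# Usage: transpose(transpose(x, (0, 2, 1, 3)), (0, 2, 1, 3)) collapses back to x.
x = Op("input")
g = Op("transpose", perm=(0, 2, 1, 3),
       inputs=[Op("transpose", perm=(0, 2, 1, 3), inputs=[x])])
assert eliminate_redundant_transposes(g) is x
```

A production system would, of course, also reason about layouts attached to producers and consumers (so that a relayout can be absorbed into a neighboring operator rather than merely cancelled), which is the global-view aspect the abstract emphasizes; this sketch only shows the simplest local cancellation case.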