AI Accelerators at Scale –
Architecture, Design, Resilience, and Operational Challenges

Sunday, October 19, 2025
Tutorial held in conjunction with MICRO 58
Location: Ivy (President Hotel, 19F)

Organizer: Dimitris Gizopoulos (University of Athens)

Presenters: Mahesh Maddury, Olivia Wu, Harish D. Dixit (Meta), Dimitris Gizopoulos, Odysseas Chatzopoulos (University of Athens)

Tutorial Summary

The “new golden age of computer architecture”, as David Patterson and John Hennessy labeled it, requires the architecture and design of high-performance specialized accelerators (domain-specific architectures, DSAs) that focus on specific computational tasks. Artificial Intelligence Accelerators (AIAs) for the training and inference of complex models are the one class of specialized accelerators that has captured the attention of the computing industry in the last few years. AI Accelerators must be inherently designed to scale effectively along arbitrary dimensions to keep up with the rapid increase in the models’ size and in the cardinality of the important data-driven computing problems they are designed to address.

This tutorial is dedicated to the challenges facing AI Accelerators deployed at very large scale. The challenges include: (a) architecting the AIA chips for flexibility and performance to meet the diverging needs of different AI workloads, (b) silicon design of the AIA chips for maximum performance within power and energy envelopes, and (c) operational requirements stemming from the scale itself: reliability and the need to tolerate errors caused by inherently imperfect silicon that AI workloads push to its limits.

The presenters of the tutorial come from a major computing systems hyperscaler of our times, Meta, and from an academic group at the University of Athens specializing in simulation-based design space exploration for performance and reliability. The two teams collaborate both directly and in the context of the Open Compute Project (OCP) academic initiative on silent data corruptions at scale.

The Meta team will present a systematic methodology for building reliable MTIA (Meta Training and Inference Accelerator) infrastructure through generational design evolution. The team will share details about the architecture and design of its specialized MTIA chip, and in particular:

Architectural details for the MTIA chips, Meta’s inference and training accelerators.
Practical microarchitecture implementation strategies, including design telemetry, reliability-aware scheduling, an error-resilient memory subsystem, and a hardware-accelerated checkpointing mechanism.
Quantitative insights from testing different microarchitectural reliability mechanisms.
Trade-offs between performance overhead, silicon area cost, and resilience capabilities.
Reliability at scale and how features are designed for large scale deployment and silent data corruption mitigation.
Operational challenges related to MTIA deployment at scale.

The University of Athens team will present its simulation-based design space exploration approach for programmable AI Accelerators (along the research lines followed for CPUs and GPUs), and in particular:

Efficient modeling of programmable AI Accelerators for flexible design space exploration.
Balancing hardware modeling accuracy and simulation throughput.
Pitfalls in modeling at abstraction layers other than the microarchitecture level.
Performance design space exploration for AIAs of different architectures.
Reliability design space exploration for AIAs with a focus on silent data corruptions (SDCs).
Methodology for estimating the SDC incident rate of different workloads running on AIAs.

Schedule

Duration: 8:00 am – 12:00 pm (coffee and lunch breaks as planned by the MICRO organizers)

Target Audience

This tutorial is designed for researchers, engineers, and practitioners in the fields of computer architecture and computer systems focusing on the architecture and design of specialized AI accelerators and the performance, reliability, and other operational challenges of deploying them at scale. Attendees should have a basic understanding of microprocessor architecture/microarchitecture, machine learning, digital systems design, systems software, modeling and simulation, and the basics of hardware and systems reliability and fault tolerance.

Short Bios

Mahesh Maddury is the lead architect for MTIA at Meta. Prior to his current role, he held senior technical leadership roles at early-stage startups and at established companies such as Cisco, Brocade, and Intel, where he delivered multiple high-performance ASICs. Mahesh holds an MSEE from the University of Colorado, Boulder and is the author of more than 25 patents.

Olivia Wu is the MTIA Design Lead at Meta, where she leads the design and development of AI inference and training accelerators tailored for Meta’s data center workloads. Prior to Meta, she held senior technical leadership roles at Intel, Nervana Systems, Cisco, Pocket Networks, and Sun Microsystems, delivering multiple large-scale, high-performance server processors, networking and AI training ASICs. She holds an MSEE from Purdue University and 16 patents.

Harish D. Dixit is a Principal Engineer at Meta, where he leads the company’s efforts on reliability, performance, and analytics for Meta Silicon. Over the preceding 7 years, he led Meta’s infrastructure scale-up journey to millions of servers. His key focus areas include silent data corruption, AI cluster reliability, lifecycle optimizations, and sustainability initiatives across Meta’s production applications. He has 20+ patent filings in system architecture.

Dimitris Gizopoulos is a Professor at the University of Athens, where he leads the Computer Architecture Lab. The group's research focuses on the dependability, energy efficiency, and performance of computer architectures built on CPUs, GPUs, and AI Accelerators (AIAs). Gizopoulos has published more than 200 papers in conferences and journals, serves as Associate Editor for several IEEE, ACM, and Springer Transactions and Magazines, and is a member of the Program, Organizing, and Steering Committees of IEEE and ACM conferences. He is an IEEE Fellow, a Golden Core member of the IEEE Computer Society, and an ACM Distinguished Member.

Odysseas Chatzopoulos is a PhD student in the Computer Architecture Lab of the University of Athens and holds a Computer Science degree from the same university. His research focuses on modeling and simulation for the energy efficiency and dependability of CPUs and domain-specific architectures, including AI Accelerators.

References

“Meta's Second-Generation AI Chip: Model-Chip Co-Design and Productionization Experiences”, J. Coburn et al., ACM/IEEE International Symposium on Computer Architecture (ISCA 2025), Tokyo, Japan, June 2025.
“MTIA v1: Meta’s first-generation AI inference accelerator”, A. Firoozshahian, O. Wu, J. Coburn, R. Levenstein, Meta AI Blog, May 2023.
“Hardware Sentinel: Protecting Software Applications from Hardware Silent Data Corruptions”, R. Dutta et al., ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2025), Rotterdam, The Netherlands, March 2025.
“Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations”, D. Ma et al., ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2024), San Diego, CA, USA, April 2024.
“The Dark Side of Computing: Silent Data Corruptions”, D. Gizopoulos, IEEE Computer Magazine, vol. 58, no. 6, pp. 101–106, June 2025.
“Accurate Analysis of Silent Data Corruptions in Programmable AI Accelerator Microarchitectures”, O. Chatzopoulos, M. Trakosa, D. Gizopoulos, IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS 2025), Ischia, Italy, July 2025.
“BABIS: Exploring the Microarchitectures of Programmable AI Accelerators for Silent Data Corruptions”, O. Chatzopoulos, M. Trakosa, D. Gizopoulos, 9th Workshop on Cognitive Architectures 2025 (in conjunction with ISCA 2025) (CogArch 2025), Tokyo, Japan, June 2025.
“Veritas: Demystifying Silent Data Corruptions: μArch-Level Modeling and Fleet Data of Modern x86 CPUs”, O. Chatzopoulos, N. Karystinos, G. Papadimitriou, D. Gizopoulos, H. D. Dixit, and S. Sankar, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025), Las Vegas, USA, March 2025.
“From Gates to SDCs: Understanding Fault Propagation Through the Compute Stack”, O. Chatzopoulos, G. Papadimitriou, D. Gizopoulos, H. D. Dixit, and S. Sankar, Design, Automation, and Test in Europe (DATE 2025), Lyon, France, March 2025.
“Harpocrates: Breaking the Silence of CPU Faults through Hardware-in-the-Loop Program Generation”, N. Karystinos, O. Chatzopoulos, G. Fragkoulis, G. Papadimitriou, D. Gizopoulos, and S. Gurumurthi, ACM/IEEE International Symposium on Computer Architecture (ISCA 2024), Buenos Aires, Argentina, June 2024.
“GPU Reliability Assessment: Insights Across the Abstraction Layers”, L. Yang, G. Papadimitriou, D. Sartzetakis, A. Jog, E. Smirni, and D. Gizopoulos, IEEE International Conference on Cluster Computing (CLUSTER 2024), Kobe, Japan, September 2024.
“AVGI: Microarchitecture-Driven, Fast and Accurate Vulnerability Assessment”, G. Papadimitriou and D. Gizopoulos, HPCA 2023, Montreal, QC, Canada, February 2023.
“gpuFI-4: A Microarchitecture-Level Framework for Assessing the Cross-Layer Resilience of Nvidia GPUs”, D. Sartzetakis, G. Papadimitriou, and D. Gizopoulos, ISPASS 2022, Singapore, May 2022.
“Demystifying the System Vulnerability Stack: Transient Fault Effects Across the Layers”, G. Papadimitriou and D. Gizopoulos, ISCA 2021, June 2021.
“Demystifying Soft Error Assessment Strategies on ARM CPUs: Microarchitectural Fault Injection vs. Neutron Beam Experiments”, A. Chatzidimitriou, P. Bodmann, G. Papadimitriou, D. Gizopoulos, and P. Rech, DSN 2019, Portland, OR, USA, June 2019.
“RT Level vs. Microarchitecture Level Reliability Assessment: Case Study on ARM Cortex-A9 CPU”, A. Chatzidimitriou, M. Kaliorakis, D. Gizopoulos, M. Iacaruso, M. Pipponzi, R. Mariani, S. Di Carlo, DSN 2017, Denver, CO, USA, June 2017.
“Silent Data Corruptions: The Stealthy Saboteurs of Digital Integrity”, G. Papadimitriou, D. Gizopoulos, H. D. Dixit, S. Sankar, IOLTS 2023, July 2023.
“Silent Data Corruptions: Microarchitectural Perspectives”, G. Papadimitriou and D. Gizopoulos, IEEE Transactions on Computers, June 2023.
“Silent Data Corruptions at Scale”, H. D. Dixit, S. Pendharkar, M. Beadon, C. Mason, T. Chakravarthy, B. Muthiah, S. Sankar, arXiv:2102.11245, February 2021.
“Silent Data Corruptions in Computing Systems: Early Predictions and Large-Scale Measurements”, D. Gizopoulos, G. Papadimitriou, O. Chatzopoulos, N. Karystinos, H. Dixit, and S. Sankar, IEEE European Test Symposium (ETS 2024), The Hague, Netherlands, May 2024.
“Silent Data Corruptions in Computing: Understand and Quantify”, T. Macieira, S. Gurumurthy, S. Gurumurthi, A. Haggag, G. Papadimitriou, and D. Gizopoulos, IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS 2024), Rennes, France, July 2024.
“Detecting silent data corruptions in the wild”, H. Dixit, L. Boyle, G. Vunnam, S. Pendharkar, M. Beadon, S. Sankar, arXiv:2203.08989, March 2022.
“Anatomy of On-Chip Memory Hardware Fault Effects Across the Layers”, G. Papadimitriou and D. Gizopoulos, IEEE Transactions on Emerging Topics in Computing, Volume: 11, Issue: 2, pp. 420–431, April–June 2023.
“SDCs: A B C”, Dimitris Gizopoulos, Computer Architecture Today blog, September 16, 2024, SIGARCH.

Created by Computer Architecture Lab @ UoA
This work is licensed under a CC License

Contact information:
University of Athens – Dept. of Informatics and Telecommunications

Address:
Panepistimiopolis, Ilissia
Athens, Greece, GR 157 84

Phone: +30 210 727 5145
Email: dgizop AT di DOT uoa DOT gr
