November 3, 2022

Title: HBM3 RAS: The Journey to Enhancing Die-Stacked DRAM Resilience at Scale
Speaker: Sudhanva Gurumurthi (AMD)
Date: November 8, 2022 at 3:30 pm
Location: EER 3.646 or Zoom

Abstract: HBM3 is the next generation of the JEDEC High Bandwidth Memory™ DRAM standard. HBM3 is expected to be widely used in future SoCs to accelerate data center and automotive workloads. Reliability, Availability, and Serviceability (RAS) are key requirements in most of these computing domains and use cases, and are essential for attaining sufficient resilience at scale. In the first part of the talk, we will review key terminology and concepts, explain the set of RAS challenges facing HBM3, and discuss key considerations for standardization. We will present data and analyses that justified the need for a new RAS architecture for HBM3. Next, we will present the overall solution space that was explored and the specific direction taken for HBM3, and explain why this path was chosen. Finally, we will present the details of the HBM3 RAS architecture and an evaluation of its resilience at scale.

Speaker Bio: Sudhanva Gurumurthi is a Fellow at AMD, where he leads advanced development in RAS. Prior to joining industry, Sudhanva was a tenured Associate Professor in the Computer Science Department at the University of Virginia. He is a recipient of an NSF CAREER Award, a Google Focused Research Award, an IEEE Computer Society Distinguished Contributor recognition, and several other awards and recognitions. Sudhanva has served as an editor for the IEEE Micro Top Picks from Computer Architecture Conferences special issue, IEEE Transactions on Computers, and IEEE Computer Architecture Letters. He also serves on the Advisory Council of the College of Science and Engineering at Texas State University. Sudhanva received his PhD in Computer Science and Engineering from Penn State in 2005.
October 8, 2022

Title: Accelerating the Pace of AWS Inferentia Chip Development, From Concept to End Customer Use
Speaker: Randy Huang, Amazon (AWS)
Date: October 18, 2022 at 3:30 pm
Location: EER 3.646

Abstract: In this talk, I will detail the process and the decisions we made to bring AWS Inferentia from a one-page press release to general availability. Our process starts with working backward from the customers and asking how we can bring real benefits to customers' use cases. We will show that by separating out one-way vs. two-way door decisions, we can navigate technical and strategic decisions at AWS velocity and bring a deep-learning accelerator to the marketplace quickly.

Bio: Randy is a principal engineer on Inferentia and Trainium, custom chips designed by AWS to enable highly cost-effective, low-latency inference and training performance at any scale. Prior to joining AWS, he led the architecture group at Tabula, designing and building three-dimensional field-programmable gate arrays (3-D FPGAs). Randy received his Ph.D. from the University of California, Berkeley.
September 17, 2022

Title: Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture
Speaker: Juan Gómez-Luna (el1goluj@gmail.com)
Date: September 20, 2022 at 3:30 pm
Location: EER 3.646 or Zoom

Abstract: Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated on the same chip.

This work provides the first comprehensive analysis of the first publicly available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of the PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation, conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs, provides new insights about the suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

Bio: Juan Gómez-Luna is a senior researcher and lecturer at the SAFARI Research Group at ETH Zürich. He received BS and MS degrees in Telecommunication Engineering from the University of Sevilla, Spain, in 2001, and a PhD degree in Computer Science from the University of Córdoba, Spain, in 2012. Between 2005 and 2017, he was a faculty member at the University of Córdoba. His research interests focus on processing-in-memory, memory systems, heterogeneous computing, and hardware and software acceleration of medical imaging and bioinformatics. He is the lead author of PrIM (https://github.com/CMU-SAFARI/prim-benchmarks), the first publicly available benchmark suite for a real-world processing-in-memory architecture, and Chai (https://github.com/chai-benchmarks/chai), a benchmark suite for heterogeneous systems with CPU/GPU/FPGA.
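The "memory-bound" argument in the abstract above can be illustrated with a back-of-envelope roofline-style calculation. The sketch below is not from the talk; the machine numbers (200 GFLOP/s peak compute, 50 GB/s DRAM bandwidth) are hypothetical, chosen only to show why a kernel with low data reuse, such as element-wise vector addition, cannot amortize the cost of main memory access:

```python
# Back-of-envelope roofline check: is element-wise vector add
# compute-bound or memory-bound? (Illustrative numbers, not from the talk.)

def arithmetic_intensity(flops_per_elem, bytes_per_elem):
    """FLOPs performed per byte moved to/from main memory."""
    return flops_per_elem / bytes_per_elem

# Vector add c[i] = a[i] + b[i] on float64:
# 1 FLOP per element; 3 x 8 bytes moved (read a, read b, write c).
ai_vec_add = arithmetic_intensity(flops_per_elem=1, bytes_per_elem=24)

# Hypothetical host CPU: 200 GFLOP/s peak compute, 50 GB/s DRAM bandwidth.
machine_balance = 200e9 / 50e9  # FLOPs/byte needed to saturate compute

# If the kernel's intensity falls below the machine balance, performance
# is limited by the memory bus -- exactly the bottleneck PIM targets.
memory_bound = ai_vec_add < machine_balance
print(f"intensity = {ai_vec_add:.3f} FLOP/B, balance = {machine_balance:.1f} FLOP/B")
print("memory-bound" if memory_bound else "compute-bound")
```

Here the kernel performs roughly 0.04 FLOP per byte while the hypothetical machine needs 4 FLOP per byte to keep its compute units busy, a two-orders-of-magnitude gap; moving such computation next to the DRAM arrays, as the UPMEM DPUs do, sidesteps the narrow bus entirely.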