The University of Texas at Austin

Seminars

April 2, 2024, Filed Under: 2024 Spring Seminar, Seminars

How does one bit-flip corrupt an entire deep neural network, and what to do about it

Title: How does one bit-flip corrupt an entire deep neural network, and what to do about it

Speaker: Yanjing Li (Department of Computer Science at the University of Chicago)

Date: April 16, 2024 at 3:30 pm

Location: EER 3.640/3.642 or Zoom

Abstract: Deep neural networks are increasingly susceptible to hardware failures. The impact of hardware failures on these workloads is severe – even a single bit-flip can corrupt an entire network during both training and inference. The urgency of tackling this challenge, known more broadly as the Silent Data Corruption challenge, has been raised widely by both industry and academia.

In this talk, I will first present the first in-depth resilience study targeting DNN workloads and hardware failures that occur in the logic portion of deep learning accelerator systems. The study includes a comprehensive characterization of hardware failure effects, along with a fundamental understanding of how hardware failures propagate through hardware devices and interact with the workloads. Next, based on the insights from our study, I will present ultra-lightweight yet highly effective techniques for mitigating hardware failures in deep learning accelerator systems.
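To make the failure mode concrete, here is a minimal Python sketch (an illustration, not material from the talk) of why one flipped bit matters so much: flipping the most significant exponent bit of an IEEE 754 float32 weight turns 0.5 into roughly 1.7e38, a value that then swamps every multiply-accumulate it feeds.

    import struct

    def flip_bit(value: float, bit: int) -> float:
        """Flip one bit of a float32 and reinterpret the result."""
        (as_int,) = struct.unpack("<I", struct.pack("<f", value))
        (corrupted,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
        return corrupted

    weight = 0.5
    # Bit 30 is the most significant exponent bit of an IEEE 754 float32.
    print(weight, "->", flip_bit(weight, 30))    # 0.5 -> ~1.7e38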

Speaker Bio: Yanjing Li is an Assistant Professor in the Department of Computer Science at the University of Chicago. Prior to joining the university, she was a senior research scientist at Intel Labs. Professor Li received her Ph.D. in Electrical Engineering from Stanford University, an M.S. in Mathematical Sciences (with honors) and a B.S. in Electrical and Computer Engineering (with a double major in Computer Science) from Carnegie Mellon University.

Professor Li has received various awards, including the NSF CAREER Award, the DAC Under-40 Innovators Award, a Google Research Scholar Award, the Intel Labs Gordy Academy Award (the highest honor in Intel Labs) and several other Intel recognition awards, an outstanding dissertation award from the European Design and Automation Association, and multiple best paper awards (for example, at the IEEE VLSI Test Symposium and the IEEE International Test Conference). Professor Li’s seminal work on in-field self-test and diagnostics has been adopted by various companies, including Intel, Nvidia, Synopsys, TI, and Mentor Graphics.

November 3, 2022, Filed Under: 2022 Fall Seminar, Seminars

HBM3 RAS: The Journey to Enhancing Die-Stacked DRAM Resilience at Scale

Title: HBM3 RAS: The Journey to Enhancing Die-Stacked DRAM Resilience at Scale

Speaker: Sudhanva Gurumurthi (AMD)

Date: November 8, 2022 at 3:30 pm

Location: EER 3.646 or Zoom

Abstract:
HBM3 is the next-generation technology of the JEDEC High Bandwidth Memory™ DRAM standard and is expected to be widely used in future SoCs to accelerate data center and automotive workloads. Reliability, Availability, and Serviceability (RAS) are key requirements in most of these computing domains and use cases, and essential for attaining sufficient resilience at scale. In the first part of the talk, we will review key terminology and concepts, explain the RAS challenges facing HBM3, and discuss key considerations for standardization. Data and analyses will be presented that justified the need for a new RAS architecture for HBM3. Next, we will present the overall solution space that was explored and the specific direction taken for HBM3, and explain why this path was chosen. Finally, we will present the details of the HBM3 RAS architecture and an evaluation of its resilience at scale.
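As background for the error-correction side of RAS, the textbook building block behind DRAM ECC is a single-error-correcting Hamming code. The sketch below is illustrative only (the HBM3 RAS architecture presented in the talk is far more sophisticated): it encodes 4 data bits into a 7-bit codeword and uses the recomputed parity syndrome to locate and repair any single flipped bit.

    def hamming74_encode(d1, d2, d3, d4):
        # Each parity bit covers an overlapping subset of the data bits.
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p4 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p4, d2, d3, d4]      # codeword positions 1..7

    def hamming74_correct(c):
        # The syndrome spells out the position of the flipped bit (0 = clean).
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s4
        if syndrome:
            c[syndrome - 1] ^= 1
        return c

    word = hamming74_encode(1, 0, 1, 1)
    word[4] ^= 1                                  # inject a single-bit fault
    assert hamming74_correct(word) == hamming74_encode(1, 0, 1, 1)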

Speaker Bio:
Sudhanva Gurumurthi is a Fellow at AMD, where he leads advanced development in RAS. Prior to joining industry, Sudhanva was an Associate Professor with tenure in the Computer Science Department at the University of Virginia. He is a recipient of an NSF CAREER Award, a Google Focused Research Award, an IEEE Computer Society Distinguished Contributor recognition, and several other awards and recognitions. Sudhanva has served as an editor for the IEEE Micro Top Picks from Computer Architecture Conferences special issue, IEEE Transactions on Computers, and IEEE Computer Architecture Letters. He also serves on the Advisory Council of the College of Science and Engineering at Texas State University. Sudhanva received his PhD in Computer Science and Engineering from Penn State in 2005.

October 8, 2022, Filed Under: 2022 Fall Seminar, Seminars

Accelerating the Pace of AWS Inferentia Chip Development, From Concept to End Customer Use

Speaker: Randy Huang, Amazon (AWS)

Date: October 18, 2022 at 3:30 pm

Location: EER 3.646

Abstract: In this talk, I will detail the process and the decisions we made to bring AWS Inferentia from a one-page press release to general availability. Our process starts with working backward from customers and asking how we can bring real benefits to their use cases. We will show that by separating one-way from two-way door decisions, we can navigate technical and strategic decisions at AWS velocity and bring a deep-learning accelerator to market quickly.

Bio: Randy is a principal engineer for Inferentia and Trainium, custom chips designed by AWS to deliver highly cost-effective, low-latency inference and training performance at any scale. Prior to joining AWS, he led the architecture group at Tabula, designing and building three-dimensional field-programmable gate arrays (3D FPGAs). Randy received his Ph.D. from the University of California, Berkeley.

September 17, 2022, Filed Under: 2022 Fall Seminar, Seminars

Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

Speaker: Juan Gómez-Luna (ETH Zürich)

Date: September 20, 2022 at 3:30 pm

Location: EER 3.646 or Zoom

Abstract: Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM).
Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.
This work provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation, conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs, provides new insights about the suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.
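Why such workloads are memory-bound can be seen with a back-of-the-envelope sketch (illustrative numbers, not from the paper): a streaming float32 vector add performs one floating-point operation for every twelve bytes moved, far below the FLOP-per-byte ratio at which modern CPUs and GPUs stop being bandwidth-limited.

    # Arithmetic intensity (FLOP per byte) of c[i] = a[i] + b[i] on float32.
    n = 10**8                    # elements per vector (arbitrary choice)
    flops = n                    # one add per element
    bytes_moved = 3 * n * 4      # read a[i], read b[i], write c[i]
    print(f"{flops / bytes_moved:.3f} FLOP/byte")    # 0.083

    # A hypothetical machine with 1 TFLOP/s of compute per 100 GB/s of DRAM
    # bandwidth needs ~10 FLOP/byte to keep its ALUs busy; at 0.083 the
    # cores stall on memory. PIM attacks exactly this data-movement
    # bottleneck by computing inside the memory itself.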

Bio: Juan Gómez-Luna is a senior researcher and lecturer at SAFARI Research Group @ ETH Zürich. He received the BS and MS degrees in Telecommunication Engineering from the University of Sevilla, Spain, in 2001, and the PhD degree in Computer Science from the University of Córdoba, Spain, in 2012.
Between 2005 and 2017, he was a faculty member of the University of Córdoba. His research interests focus on processing-in-memory, memory systems, heterogeneous computing, and hardware and software acceleration of medical imaging and bioinformatics. He is the lead author of PrIM (https://github.com/CMU-SAFARI/prim-benchmarks), the first publicly-available benchmark suite for a real-world processing-in-memory architecture, and Chai (https://github.com/chai-benchmarks/chai), a benchmark suite for heterogeneous systems with CPU/GPU/FPGA.

February 4, 2020, Filed Under: 2020 Spring Seminar, Seminars

Meeting the Systems Challenge of Deep Learning

Speaker: Andreas Herkersdorf, TU Munich
Date: February 4, 2020
Location: EER 3.646

Data access latencies and bandwidth bottlenecks frequently represent major limiting factors for the computational effectiveness of multi- and many-core processor architectures. This talk introduces two conceptually complementary approaches to reduce synchronization overheads for coherence maintenance and to improve the locality between computing resources and data: region-based cache coherence and near-memory acceleration.

A 2D array of compute tiles with multiple heterogeneous RISC cores, two levels of caches, and a tile-local SRAM serves as the reference processing platform. Compute tiles, I/O tiles, and globally shared DDR SDRAM memory tiles are interconnected by a meshed Network on Chip (NoC) with support for multiple quality-of-service levels. Overall, this processing architecture follows a distributed-shared-memory model. The limited degree of parallelism in many embedded computing applications also bounds the number of compute tiles that can share associated data structures. We therefore envision region-based cache coherence (RBCC) among a limited working set of compute tiles rather than global coherence approaches. Coherence regions can be reconfigured dynamically at runtime and comprise an arbitrary set of (adjacent or non-adjacent) compute tiles, which are interconnected through regular NoC channels for the exchange of coherence protocol messages. We will show that region-based coherence permits substantially smaller coherence directories (e.g., approximately 40% smaller for 16-tile systems with up to 4 tiles per region) and shorter sharer-checking latencies than global coherence.
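The directory-size argument can be illustrated with a toy calculation (the ~40% figure above also accounts for tags and other per-entry state, so the sketch below deliberately oversimplifies): a full-map directory keeps one sharer bit per tile for every tracked line, whereas a region-based directory only needs sharer bits for the tiles inside a region.

    # Toy sharer-vector sizing: full-map vs. region-based directory.
    def sharer_bits(tiles_tracked: int, tracked_lines: int) -> int:
        return tiles_tracked * tracked_lines

    lines = 4096    # number of tracked directory entries (assumed)
    print("full map, 16 tiles:", sharer_bits(16, lines), "bits")
    print("region of 4 tiles :", sharer_bits(4, lines), "bits")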

Near-memory processing is an alternative concept to increase data/task locality by means of near-memory accelerators (NMAs). An NMA places processing resources for specific forms of data manipulation as close as possible to the data memory. The evident benefits are reduced global interconnect usage, shorter access latencies and, thus, higher compute efficiency. In distributed-shared-memory architectures, where accelerator units can be affiliated with different tile-local SRAMs as well as with the globally shared DDR SDRAM, near-memory acceleration requires thorough consideration of task mapping as well as task and data migration into and among compute tiles.


Andreas Herkersdorf is a professor in the Department of Electrical and Computer Engineering and is also affiliated with the Department of Informatics at the Technical University of Munich (TUM). He received his doctorate from ETH Zürich, Switzerland, in 1991. Between 1988 and 2003, he held technical and management positions at the IBM Research Laboratory in Rüschlikon, Switzerland.

Since 2003, Dr. Herkersdorf has headed the Chair of Integrated Systems at TUM. He is a senior member of the IEEE, a member of the DFG (German Research Foundation) Review Board, and serves as an editor for Springer and De Gruyter journals on design automation and information technology. His research interests include application-specific multi-processor architectures, IP network processing, Network on Chip, and self-adaptive fault-tolerant computing.

December 5, 2017, Filed Under: 2017 Fall Seminar, Seminars

Glaring Gaps in Neurally-Inspired Computing

Speaker: Mikko Lipasti, University of Wisconsin Madison

Date: December 5, 2017

November 28, 2017, Filed Under: 2017 Fall Seminar, Seminars

Computer Systems for Neuroscience

Speaker: Abhishek Bhattacharjee, Rutgers University

Date: November 28, 2017

Time: 3:30 pm

Location: POB 2.402

Title: Computer Systems for Neuroscience


Abstract

Computer systems are vital to advancing our understanding of the brain. From embedded chips in brain implants, to server systems used for large-scale brain modeling frameworks, computer systems help shed light on the link between low-level neuronal activity and the brain’s behavioral and cognitive operation. This talk will show the challenges facing such computer systems. We will discuss the extreme energy needs of hardware used in brain implants, and the challenges posed by the computational and data requirements of large-scale brain modeling software. To address these problems, we will discuss recent results from my lab on augmenting hardware to operate harmoniously with software and even the underlying biology of these systems. For example, we will show that perceptron-based hardware branch predictors can be co-opted to predict neuronal spiking activity and can guide power management on brain implants. Further, we will show that the virtual memory layer is a performance bottleneck in server systems for brain modeling software, but that intelligent coordination with the OS layer can counteract many of the memory management problems faced by these systems. Overall, this talk offers techniques that can continue to aid the development of neuroscientific tools. 
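The branch predictor being co-opted here is the classic perceptron predictor of Jiménez and Lin: per-branch weights are dotted with the recent outcome history, the sign of the sum gives the prediction, and the weights are nudged after mispredictions or low-confidence predictions. A minimal sketch (illustrative; the talk applies the same machinery to neuronal spike trains):

    HISTORY = 16
    THRESHOLD = 1.93 * HISTORY + 14     # training threshold from the literature

    class Perceptron:
        def __init__(self):
            self.w = [0] * (HISTORY + 1)           # w[0] is the bias weight

        def output(self, history):                 # history entries are +1/-1
            return self.w[0] + sum(w * h for w, h in zip(self.w[1:], history))

        def train(self, history, outcome):         # outcome: +1 taken, -1 not taken
            y = self.output(history)
            if (y >= 0) != (outcome >= 0) or abs(y) <= THRESHOLD:
                self.w[0] += outcome
                for i, h in enumerate(history):
                    self.w[i + 1] += outcome * h

    # Toy demo: the predictor quickly learns an always-taken branch.
    p, hist = Perceptron(), [1] * HISTORY
    for _ in range(50):
        p.train(hist, +1)
    print("predict taken?", p.output(hist) >= 0)   # True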


Speaker Biography

Abhishek Bhattacharjee is an Associate Professor of Computer Science at Rutgers University. He is also a 2017 CV Starr Fellow at the Princeton Neuroscience Institute. His research interests lie at the hardware/software interface. Some of the research results from his lab are in widespread commercial use and are implemented in AMD’s latest line of processors and the Linux OS. Abhishek is a recipient of the NSF CAREER Award, research awards from Google and VMware, and the Chancellor’s Award for Faculty Excellence in Research at Rutgers.

October 31, 2017, Filed Under: 2017 Fall Seminar, Seminars

Teaching Deployed Data Centers New Tricks

Speaker: Derek Chiou, Microsoft

Date: October 31, 2017    

Time: 3:45 pm

Location: Avaya Auditorium, POB 2.302

Title: Teaching Deployed Data Centers New Tricks


Abstract

The cloud is an area of intense competition and rapid innovation.  Cloud companies are highly incentivized to provide useful, performant differentiated services rapidly and cost-effectively. In this talk, I will describe Microsoft’s approach to enable such services using strategically placed reconfigurable logic, discuss how that introduction can fundamentally change our data center architecture, and show some specific uses and describe their benefits. 


Speaker Biography

Derek Chiou is a Partner Architect at Microsoft, where he leads the Azure Cloud Silicon team working on FPGAs and ASICs for data center applications and infrastructure, and a researcher in the Electrical and Computer Engineering Department at The University of Texas at Austin. Until 2016, he was an associate professor at UT. His research areas are novel uses of FPGAs, high-performance computer simulation, rapid system design, computer architecture, parallel computing, Internet router architecture, and network processors. Before going to UT, Dr. Chiou was a system architect and led the performance modeling team at Avici Systems, a manufacturer of terabit core routers. Dr. Chiou received his Ph.D., S.M., and S.B. degrees in Electrical Engineering and Computer Science from MIT.

October 31, 2017, Filed Under: 2017 Fall Seminar, Seminars

Maximizing Server Efficiency: from microarchitecture to machine-learning accelerators

Speaker: Mike Ferdman, Stony Brook University

Date: October 31, 2017

Time: 2:30 pm

Location: POB 2.402

Title: Maximizing Server Efficiency: from microarchitecture to machine-learning accelerators


Abstract

Deep convolutional neural networks (CNNs) are rapidly becoming the dominant approach to computer vision and a major component of many other pervasive machine learning tasks, such as speech recognition, natural language processing, and fraud detection. As a result, accelerators for efficiently evaluating CNNs are rapidly growing in popularity. Our work in this area focuses on two key challenges: minimizing off-chip data transfer and maximizing the utilization of the computation units. In this talk, I will present an overview of my research on understanding and improving the efficiency of server systems, and dive deeper into our recent results on FPGA-based server accelerators for machine learning.
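Minimizing off-chip data transfer usually comes down to scheduling loops so that data staged in on-chip buffers is reused many times before eviction. A generic loop-tiling sketch (illustrative only, not the speaker's accelerator design) using matrix multiply, the core computation of CNN layers:

    import numpy as np

    def tiled_matmul(A, B, tile=32):
        # Blocked matrix multiply: each tile staged "on chip" is reused for an
        # entire tile of output, cutting off-chip traffic versus streaming
        # every operand from memory for every multiply-accumulate.
        n = A.shape[0]
        C = np.zeros((n, n), dtype=A.dtype)
        for i in range(0, n, tile):
            for j in range(0, n, tile):
                for k in range(0, n, tile):
                    C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
        return C

    A = np.random.rand(128, 128).astype(np.float32)
    B = np.random.rand(128, 128).astype(np.float32)
    assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)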


Speaker Biography

Mike Ferdman is an Assistant Professor of Computer Science at Stony Brook University, where he co-directs the Computer Architecture Stony Brook (COMPAS) Lab. His research interests are in the area of computer architecture, with particular emphasis on the server computing stack. His current projects center on FPGA accelerators for machine learning, emerging memory technologies, and speculative micro-architectural techniques. Mike received a BS in Computer Science, and BS, MS, and PhD in Electrical and Computer Engineering from Carnegie Mellon University.
