October 30, 2025, Filed Under: 2025 Fall Semester, Current Semester
[Series 03] Advanced Fabrication Techniques, an Architect's Perspective
Title: Advanced Fabrication Techniques, an Architect's Perspective
Speaker: Jeff Stuecheli
Date: Tuesday, November 4th, 2025; 3:30pm
Location: EER 3.640/3.642 or Zoom Link
Abstract: In the "Post Moore's Law Era", advancements in computer systems have been enabled by a wide range of hardware/software features. This talk will focus on advanced packaging and silicon integration features such as 3D chip stacking. Understanding these new capabilities will be pivotal to building future systems. The talk will survey both deployed systems and publicly available technology roadmaps.
Bio: Dr. Stuecheli has been working in Austin since the late 90s, after completing his undergraduate degree at UT. He spent 25 years at IBM working on the Power line of high-end servers. His initial role was design verification (DV) on the Power4 product, but he transitioned into performance-centric architecture work for the Power6 "nest" (caches, coherence, prefetch, NoC, memory, etc.). Recognizing the role of the overall system in building optimized designs, his scope gradually expanded. In his later years, IBM attempted to grow beyond proprietary design through collaboration with companies like Nvidia and Google in the OpenPOWER project. Dr. Stuecheli then joined Nvidia, Google, and Tenstorrent for relatively short tenures. He currently works for Arm, where his focus is the development of architectural features that enable overall system optimization. While at IBM he completed graduate work at UT under Dr. Lizy John, and he remains active through participation on the program committees of various conferences (this year ISCA, MICRO, and HPCA).
September 24, 2025, Filed Under: 2025 Fall Semester, Current Semester
[Series 02] Rethinking the Control Plane for Chiplet-Based Heterogeneous Systems
Title: Rethinking the Control Plane for Chiplet-Based Heterogeneous Systems
Speaker: Matt Sinclair, University of Wisconsin-Madison
Date: Tuesday, October 14th, 2025; 3:30pm
Location: EER 3.640/3.642 or Zoom Link
Abstract: In recent years, system designers have increasingly turned to heterogeneous systems to improve performance and energy efficiency. Specialized accelerators are frequently used to improve the efficiency of computations that run inefficiently on conventional, general-purpose processors. As a result, systems ranging from smartphones to datacenters, hyperscalers, and supercomputers increasingly use large numbers of accelerators (including GPUs) while providing better efficiency than CPU-based solutions. GPUs in particular are widely used in these systems due to their combination of programmability and efficiency. Traditionally, GPUs are throughput-oriented, focus on data parallelism, and assume synchronization happens at a coarse granularity. However, programmers have begun using these systems for a wider variety of applications that exhibit different characteristics, including latency sensitivity, mixes of both task and data parallelism, and fine-grained synchronization. Thus, future heterogeneous systems must evolve and make deadline-aware scheduling, more intelligent data movement, efficient fine-grained synchronization, and effective power management first-order design constraints. In the first part of this talk, I will discuss our efforts to apply hardware-software co-design to help future heterogeneous systems overcome these challenges and improve performance, energy efficiency, and scalability. In the second part, I will discuss how the ongoing transition to chiplet-based heterogeneous systems exacerbates these challenges, and how we address them by rethinking the control plane.
Bio: Matt Sinclair is an Assistant Professor in the Computer Sciences Department at the University of Wisconsin-Madison. He is also an Affiliate Faculty member in the ECE Department and the Teaching Academy at UW-Madison. His research primarily focuses on how to design, program, and optimize future heterogeneous systems. He also builds tools for future heterogeneous systems, including serving on the gem5 Project Management Committee and the MLCommons Power, HPC, and Science Working Groups. He is a recipient of the DOE Early Career and NSF CAREER awards, and his work has been funded by the DOE, Google, NSF, and SRC. Matt's research has also been recognized several times, including an ACM Doctoral Dissertation Award nomination, a Qualcomm Innovation Fellowship, the David J. Kuck Outstanding PhD Thesis Award, and an ACM SIGARCH / IEEE Computer Society TCCA Outstanding Dissertation Award Honorable Mention. He is also the current steward of the ISCA Hall of Fame.
September 5, 2025, Filed Under: 2025 Fall Semester, Current Semester
[Series 01] FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
Title: FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
Speaker: Jianming Tong, Georgia Tech
Date: Tuesday, September 9th, 2025; 3:30pm
Location: EER 3.640/3.642 or Zoom Link
Abstract: The inference efficiency of diverse ML models on spatial accelerators comes down to the execution of different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude relative to a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfiguration, a non-trivial overhead that keeps ML accelerators from exploiting different dataflows and leads to suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array, termed NEST, and a novel multi-stage reduction network, called BIRRD, to perform flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. To systematically evaluate the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost-modeling and search framework, with layout assessment capabilities, and term the result Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on the ZCU104 edge FPGA. In Layoutloop, FEATHER delivers 1.27-2.89x inference latency speedup and 1.3-6.43x energy efficiency improvement over various state-of-the-art designs such as NVDLA, SIGMA, and Eyeriss on ResNet-50 and MobileNetV3. On practical FPGA devices, FEATHER achieves 2.65x/3.91x higher throughput than the Xilinx DPU/Gemmini. Remarkably, these performance and energy-efficiency gains come at only 6% additional area over a fixed-dataflow Eyeriss-like accelerator. Our code is available at https://github.com/maeri-project/FEATHER.
Bio: Jianming Tong (https://jianmingtong.github.io/) is a 4th-year PhD candidate at Georgia Tech and a visiting researcher at MIT. He focuses on full-stack optimizations, spanning model, system, compiler, and hardware, to enhance both the efficiency and privacy of AI systems. He proposed a framework that approximates non-linear ML operators as polynomials compatible with Homomorphic Encryption (HE) without sacrificing utility, enabling privacy-preserving ML via HE (MLSys'23); developed the CROSS compiler, which converts HE workloads into AI workloads that can be accelerated by existing Google TPUs, bringing immediate, scalable, low-cost privacy-preserving capability to existing AI stacks; and designed a dataflow-layout co-switching reconfigurable accelerator for efficient inference of dynamic AI workloads (ISCA'24). These works have been deployed at NVIDIA, Google, and IBM, and recognized with a Qualcomm Innovation Fellowship, a Machine Learning and Systems Rising Star award, a CreateX Startup Launch, and the GT NEXT Award.
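For readers unfamiliar with the term, the "dataflows" in the FEATHER abstract refer to choices of loop tiling, ordering, and parallelism when mapping an operator onto hardware. The sketch below is only an illustrative analogy, not FEATHER's actual schedules or hardware: it shows two loop orderings of the same matrix multiply, one "output-stationary" (each output accumulator is held while inputs stream past) and one "weight-stationary" (each weight is held while partial sums are updated). Both compute identical results, but they reuse data differently, which is exactly the kind of per-layer choice a dataflow-switching accelerator exploits.

```python
# Illustrative sketch: "dataflow" as loop ordering for C = A @ B.
# Both schedules compute the same result; they differ only in which
# operand stays resident while the loops iterate.

def matmul_output_stationary(A, B):
    # Output-stationary: accumulate each C[i][j] fully before moving on,
    # so the running sum stays in a register-like accumulator.
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C

def matmul_weight_stationary(A, B):
    # Weight-stationary: hold one "weight" B[k][j] fixed and update the
    # partial sums of a whole output column -- same math, different reuse.
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        for j in range(N):
            w = B[k][j]
            for i in range(M):
                C[i][j] += A[i][k] * w
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# Identical results from both schedules.
assert matmul_output_stationary(A, B) == matmul_weight_stationary(A, B)
```

On real accelerators, the analogous choice determines which operand is pinned in the PE array and how data is laid out on chip, which is why switching dataflows between layers normally requires the costly layout reordering that FEATHER hides in its reduction network.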