UT Austin Computer Architecture Seminar Series
Welcome to the Computer Architecture Seminar Series at The University of Texas at Austin. Our seminar talks are open to anyone interested, whether affiliated with UT or not, at no charge. No prior registration is needed; feel free to just show up!
Location: EER, near the northeast corner of 24th and Speedway. Time: typically Tuesdays from 3:30pm – 5:00pm. Check each talk's details to confirm, as the day, time, and location can vary depending on the speaker.
Please see below for this semester's schedule. The Information page includes more details, including how to join our mailing list.
Current Schedule
[Series 03] Advanced Fabrication Techniques, an Architect's Perspective Title: Advanced Fabrication Techniques, an Architect's Perspective Speaker: Jeff Stuecheli Date: Tuesday Nov 4th, 2025; 3:30pm Location: EER 3.640/3.642 or Zoom Link Abstract: In the “Post Moore’s Law Era”, advancements in computer systems have been enabled by a wide range of hardware/software features. This talk will focus on ‘advanced’ packaging and Si integration features such as 3D chip stacking. Understanding these new capabilities will be pivotal to building future systems. The talk will survey both deployed systems and publicly available technology roadmaps. Bio: Dr. Stuecheli has been working in Austin since the late 90s, after completing his undergraduate degree at UT. He spent 25 years at IBM working on the Power line of high-end servers. His initial role was DV on the Power4 product, but he transitioned into performance-centric architecture work for the Power6 “nest” (caches, coherence, prefetch, NoC, memory, etc.). Recognizing the role of the overall system in building optimized designs, his scope gradually expanded. In his later years there, IBM attempted to grow beyond proprietary design through collaboration with companies like Nvidia and Google in the OpenPOWER project. Dr. Stuecheli then joined Nvidia, Google, and Tenstorrent for relatively short tenures. He currently works for Arm, where his focus is the development of architectural features to enable overall system optimization. While at IBM he completed graduate work at UT under Dr. Lizy John, and he remains active through participation on the program committees of various conferences (this year ISCA, MICRO, and HPCA).
[Series 02] Rethinking the Control Plane for Chiplet-Based Heterogeneous Systems Title: Rethinking the Control Plane for Chiplet-Based Heterogeneous Systems Speaker: Matt Sinclair, University of Wisconsin-Madison Date: Tuesday October 14th, 2025; 3:30pm Location: EER 3.640/3.642 or Zoom Link Abstract: In recent years, system designers have increasingly been turning to heterogeneous systems to improve performance and energy efficiency. Specialized accelerators are frequently used to improve the efficiency of computations that run inefficiently on conventional, general-purpose processors. As a result, systems ranging from smartphones to datacenters, hyperscalers, and supercomputers are increasingly using large numbers of accelerators (including GPUs) while providing better efficiency than CPU-based solutions. In particular, GPUs are widely used in these systems due to their combination of programmability and efficiency. Traditionally, GPUs are throughput-oriented, focused on data parallelism, and assume synchronization happens at a coarse granularity. However, programmers have begun using these systems for a wider variety of applications which exhibit different characteristics, including latency-sensitivity, mixes of both task and data parallelism, and fine-grained synchronization. Thus, future heterogeneous systems must evolve and make deadline-aware scheduling, more intelligent data movement, efficient fine-grained synchronization, and effective power management first-order design constraints. In the first part of this talk, I will discuss our efforts to apply hardware-software co-design to help future heterogeneous systems overcome these challenges and improve performance, energy efficiency, and scalability. Then, in the second part I will discuss how the ongoing transition to chiplet-based heterogeneous systems exacerbates these challenges and how we address these challenges in chiplet-based heterogeneous systems by rethinking the control plane. Bio: Matt Sinclair is an Assistant Professor in the Computer Sciences Department at the University of Wisconsin-Madison. He is also an Affiliate Faculty in the ECE Department and Teaching Academy at UW-Madison. His research primarily focuses on how to design, program, and optimize future heterogeneous systems. He also designs the tools for future heterogeneous systems, including serving on the gem5 Project Management Committee and the MLCommons Power, HPC, and Science Working Groups. He is a recipient of the DOE Early Career and NSF CAREER awards, and his work has been funded by the DOE, Google, NSF, and SRC. Matt’s research has also been recognized several times, including an ACM Doctoral Dissertation Award nomination, a Qualcomm Innovation Fellowship, the David J. Kuck Outstanding PhD Thesis Award, and an ACM SIGARCH – IEEE Computer Society TCCA Outstanding Dissertation Award Honorable Mention. He is also the current steward for the ISCA Hall of Fame.
[Series 01] FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching Title: FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching Speaker: Jianming Tong, Georgia Tech Date: Tuesday September 9th, 2025, 3:30pm Location: EER 3.640/3.642 or Zoom Link Abstract: The inference efficiency of diverse ML models over spatial accelerators boils down to the execution of different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfigurations, leading to non-trivial overhead that hinders ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed NEST and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. For systematically evaluating the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term it Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement compared to various SoTAs like NVDLA, SIGMA, and Eyeriss under ResNet-50 and MobileNet-V3 in Layoutloop. On practical FPGA devices, FEATHER achieves 2.65/3.91x higher throughput than Xilinx DPU/Gemmini. Remarkably, such performance and energy efficiency enhancements come at only a 6% area overhead over a fixed-dataflow Eyeriss-like accelerator. Our code is available at https://github.com/maeri-project/FEATHER. Bio: Jianming Tong (https://jianmingtong.github.io/) is a 4th-year PhD candidate at Georgia Tech and a visiting researcher at MIT. He focuses on full-stack optimizations—spanning model, system, compiler, and hardware—for enhancing both the efficiency and privacy of AI systems. He proposed a framework to approximate non-linear ML operators as polynomials so that they are compatible with Homomorphic Encryption (HE) without sacrificing utility, enabling privacy-preserving ML via HE (MLSys’23); developed the CROSS compiler to convert HE workloads into AI workloads that can be accelerated by existing Google TPUs, bringing immediate, scalable, low-cost privacy-preserving capability to existing AI stacks; and designed a dataflow-layout co-switching reconfigurable accelerator for efficient inference of dynamic AI workloads (ISCA’24). These works are widely deployed at NVIDIA, Google, and IBM, and have been recognized by the Qualcomm Innovation Fellowship, the Machine Learning and Systems Rising Star award, CreateX Startup Launch, and the GT NEXT Award.
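To make the notion of a dataflow concrete, the toy Python sketch below (my own illustration with assumed tile counts and a deliberately tiny on-chip buffer, not FEATHER's NEST/BIRRD design) counts how often operand tiles of a tiled matrix multiply must be re-fetched under different loop orderings; the ordering alone changes the fetch count, which is the kind of gap an optimal per-layer dataflow exploits.

```python
# Toy illustration (assumed sizes; not FEATHER itself): how the loop ordering of
# a tiled matrix multiply C[m,n] += A[m,k] * B[k,n] changes operand tile re-fetches.
import itertools

def tile_fetches(order, M=8, N=2, K=4):
    """Count A- and B-tile fetches, assuming only the most recently used tile of
    each operand can stay on chip between consecutive iterations."""
    dims = {"m": range(M), "n": range(N), "k": range(K)}
    fetches, last_a, last_b = 0, None, None
    for idx in itertools.product(*(dims[d] for d in order)):
        p = dict(zip(order, idx))
        a_tile, b_tile = (p["m"], p["k"]), (p["k"], p["n"])
        if a_tile != last_a:
            fetches, last_a = fetches + 1, a_tile
        if b_tile != last_b:
            fetches, last_b = fetches + 1, b_tile
    return fetches

for order in ("mnk", "mkn", "nkm"):
    print(f"loop order {order}: {tile_fetches(order)} tile fetches")
```

With these assumed sizes the three orderings already differ by nearly 2x in tile fetches; at real layer sizes, and with parallelism and layout added to the picture, the gaps grow into the large per-layer differences the abstract cites.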
[Series 03] Enabling Ahead Prediction with Practical Energy Constraints Title: Enabling Ahead Prediction with Practical Energy Constraints Speaker: Lingzhe Chester Cai, PhD Student, UT ECE Date: Tuesday April 15th, 2025, 3:30pm Location: EER 1.518 or Zoom Link Abstract: Decades of research on branch prediction have resulted in complex prediction algorithms and large lookup tables, leading to multi-cycle prediction latencies that adversely impact performance. Ahead prediction is a proposed solution to the predictor latency problem, but it drastically increases prediction energy, as exponentially more entries are read out for each branch skipped, making such a predictor impractical to build. In this talk, I will show that only a few missing-history patterns are observed during a program’s runtime. Using this insight, we present a new approach for building ahead predictors that does not require reading exponentially more entries for large ahead distances. Our ahead predictor provides a 4.4% performance improvement while increasing power by only 1.5x, as opposed to prior designs that incur a 14.6x energy overhead. By hiding the predictor latency from the rest of the pipeline, our work allows for larger and more complex predictors and better pipeline width scaling. In addition, our work implies that the direction of an easy-to-predict branch does not need to be pushed into the history, presenting opportunities for future branch predictor design. Bio: Chester Cai is a 7th-year PhD student studying CPU microarchitecture under Professor Yale Patt. His research focuses on the CPU frontend, specifically branch prediction, balancing predictor accuracy, latency, and throughput. Before joining UT Austin, he obtained his bachelor’s degree in Computer Engineering from Rose-Hulman Institute of Technology.
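As a rough illustration of the energy problem and the insight described in the abstract above, the sketch below (a synthetic loop-like branch trace and assumed numbers, not the actual predictor design) compares the 2^N table entries a naive N-ahead predictor would read against the number of missing-history patterns that actually appear in a repetitive trace.

```python
# A minimal sketch: naive ahead prediction must cover every possible outcome
# combination of the N skipped branches (2^N entries), yet a repetitive program
# tends to exercise only a handful of those missing-history patterns.
from collections import Counter

def naive_entries_read(ahead_distance):
    # One entry per possible outcome combination of the skipped branches.
    return 2 ** ahead_distance

def observed_patterns(trace, ahead_distance):
    # Distinct missing-history patterns that actually occur in the trace.
    windows = (tuple(trace[i:i + ahead_distance])
               for i in range(len(trace) - ahead_distance))
    return len(Counter(windows))

# Synthetic loop-dominated trace: taken 7 times, then not taken once, repeated,
# a crude stand-in for the biased, repetitive behavior of real code.
trace = ([1] * 7 + [0]) * 1000

for d in (2, 4, 6, 8):
    print(f"ahead distance {d}: naive reads {naive_entries_read(d)} entries, "
          f"patterns actually observed: {observed_patterns(trace, d)}")
```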
[Series 02] RESCQ: Realtime Scheduling for Continuous Angle Quantum Error Correction Architectures Title: RESCQ: Realtime Scheduling for Continuous Angle Quantum Error Correction Architectures Speaker: Sayam Sethi, PhD Student, UT ECE Date: Tuesday April 8th, 2025, 3:30pm Location: EER 1.518 or Zoom Link Abstract: In order to realize large-scale quantum error correction (QEC), resource states, such as |T〉, must be prepared, which is expensive in both space and time. To circumvent this problem, alternatives have been proposed, such as the production of continuous angle rotation states. However, the production of these states is non-deterministic and may require multiple repetitions to succeed. The original proposals suggest architectures which do not account for realtime (or dynamic) management of resources to minimize total execution time. Without a realtime scheduler, a statically generated schedule will be unnecessarily expensive. We propose RESCQ (pronounced rescue), a realtime scheduler for programs compiled onto these continuous angle systems. Our scheme actively minimizes total cycle count by on-demand redistribution of resources based on expected production rates. Depending on the underlying hardware, this can cause excessive classical control overhead. We further address this by dynamically selecting the frequency of our recomputation. RESCQ improves over baseline proposals by an average of 2x in cycle count. Bio: Sayam Sethi is a PhD student in the ECE Department at The University of Texas at Austin, advised by Dr. Jonathan Baker. He is currently interested in architectural design for realizing Fault-Tolerant Quantum Computers (FTQC), with a specific focus on scheduling realtime operations and minimizing program runtime. Before joining UT, he obtained his B. Tech. in Computer Science and Engineering from IIT Delhi.
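As a hedged illustration of the "expected production rates" reasoning in the abstract above (not the RESCQ algorithm itself), the sketch below uses the fact that a preparation succeeding with probability p per attempt takes 1/p attempts in expectation, and splits a fixed pool of factories across consumers in proportion to demand divided by success probability; all numbers are hypothetical.

```python
# Toy resource-rate calculation (illustrative only, not RESCQ): weight each
# consumer by demand / success_probability, i.e. by its expected production work.
def factory_allocation(demands, success_probs, total_factories):
    """Split total_factories across consumers in proportion to expected work."""
    work = [d / p for d, p in zip(demands, success_probs)]
    total = sum(work)
    return [total_factories * w / total for w in work]

# Hypothetical numbers: two program regions need 100 and 50 states per window,
# with per-attempt success probabilities 0.5 and 0.9.
print(factory_allocation([100, 50], [0.5, 0.9], total_factories=12))
# -> roughly [9.4, 2.6]: the low-yield consumer gets most of the factories.
```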
Welcome to CompArch 2025 Spring
UT Austin Computer Architecture Seminar Series 2025 Spring
Sponsored by:
Date | Series | Topic | Speaker
January 24, 2025 | Series 01 | Securing Computer Systems using AI Methods and for AI Applications | Mulong Luo
April 8, 2025 | Series 02 | RESCQ: Realtime Scheduling for Continuous Angle Quantum Error Correction Architectures | Sayam Sethi
April 15, 2025 | Series 03 | Enabling Ahead Prediction with Practical Energy Constraints | Lingzhe Chester Cai
[Series 01] Securing Computer Systems using AI Methods and for AI Applications Title: Securing Computer Systems using AI Methods and for AI Applications Speaker: Mulong Luo, Postdoctoral Researcher, UT ECE Date: Friday January 24, 2025, 3:30pm Location: EER 0.806/0.808 or Zoom Link Abstract: Securing modern computer systems against an ever-evolving threat landscape is a significant challenge that requires innovative approaches. Recent developments in artificial intelligence (AI), such as large language models (LLMs) and reinforcement learning (RL), have achieved unprecedented success in everyday applications. However, AI serves as a double-edged sword for computer systems security. On one hand, the superhuman capabilities of AI enable the exploration and detection of vulnerabilities without the need for human experts. On the other hand, specialized systems required to implement new AI applications introduce novel security vulnerabilities. In this talk, I will first present my work on applying AI methods to system security. Specifically, I leverage reinforcement learning to explore microarchitecture attacks in modern processors. Additionally, I will discuss the use of multi-agent reinforcement learning to improve the accuracy of detectors against adaptive attackers. Next, I will highlight my research on the security of AI systems, focusing on retrieval-augmented generation (RAG)-based LLMs and autonomous vehicles. For RAG-based LLMs, my ConfusedPilot work demonstrates how an attacker can compromise confidentiality and integrity guarantees by sharing a maliciously crafted document. For autonomous vehicles, I reveal a software-based cache side-channel attack capable of leaking the physical location of a vehicle without detection. Finally, I will outline future directions for building secure systems using AI methods and ensuring the security of AI systems. Bio: Mulong Luo is currently a postdoctoral researcher at the University of Texas at Austin hosted by Mohit Tiwari. His research interests lie broadly in applying AI methods to computer architecture and system security, as well as improving the security of AI systems, including LLMs and autonomous vehicles. He was selected as a 2023 CPS Rising Star. His paper was selected as a finalist in Top Picks in Hardware and Embedded Security 2022. He also received the best paper award at CPS-SPC 2018. Mulong received his Ph.D. from Cornell University in 2023, advised by Edward Suh. He received his MS from UCSD and his BS from Peking University.
Welcome to CompArch 2024 Fall
UT Austin Computer Architecture Seminar Series 2024 Fall
Sponsored by:
Date | Series | Topic | Speaker
September 10, 2024 | Series 01a | Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance | Ataberk Olgun, ETH Zürich
September 11, 2024 | Series 01b | Enabling the Adoption of Data-Centric Systems: Hardware/Software Support for Processing-Using-Memory Architectures | Geraldo F. Oliveira, ETH Zürich
September 17, 2024 | Series 02 | Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU | Kristof Denolf & Joseph Melber
October 1, 2024 | Series 03 | Reliable Processing-in-Memory | Jeageun Jung
October 8, 2024 | Series 04 | Characterization of network proxies in micro-service orchestration | Prateek Sahu
October 29, 2024 | Series 05 | FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling | Dinesh Gaitonde & Abhishek Kumar Jain
November 1, 2024 | Series 06 | Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures | Alexandros Daglis
November 7, 2024 | Series 07 | Resource-efficient AI System Design | Ana Klimović
November 7, 2024 | Series 08 | Enabling Efficient Memory Systems using Novel Compression Methods | Per Stenström
[Series 06] Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures Speaker: Dr. Alexandros Daglis Title: Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures Date: November 1st, 2024 at 1:30 pm Location: EER 0.806/0.808 or Zoom Link Abstract: The memory system has historically been a primary performance determinant for server-grade computers. The multi-faceted challenges it poses are commonly referred to as the “memory wall”, reflecting rigid capacity, bandwidth, and cost constraints. Current technological trends motivate a memory architecture rethink by leveraging serial interfaces, opening opportunities to overcome current limitations. Specifically, these opportunities are embodied by the emerging Compute Express Link (CXL) technology, which is garnering widespread adoption in the industry. CXL is well-positioned to revolutionize the way server systems are built and deployed, as it enables new capabilities in memory system design. CXL-centric or CXL-augmented memory systems bear characteristics that cater well to the growing demands of modern workloads. This talk will focus on two new CXL-centric memory systems for server architectures. First, we will see how a CXL-only memory system can drastically benefit modern manycore CPUs handling bandwidth-intensive workloads, despite the CXL interface’s seemingly prohibitive latency premium. Second, we will study how CXL’s memory pooling capability can be leveraged to accelerate workloads with little data locality on large-scale multi-socket NUMA systems. Both architectural approaches promise performance gains of up to 3x for their respective workload domains. Bio: Alexandros (Alex) Daglis is an Assistant Professor of Computer Science at Georgia Tech. Daglis’ research interests lie in computer architecture, with a focus on datacenter architectures, network-compute co-design, and memory systems. His research has been supported by the NSF, IARPA, Speculative Technologies, Samsung, and Intel Corporation, and routinely appears at top-tier computer architecture venues such as ISCA, MICRO, ASPLOS, and HPCA. Daglis is a recipient of the NSF CAREER award, a Google Faculty Research Award, and a Georgia Tech Junior Faculty Teaching Award, and his PhD thesis (EPFL, 2018) was recognized with an ACM SIGARCH/IEEE CS TCCA Outstanding Dissertation Honorable Mention.
[Series 05] FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling Speaker: Dinesh Gaitonde, Abhishek Kumar Jain Title: FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling Date: October 29th, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Reconfigurable devices, including AIE CGRA, FPGA fabric, HBM stacks, a system-wide NoC, and an ARM processing sub-system, offer diverse design options due to their heterogeneous nature. Design tools such as Vitis use a push-button approach, where application RTL is generated via HLS and then undergoes synthesis, placement, and routing in Vivado. This method often yields sub-optimal results (PPA) because high-level design semantics, such as processing element structure, composition, memory hierarchy, and interconnect, are lost during implementation. Therefore, the challenge lies in designing accelerators on FPGAs that fully use the FPGA resources while preserving designer productivity and leveraging design and device characteristics. This presentation will focus on a few projects in our team (the AMD FPGA architecture group) that address problems in diverse domains by exploiting the semantics of the problem being mapped and the specifics of the architecture to which it is mapped. Existing SpMV accelerators on HBM-enabled FPGA platforms do not scale well, resulting in underutilization of the HBM bandwidth. Poor scaling of existing accelerator designs prevents us from using the entire HBM bandwidth, and physically unaware system design prevents us from achieving a high frequency of operation. To address these issues, we propose a modular and lean SpMV dataflow accelerator and then implement it on FPGA fabric using our “Atoms” methodology. This is the first work that can use the entire bandwidth of commercial HBM-enabled FPGAs and surpass all reported frequencies while doing so. The “Atoms” methodology relies on exploiting design semantics to generate efficient floorplans for the design. We decompose the design into smaller building blocks communicating over latency-insensitive elastic channels. To navigate the heterogeneous canvas that modern FPGAs present, we add the required number of elastic buffers so that communication never becomes the frequency limiter. We expect this pattern to apply to a wide variety of other domains as well. The second project focuses on streaming neural networks, specifically the FINN framework. FINN takes a high-level description of a network, generates RTL, and then performs the FPGA implementation. Depending on the resource budget, FINN can generate a range of designs with varying throughput and resource requirements. One of the key building blocks in a FINN-generated network is the streaming Matrix Vector multiplication Unit (MVU). We propose to design the MVU in a structured way so that we can extract maximum performance out of the device resources. DSP blocks can achieve close to 1 GHz on the latest Versal FPGAs, and we aim to generate MVUs that also operate close to this limit. We plan to create an overlay MVU (with high fmax) which is instruction-programmable but does not exhibit the overheads associated with usual overlays. All the blocks in our overlay MVU are intended to be highly customized for FINN, specifically the DSP-based dot-engine ALU, the register files for activations and weights, and the instruction memory.
Our approach relies on elastic communication between building blocks so that we can insert pipeline stages even after blocks are placed on the FPGA fabric. The third project is about packet processing using an FPGA networking overlay, also referred to as the Packet Processing Engine (PPE). The PPE is instruction-programmable, and AMD’s compiler can compile networking workloads (expressed in eBPF) onto the PPE. We are currently exploring ways to customize the overlay once we have compiled a networking workload on top of it. Our hope is to generate workload-specific PPEs mapped onto FPGA fabric so that we do not have to pay the “overlay tax”. Finally, we present how some aspects of the problems faced by verification customers are amenable to structured implementation. We demonstrate how the ideas discussed so far help us significantly improve performance and lower the resources used for those workloads. Over the long term, we expect to develop a set of domain-specific optimized implementation flows that exploit a handful of basic concepts. Since the entire flow (including implementation) is aware of how the physical architecture of the FPGA interacts with the needs of the specific design, we expect the proposed flow to result in higher-performance implementations than simply handing off a design at RTL after synthesis from some general-purpose HLS engine. Bio: Dinesh Gaitonde received his Bachelor’s and Master’s degrees from IIT Bombay and his PhD in Electrical Engineering from CMU. He is currently a Senior Fellow at AMD (Xilinx) focusing on FPGA architectures, applications, and implementation algorithms. Prior to AMD, he worked at Motorola and Synopsys as an EDA researcher. His interests include FPGAs and other reconfigurable fabrics, high-performance computing on reconfigurable fabrics, and EDA for FPGAs. Abhishek Kumar Jain received his PhD degree in computer engineering from Nanyang Technological University, Singapore, in 2016. After that, he was a postdoc at Lawrence Livermore National Laboratory. Since 2018, he has been an architect with Xilinx USA. His research interests include computer architecture, FPGAs, high-performance accelerators, and domain-specific FPGA overlays.
[Series 04] Characterization of network proxies in micro-service orchestration Speaker: Prateek Sahu Title: Characterization of network proxies in micro-service orchestration Date: October 8th, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Network proxies, aka sidecars, are used by organizations to manage and run hundreds of cloud microservices in a consistent manner. Since sidecars interpose on network traffic to provide telemetry and security features, they can degrade critical service-level metrics such as latency and throughput. However, the precise impact of sidecars on such key metrics is unclear. We introduce SCoPE to quantify the service-layer overheads as well as the micro-architectural implications of using sidecars in service meshes, and to characterize these overheads across a range of sidecar configurations. SCoPE demonstrates that sidecars can degrade latency and throughput by up to 150% and 35%, respectively, across common benchmark applications. We find that the absolute overheads of the sidecars are independent of the microservices being proxied and depend instead on the proxy configuration and the microservice topology. Our micro-architectural analysis of sidecars indicates no discernible reuse in the instruction caches (i.e., high misses per kilo instructions, MPKI) despite high-frequency reuse of the sidecars themselves. We note that increasing the private caches from 256KB to 1.25MB across processor generations yields only a 10% improvement in the processor frontend; this is due to high indirect-branch misses and thrashing from more aggressive prefetchers and predictors, which degrade the L1-I cache MPKI by up to 40%. Our analysis also finds that utilizing a few large pages can reduce iTLB misses and page walks by 80% at the cost of modest memory overheads. Bio: Prateek is a 5th-year PhD student in ACSES, working with Dr. Mohit Tiwari in the SPARK Research Lab. His interests include hardware and systems security. He is currently working on cross-stack system security and orchestration, while his prior work has included hardware side-channel attacks and defenses.
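A quick back-of-the-envelope on the large-page observation in the abstract above, using assumed numbers rather than anything from the talk: with an illustrative 24 MiB instruction footprint and a 64-entry iTLB, 4 KiB pages need thousands of translations while 2 MiB pages need only a dozen, which is why a few large pages can eliminate most iTLB misses and page walks.

```python
# Back-of-the-envelope page-count comparison (footprint and iTLB size are assumed).
KiB, MiB = 1024, 1024 * 1024

def pages_needed(footprint_bytes, page_size_bytes):
    return -(-footprint_bytes // page_size_bytes)  # ceiling division

footprint = 24 * MiB     # assumed instruction/code footprint of a proxy binary
itlb_entries = 64        # assumed L1 iTLB capacity for 4 KiB pages

small = pages_needed(footprint, 4 * KiB)
large = pages_needed(footprint, 2 * MiB)
print(f"4 KiB pages: {small} (>> {itlb_entries} iTLB entries -> frequent misses)")
print(f"2 MiB pages: {large} (easily cached -> far fewer iTLB misses/page walks)")
```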
[Series 03] Reliable Processing-in-Memory Speaker: Jeageun Jung Title: Reliable Processing-in-Memory Date: October 1st, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Processing-in-memory (PIM) architectures enhance performance by integrating compute units near memory, but they introduce reliability challenges. Bank-PIMs maximize performance by placing compute units near memory banks but limit error-checking and correcting (ECC) to local domains, making it insufficient to handle faults and scaling-induced errors. Bio: Jeageun Jung’s research addresses this reliability gap by developing a PIM-specific ECC scheme tuned for the fault and error patterns expected in near-bank PIMs. To do this, he is also developing a new DRAM physical fault model, based on empirical data, that accurately predicts fault behavior across memory types.
[Series 08] Enabling Efficient Memory Systems using Novel Compression Methods Title: Enabling Efficient Memory Systems using Novel Compression Methods Speaker: Per Stenström, Chalmers University of Technology / ZeroPoint Technologies, Gothenburg, Sweden Date: Nov 7th, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Using data compression methods in the memory hierarchy can improve the efficiency of memory systems by enabling higher effective cache capacity, more effective use of available memory bandwidth, and higher effective main memory capacity. This can lead to substantially higher performance and lower power consumption. However, realizing these benefits requires highly effective compression algorithms that can be implemented with low latency and high throughput. Research at Chalmers University of Technology and at ZeroPoint Technologies, a fabless startup company, has yielded many new families of compression methods that are now being commercially deployed. This talk will present the major insights of more than a decade of research on compression methods for the memory hierarchy. The talk covers value-aware caches and statistical compression of cache content, compression algorithms that are tuned to the data at hand through data analysis using new clustering algorithms to allow for substantially higher memory bandwidth, and compression infrastructures that expand the capacity of main memory. Bio: Per Stenström is a professor at Chalmers University of Technology. His research interests are in parallel computer architecture. He has authored or co-authored four textbooks, about 200 publications, and twenty patents in this area. He has been program chairman of several top-tier IEEE and ACM conferences, including the IEEE/ACM International Symposium on Computer Architecture, and serves as an Associate Editor of ACM TACO and a Topical Editor of IEEE Transactions on Computers. He is a Fellow of the ACM and the IEEE and a member of Academia Europaea and the Royal Swedish Academy of Engineering Sciences.
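To give a generic flavor of what value-aware or statistical compression of cache content can buy (a toy sketch under my own assumptions, not the Chalmers or ZeroPoint algorithms), the snippet below encodes frequent values in a cache line with a short code and stores rare values verbatim behind an escape flag, then reports the resulting compression ratio.

```python
# Toy frequent-value compression of one cache line (illustrative parameters).
from collections import Counter

def compressed_size_bits(words, n_frequent=1, code_bits=2, word_bits=64):
    """Estimate the compressed size of one cache line (a list of word values)."""
    frequent = {v for v, _ in Counter(words).most_common(n_frequent)}
    bits = 0
    for w in words:
        if w in frequent:
            bits += 1 + code_bits   # escape flag + short code for a frequent value
        else:
            bits += 1 + word_bits   # escape flag + literal (uncompressed) value
    return bits

# A 64-byte line (eight 64-bit words) holding mostly zeros and one pointer-like value.
line = [0, 0, 0, 0, 0x7FFFA000, 0, 0, 0]
original = len(line) * 64
compressed = compressed_size_bits(line)
print(f"{original} bits -> {compressed} bits ({original / compressed:.1f}x)")
```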
[Series 02] Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU Title: Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU Speaker: Kristof Denolf & Joseph Melber Date: September 17, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Specialized hardware accelerators are abundantly available today, including NPUs found in consumer laptops with AMD Ryzen™ AI CPUs. The NPU of AMD Ryzen™ AI devices includes an AI Engine array comprising a set of VLIW vector processors, data movement accelerators (DMAs), and adaptable interconnect. Convenient software tool flows for programming these devices enable enthusiasts to productively harness the full capabilities of these powerful NPUs. IRON is a close-to-metal open-source toolkit enabling performance engineers to build fast and efficient, often specialized, designs through a set of Python language bindings around the mlir-aie dialect. The presentation will provide insights into the AI Engine compute and data movement capabilities supported in our tool flow. The speakers will demonstrate performance optimizations of increasingly complex designs by leveraging the unique architectural features of AI Engines. Bio: Kristof Denolf is a Fellow in AMD’s Research and Advanced Development group where he is working on energy-efficient computer vision and video processing applications to shape future AMD devices. He earned an M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, an M.Sc. in electronic system design from Leeds Beckett University (2000), and a Ph.D. from the Technical University Eindhoven (2007). He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx, and AMD. His main research interests are all aspects of the cost-efficient and dataflow-oriented design of video, vision, and graphics systems. Joseph Melber is a Senior Member of Technical Staff in AMD’s Research and Advanced Development group. At AMD, he is working on hardware architectures and compiler technologies for current and future AMD devices. He received a BS in electrical engineering from the University at Buffalo, as well as MS and PhD degrees from the electrical and computer engineering department at Carnegie Mellon University. His research interests include runtime systems, compiler abstractions for data movement, and hardware prototypes for future adaptive heterogeneous computing architectures.
[Series 01a] Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance Title: Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance Speaker: Ataberk Olgun Date: September 10, 2024 at 3:30 pm Location: EER 3.650 or Zoom Link Abstract: DRAM chips are increasingly vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing DRAM rows causes bitflips in nearby rows due to DRAM density scaling. Even though many prior works develop various RowHammer solutions, these solutions incur non-negligible and increasingly higher system performance, energy, and hardware area overheads as the RowHammer vulnerability worsens. In this talk, we will present our recent works on 1) understanding DRAM read disturbance in modern high bandwidth memory (HBM) chips, along with the open-source infrastructure that enables experimental studies on state-of-the-art DRAM chips, and 2) performance-, energy-, and area-efficient system-level solutions to read disturbance. First, we describe the results of a detailed experimental analysis of read disturbance in six real HBM2 chips. We show that (1) the read disturbance vulnerability significantly varies between different HBM2 chips and between different components (e.g., 3D-stacked channels) inside a chip, (2) DRAM rows at the end and in the middle of a bank are more resilient to read disturbance, (3) fewer additional activations are sufficient to induce more read disturbance bitflips in a DRAM row if the row exhibits its first bitflip at a relatively high activation count, and (4) a modern HBM2 chip implements undocumented read disturbance defenses that track potential aggressor rows based on how many times they are activated. We also briefly describe the infrastructure that enabled the discoveries we made in our study on read disturbance in high bandwidth memory chips, along with those made in multiple recent works that investigate read disturbance in real DRAM chips (e.g., RowPress). Second, we introduce ABACuS, a new low-cost hardware-counter-based RowHammer mitigation technique that performance-, energy-, and area-efficiently scales with worsening RowHammer vulnerability. ABACuS’s key idea is to use a single shared row activation counter to track activations to the rows with the same row address in all DRAM banks (a simplified sketch of this idea appears after this entry). Unlike state-of-the-art RowHammer mitigation mechanisms that implement a separate row activation counter for each DRAM bank, ABACuS implements fewer counters (e.g., only one) to track an equal number of aggressor rows. At very low RowHammer thresholds (where only 125 activations cause a bitflip), ABACuS induces small system performance and DRAM energy overheads, and it outperforms, and takes up less chip area than, the state-of-the-art mitigation techniques (Hydra and Graphene). All data, sources, and paper PDFs for the described works are freely and openly available. – HBM Read Disturbance: https://github.com/CMU-SAFARI/HBM-Read-Disturbance, Paper PDF: https://arxiv.org/pdf/2310.14665 – DRAM Bender: https://github.com/CMU-SAFARI/DRAM-Bender, Paper PDF: https://arxiv.org/pdf/2211.05838 – ABACuS sources: https://github.com/CMU-SAFARI/ABACuS, Paper PDF: https://arxiv.org/pdf/2310.09977 Bio: Ataberk Olgun is a 3rd-year PhD student at ETH Zurich. His broad research interests include designing secure, high-performance, and energy-efficient DRAM architectures.
Especially with worsening RowHammer vulnerability, it is increasingly difficult to design new DRAM architectures that satisfy all three characteristics. His current research focuses on i) deeply understanding and ii) efficiently mitigating the RowHammer vulnerability in modern systems.
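A minimal sketch of the counter-sharing idea stated in the ABACuS abstract above: one activation counter per row address, shared across all banks, instead of one per (bank, row) pair. The threshold value, the neighbor-refresh response, and the conservative summing of activations across banks are simplifications for illustration, not the paper's exact mechanism.

```python
# Simplified illustration of a shared row-activation counter (not the exact
# ABACuS mechanism): track activations per row *address* across all banks.
from collections import defaultdict

NUM_BANKS = 16
ROWHAMMER_THRESHOLD = 125   # illustrative: the low threshold quoted in the abstract

class SharedRowActivationCounter:
    def __init__(self):
        self.counts = defaultdict(int)   # row address -> shared activation count

    def activate(self, bank, row_addr):
        # Conservatively count every activation of this row address in any bank.
        self.counts[row_addr] += 1
        if self.counts[row_addr] >= ROWHAMMER_THRESHOLD:
            self.preventive_refresh(row_addr)
            self.counts[row_addr] = 0

    def preventive_refresh(self, row_addr):
        # The shared counter cannot tell which bank crossed the threshold, so
        # refresh the potential victim rows in every bank.
        print(f"refresh rows {row_addr - 1} and {row_addr + 1} in all {NUM_BANKS} banks")

counter = SharedRowActivationCounter()
for i in range(ROWHAMMER_THRESHOLD):
    counter.activate(bank=i % NUM_BANKS, row_addr=0x1A2)
```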
Past Semesters: 2024 Spring, 2022 Fall, 2020 Spring, 2019 Spring, 2019 Fall, 2018 Fall, 2017 Spring, 2017 Fall, 2016 Spring, 2016 Fall, 2015 Spring, 2015 Fall, 2014 Spring