April 10, 2025, Filed Under: 2025 Spring Semester, Current Semester

[Series 03] Enabling Ahead Prediction with Practical Energy Constraints

Title: Enabling Ahead Prediction with Practical Energy Constraints
Speaker: Lingzhe Chester Cai, PhD Student, UT ECE
Date: Tuesday April 15th, 2025, 3:30pm
Location: EER 1.518 or Zoom Link

Abstract: Decades of branch prediction research have produced complex prediction algorithms and large lookup tables, resulting in multi-cycle prediction latencies that adversely impact performance. Ahead prediction is a proposed solution to the predictor latency problem, but it drastically increases prediction energy, as exponentially more entries are read out for each branch skipped, making such a predictor impractical to build. In this talk, I will show that only a few missing-history patterns are observed at runtime. Using this insight, we present a new approach for building ahead predictors that does not require reading exponentially more entries at large ahead distances. Our ahead predictor provides a 4.4% performance improvement while increasing power by only 1.5x, as opposed to prior designs that incur a 14.6x energy overhead. By hiding the predictor latency from the rest of the pipeline, our work allows for larger and more complex predictors and better pipeline-width scaling. In addition, our work implies that the direction of an easy-to-predict branch does not need to be pushed into the history, presenting opportunities for future branch predictor design.

Bio: Chester Cai is a 7th-year PhD student studying CPU microarchitecture under Professor Yale Patt. His research focuses on the CPU frontend, specifically branch prediction, balancing predictor accuracy, latency, and throughput. Before joining UT Austin, he obtained his bachelor's degree in Computer Engineering from Rose-Hulman Institute of Technology.
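The energy argument in the abstract can be made concrete with a small Python sketch. With a d-branch-ahead lookup, the last d branch outcomes are still unresolved when the lookup launches, so a naive design reads all 2^d candidate entries and late-selects among them. The sketch below (toy PCs and a made-up loop trace, purely illustrative) counts how many of those missing-history patterns actually occur per branch:

```python
from collections import defaultdict

AHEAD = 3                   # lookup launched 3 branches early
NAIVE_READS = 2 ** AHEAD    # entries a naive ahead predictor must read per lookup

# Toy deterministic trace (hypothetical PCs): an inner loop branch taken
# 7 times then falling through, under an always-taken outer back-edge.
trace = []
for _ in range(100):
    trace += [(0x40, i < 7) for i in range(8)]   # outcomes: T T T T T T T N
    trace.append((0x80, True))                   # outer loop back-edge

history = []                # global branch outcome history
seen = defaultdict(set)     # pc -> missing-history patterns actually observed
for pc, taken in trace:
    if len(history) >= AHEAD:
        # These AHEAD outcomes are still unresolved when the ahead
        # lookup for this branch is launched.
        seen[pc].add(tuple(history[-AHEAD:]))
    history.append(taken)

for pc, pats in sorted(seen.items()):
    print(f"pc={pc:#x}: {len(pats)} of {NAIVE_READS} possible patterns occur")
```

Even in this toy trace, each branch sees only 1-3 of the 8 possible missing-history patterns, which is the flavor of observation that lets an ahead predictor avoid reading exponentially many entries.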
April 4, 2025, Filed Under: 2025 Spring Semester, Current Semester

[Series 02] RESCQ: Realtime Scheduling for Continuous Angle Quantum Error Correction Architectures

Title: RESCQ: Realtime Scheduling for Continuous Angle Quantum Error Correction Architectures
Speaker: Sayam Sethi, PhD Student, UT ECE
Date: Tuesday April 8th, 2025, 3:30pm
Location: EER 1.518 or Zoom Link

Abstract: To realize large-scale quantum error correction (QEC), resource states, such as |T〉, must be prepared, which is expensive in both space and time. To circumvent this problem, alternatives have been proposed, such as the production of continuous angle rotation states. However, the production of these states is non-deterministic and may require multiple repetitions to succeed. The original proposals suggest architectures that do not account for realtime (or dynamic) management of resources to minimize total execution time. Without a realtime scheduler, a statically generated schedule is unnecessarily expensive. We propose RESCQ (pronounced "rescue"), a realtime scheduler for programs compiled onto these continuous angle systems. Our scheme actively minimizes total cycle count by redistributing resources on demand based on expected production rates. Depending on the underlying hardware, this can cause excessive classical control overhead, which we further address by dynamically selecting the frequency of our recomputation. RESCQ improves over baseline proposals by an average of 2x in cycle count.

Bio: Sayam Sethi is a PhD student in the ECE Department at The University of Texas at Austin, advised by Dr. Jonathan Baker. He is currently interested in architectural design for realizing Fault-Tolerant Quantum Computers (FTQC), with a specific focus on scheduling realtime operations and minimizing program runtime. Before joining UT, he obtained his B.Tech. in Computer Science and Engineering from IIT Delhi.
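The redistribution idea in the abstract can be sketched with a minimal greedy allocator (this is not RESCQ's actual algorithm; region names, demands, and the production rate below are all hypothetical): repeatedly give the next state factory to the program region whose expected completion time is currently worst.

```python
# Minimal sketch of on-demand resource redistribution by expected
# production rate. All names and numbers are hypothetical.
demand = {"qA": 40, "qB": 10, "qC": 25}   # outstanding rotation states per region
factories = 5                              # factories available this epoch
p = 0.5                                    # expected states per factory per cycle

alloc = {q: 0 for q in demand}             # factories assigned to each region

def expected_finish(q):
    # Expected cycles to satisfy region q's demand with its current factories;
    # a region with no factories never finishes.
    return demand[q] / (alloc[q] * p) if alloc[q] else float("inf")

for _ in range(factories):
    # Greedy step: assign the next factory to the worst-off region.
    alloc[max(demand, key=expected_finish)] += 1

makespan = max(expected_finish(q) for q in demand)
print(alloc, makespan)   # final allocation and expected total cycle count
```

A realtime scheduler would rerun a step like this as actual (non-deterministic) production outcomes arrive; the talk's point is that recomputing too often costs classical control overhead, so the recomputation frequency itself must be chosen dynamically.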
January 18, 2025, Filed Under: 2025 Spring Semester, Current Semester

Welcome to CompArch 2025 Spring

UT Austin Computer Architecture Seminar Series 2025 Spring

Sponsored by:

Date | Series | Topic | Speaker
January 24, 2025 | Series 01 | Securing Computer Systems using AI Methods and for AI Applications | Mulong Luo
April 8, 2025 | Series 02 | RESCQ: Realtime Scheduling for Continuous Angle Quantum Error Correction Architectures | Sayam Sethi
April 15, 2025 | Series 03 | Enabling Ahead Prediction with Practical Energy Constraints | Lingzhe Chester Cai
January 18, 2025, Filed Under: 2025 Spring Semester, Current Semester

[Series 01] Securing Computer Systems using AI Methods and for AI Applications

Title: Securing Computer Systems using AI Methods and for AI Applications
Speaker: Mulong Luo, Postdoctoral Researcher, UT ECE
Date: Friday January 24, 2025, 3:30pm
Location: EER 0.806/0.808 or Zoom Link

Abstract: Securing modern computer systems against an ever-evolving threat landscape is a significant challenge that requires innovative approaches. Recent developments in artificial intelligence (AI), such as large language models (LLMs) and reinforcement learning (RL), have achieved unprecedented success in everyday applications. However, AI serves as a double-edged sword for computer systems security. On one hand, the superhuman capabilities of AI enable the exploration and detection of vulnerabilities without the need for human experts. On the other hand, specialized systems required to implement new AI applications introduce novel security vulnerabilities. In this talk, I will first present my work on applying AI methods to system security. Specifically, I leverage reinforcement learning to explore microarchitecture attacks in modern processors. Additionally, I will discuss the use of multi-agent reinforcement learning to improve the accuracy of detectors against adaptive attackers. Next, I will highlight my research on the security of AI systems, focusing on retrieval-augmented generation (RAG)-based LLMs and autonomous vehicles. For RAG-based LLMs, my ConfusedPilot work demonstrates how an attacker can compromise confidentiality and integrity guarantees by sharing a maliciously crafted document. For autonomous vehicles, I reveal a software-based cache side-channel attack capable of leaking the physical location of a vehicle without detection. Finally, I will outline future directions for building secure systems using AI methods and ensuring the security of AI systems.
Bio: Mulong Luo is currently a postdoctoral researcher at the University of Texas at Austin, hosted by Mohit Tiwari. His research interests lie broadly in applying AI methods to computer architecture and system security, as well as improving the security of AI systems, including LLMs and autonomous vehicles. He was selected as a 2023 CPS Rising Star, his paper was a finalist in Top Picks in Hardware and Embedded Security 2022, and he received the Best Paper Award at CPS-SPC 2018. Mulong received his Ph.D. from Cornell University in 2023, advised by Edward Suh, and his M.S. and B.S. from UCSD and Peking University, respectively.
November 5, 2024, Filed Under: 2024 Fall Semester, Current Semester

Welcome to CompArch 2024 Fall

UT Austin Computer Architecture Seminar Series 2024 Fall

Sponsored by:

Date | Series | Topic | Speaker
September 10, 2024 | Series 01a | Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance | Ataberk Olgun, ETH Zürich
September 11, 2024 | Series 01b | Enabling the Adoption of Data-Centric Systems: Hardware/Software Support for Processing-Using-Memory Architectures | Geraldo F. Oliveira, ETH Zürich
September 17, 2024 | Series 02 | Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU | Kristof Denolf & Joseph Melber
October 1, 2024 | Series 03 | Reliable Processing-in-Memory | Jeageun Jung
October 8, 2024 | Series 04 | Characterization of network proxies in micro-service orchestration | Prateek Sahu
October 29, 2024 | Series 05 | FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling | Dinesh Gaitonde & Abhishek Kumar Jain
November 1, 2024 | Series 06 | Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures | Alexandros Daglis
November 7, 2024 | Series 07 | Resource-efficient AI System Design | Ana Klimović
November 7, 2024 | Series 08 | Enabling Efficient Memory Systems using Novel Compression Methods | Per Stenström
October 30, 2024, Filed Under: 2024 Fall Semester, Current Semester

[Series 06] Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures

Speaker: Dr. Alexandros Daglis
Title: Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures
Date: November 1st, 2024 at 1:30 pm
Location: EER 0.806/0.808 or Zoom Link

Abstract: The memory system has historically been a primary performance determinant for server-grade computers. The multi-faceted challenges it poses are commonly referred to as the “memory wall,” reflecting rigid capacity, bandwidth, and cost constraints. Current technological trends motivate a rethink of memory architecture around serial interfaces, opening opportunities to overcome current limitations. These opportunities are embodied by the emerging Compute Express Link (CXL) technology, which is garnering widespread adoption in the industry. CXL is well positioned to revolutionize the way server systems are built and deployed, as it enables new capabilities in memory system design, and CXL-centric or CXL-augmented memory systems bear characteristics that cater well to the growing demands of modern workloads. This talk will focus on two new CXL-centric memory systems for server architectures. First, we will see how a CXL-only memory system can drastically benefit modern manycore CPUs handling bandwidth-intensive workloads, despite the CXL interface’s seemingly prohibitive latency premium. Second, we will study how CXL’s memory pooling capability can be leveraged to accelerate workloads with little data locality on large-scale multi-socket NUMA systems. Both architectural approaches promise performance gains of up to 3x for their respective workload domains.

Bio: Alexandros (Alex) Daglis is an Assistant Professor of Computer Science at Georgia Tech. Daglis’ research interests lie in computer architecture, with specific interests in datacenter architectures, network-compute co-design, and memory systems.
His research has been supported by the NSF, IARPA, Speculative Technologies, Samsung, and Intel Corporation, and routinely appears at top-tier computer architecture venues such as ISCA, MICRO, ASPLOS, and HPCA. Daglis is a recipient of the NSF CAREER award, a Google Faculty Research Award, and a Georgia Tech Junior Faculty Teaching Award, and his PhD thesis (EPFL, 2018) was recognized with an ACM SIGARCH/IEEE CS TCCA Outstanding Dissertation Honorable Mention.
October 24, 2024, Filed Under: 2024 Fall Semester, Current Semester

[Series 05] FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling

Speaker: Dinesh Gaitonde, Abhishek Kumar Jain
Title: FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling
Date: October 29th, 2024 at 3:30 pm
Location: EER 3.646 or Zoom Link

Abstract: Reconfigurable devices, including AIE CGRAs, FPGA fabric, HBM stacks, a system-wide NoC, and an ARM processing sub-system, offer diverse design options due to their heterogeneous nature. Design tools such as Vitis use a push-button approach, where application RTL is generated via HLS and then undergoes synthesis, placement, and routing in Vivado. This method often yields sub-optimal results (PPA) because high-level design semantics, such as processing element structure and composition, memory hierarchy, and interconnect, are lost during implementation. The challenge, therefore, lies in designing accelerators that fully use FPGA resources while preserving designer productivity and leveraging design and device characteristics. This presentation will focus on a few projects in our team (the AMD FPGA architecture group) that address problems in diverse domains by exploiting the semantics of the problem being mapped and the specifics of the architecture to which it is mapped. Existing SpMV accelerators on HBM-enabled FPGA platforms do not scale well, resulting in underutilization of the HBM bandwidth, and physically unaware system design prevents a high frequency of operation. To address these issues, we propose a modular and lean SpMV dataflow accelerator and implement it on FPGA fabric using our “Atoms” methodology. This is the first work that can use the entire bandwidth of commercial HBM-enabled FPGAs, surpassing all reported frequencies while doing so.
The “Atoms” methodology relies on exploiting design semantics to generate efficient floorplans for the design. We decompose the design into smaller building blocks communicating over latency-insensitive elastic channels. To navigate the heterogeneous canvas that modern FPGAs present, we add the required number of elastic buffers so that communication never becomes the frequency limiter. We expect this pattern to apply to a wide variety of other domains as well. The second project focuses on streaming neural networks, specifically the FINN framework. FINN takes a high-level description of a network and then generates RTL, followed by FPGA implementation. Depending on the resource budget, FINN can generate a range of designs with varying throughput and resource requirements. One of the key building blocks in a FINN-generated network is the streaming Matrix-Vector multiplication Unit (MVU). We propose to design the MVU in a structured way so that we can extract maximum performance from the device resources. DSP blocks can achieve close to 1 GHz on the latest Versal FPGAs, and we aim to generate MVUs that also operate close to this limit. We plan to create an overlay MVU (with high fmax) that is instruction-programmable but does not exhibit the overheads associated with usual overlays. All the blocks in our overlay MVU are highly customized for FINN, specifically the DSP-based dot-engine ALU, the register files for activations and weights, and the instruction memory. Our approach relies on elastic communication between building blocks so that we can insert pipeline stages even after blocks are placed on the FPGA fabric. The third project is about packet processing using an FPGA networking overlay, also referred to as the Packet Processing Engine (PPE). The PPE is instruction-programmable, and AMD’s compiler can compile networking workloads (expressed in eBPF) onto the PPE.
We are currently exploring ways to customize the overlay once a networking workload has been compiled on top of it. Our hope is to generate workload-specific PPEs mapped onto FPGA fabric so that we do not have to pay the “overlay tax.” Finally, we present how some aspects of the problems faced by verification customers are amenable to structured implementation, and demonstrate how the ideas discussed so far help us significantly improve performance and lower the resources used for those workloads. Over the long term, we expect to develop a set of domain-specific optimized implementation flows that exploit a handful of basic concepts. Since the entire flow (including implementation) is aware of how the physical architecture of the FPGA interacts with the specific design needs, we expect the proposed flow to result in higher-performance implementations than simply handing off a design at RTL after synthesis from a general-purpose HLS engine.

Bio: Dinesh Gaitonde received his Bachelor’s and Master’s from IIT Bombay and his PhD from CMU, all in Electrical Engineering. He is currently a Senior Fellow at AMD (Xilinx), focusing on FPGA architectures, applications, and implementation algorithms. Before AMD, he worked at Motorola and Synopsys as an EDA researcher. His interests include FPGAs and other reconfigurable fabrics, high-performance computing on reconfigurable fabrics, and EDA for FPGAs. Abhishek Kumar Jain received his PhD in computer engineering from Nanyang Technological University, Singapore, in 2016, after which he was a postdoc at Lawrence Livermore National Laboratory. Since 2018, he has been an architect at Xilinx USA. His research interests include computer architecture, FPGAs, high-performance accelerators, and domain-specific FPGA overlays.
October 7, 2024, Filed Under: 2024 Fall Semester, Current Semester

[Series 04] Characterization of network proxies in micro-service orchestration

Speaker: Prateek Sahu
Title: Characterization of network proxies in micro-service orchestration
Date: October 8th, 2024 at 3:30 pm
Location: EER 3.646 or Zoom Link

Abstract: Network proxies, aka sidecars, are used by organizations to manage and run hundreds of cloud microservices in a consistent manner. Since sidecars interpose on network traffic to provide telemetry and security features, they can degrade critical service-level metrics such as latency and throughput. However, the precise impact of sidecars on such key metrics is unclear. We introduce SCoPE to quantify the service-layer overheads as well as the micro-architectural implications of using sidecars in service meshes, and characterize these overheads across a range of sidecar configurations. SCoPE demonstrates that sidecars can degrade latency and throughput by up to 150% and 35%, respectively, across common benchmark applications. We find that the absolute overheads of the sidecars are independent of the microservices being proxied, depending instead on the proxy configuration and the microservice topology. Our micro-architectural analysis of sidecars indicates no discernible reuse of the instruction caches (i.e., high misses per kilo-instructions, MPKI) despite high-frequency reuse of sidecars. We note that increasing the private caches from 256KB to 1.25MB across processor generations yields only a 10% improvement in the processor frontend; this is due to high indirect-branch misses and thrashing from more aggressive prefetchers and predictors that degrade the L1-I cache MPKI by up to 40%. Our analysis also finds that utilizing a few large pages can reduce iTLB misses and page walks by 80% at the cost of modest memory overheads.

Bio: Prateek is a 5th-year PhD student in ACSES, working with Dr. Mohit Tiwari in the SPARK Research Lab.
His interests include hardware and systems security. He is currently working on cross-stack system security and orchestration, while his prior work has included hardware side-channel attacks and defenses.
October 1, 2024, Filed Under: 2024 Fall Semester, Current Semester

[Series 03] Reliable Processing-in-Memory

Speaker: Jeageun Jung
Title: Reliable Processing-in-Memory
Date: October 1st, 2024 at 3:30 pm
Location: EER 3.646 or Zoom Link

Abstract: Processing-in-memory (PIM) architectures enhance performance by integrating compute units near memory but introduce reliability challenges. Bank-PIMs maximize performance by placing compute units near memory banks but limit error checking and correcting (ECC) to local domains, making it insufficient to handle faults and scaling-induced errors.

Bio: Jeageun Jung’s research addresses this reliability gap by developing a PIM-specific ECC scheme tuned to the fault and error patterns expected in near-bank PIMs. To do this, he also develops a new DRAM physical fault model, based on empirical data, that accurately predicts fault behavior across memory types.
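As generic background for the ECC discussion above (this is the textbook building block, not the PIM-specific scheme the talk proposes), a single-error-correcting Hamming(7,4) code can be sketched in a few lines of Python: three parity bits protect four data bits, and the syndrome directly names the flipped position.

```python
# Illustrative Hamming(7,4) single-error correction; not the talk's scheme.
def encode(d):
    # d: four data bits [d1, d2, d3, d4]; parity bits sit at positions 1, 2, 4.
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4           # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4           # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4           # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def correct(c):
    # c: 7-bit codeword with at most one flipped bit.
    s = 0
    for i, bit in enumerate(c, start=1):
        if bit:
            s ^= i              # syndrome = XOR of set-bit positions
    if s:                       # nonzero syndrome points at the flipped bit
        c[s - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]   # recover the four data bits

word = [1, 0, 1, 1]
cw = encode(word)
cw[4] ^= 1                      # inject a single-bit fault
assert correct(cw) == word      # fault corrected, data recovered
```

A bank-local ECC domain, as in the abstract, applies a code like this independently per bank; the talk's point is that such local protection alone is insufficient for the fault patterns near-bank PIMs actually exhibit.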