UT Austin Computer Architecture

UT Austin Computer Architecture Seminar Series

Welcome to the Computer Architecture Seminar Series at The University of Texas at Austin. Our seminar talks are free and open to everyone, whether affiliated with UT or not. No prior registration is needed; feel free to just show up!

Location: EER, near the northeast corner of 24th and Speedway
(Typically) Tuesdays from 3:30 pm to 5:00 pm. Check the talk details to confirm, as the day, time, and location can vary depending on the speaker.

Please see below for this semester’s schedule.

The Information page includes more details, including how to join our mailing list.

Current Schedule

Welcome to CompArch 2025 Spring

UT Austin Computer Architecture Seminar Series 2025 Spring

UT Austin Computer Architecture Seminar Series 2024 Fall

Date	Series	Topic	Speaker
September 10, 2024	Series 01a	Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance	Ataberk Olgun, ETH Zürich
September 11, 2024	Series 01b	Enabling the Adoption of Data-Centric Systems: Hardware/Software Support for Processing-Using-Memory Architectures	Geraldo F. Oliveira, ETH Zürich
September 17, 2024	Series 02	Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU	Kristof Denolf & Joseph Melber
October 1, 2024	Series 03	Reliable Processing-in-Memory	Jeageun Jung
October 8, 2024	Series 04	Characterization of network proxies in micro-service orchestration	Prateek Sahu
October 29, 2024	Series 05	FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling	Dinesh Gaitonde & Abhishek Kumar Jain
November 1, 2024	Series 06	Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures	Alexandros Daglis
November 7, 2024	Series 07	Resource-efficient AI System Design	Ana Klimović
November 7, 2024	Series 08	Enabling Efficient Memory Systems using Novel Compression Methods	Per Stenström

[Series 06] Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures

Speaker: Dr. Alexandros Daglis

Title: Leveraging Serial Interfaces to Scale the Memory Wall in Server Architectures

Date: November 1st, 2024 at 1:30 pm

Location: EER 0.806/0.808 or Zoom Link

Abstract:

The memory system has historically been a primary performance determinant for server-grade computers. The multi-faceted challenges it poses, namely rigid capacity, bandwidth, and cost constraints, are commonly referred to as the “memory wall”. Current technological trends motivate a memory architecture rethink by leveraging serial interfaces, opening opportunities to overcome current limitations. Specifically, these opportunities are embodied by the emerging Compute Express Link (CXL) technology, which is garnering widespread adoption in the industry. CXL is well-positioned to revolutionize the way server systems are built and deployed, as it enables new capabilities in memory system design. CXL-centric or CXL-augmented memory systems bear characteristics that cater well to the growing demands of modern workloads. This talk will focus on two new CXL-centric memory systems for server architectures. First, we will see how a CXL-only memory system can drastically benefit modern manycore CPUs handling bandwidth-intensive workloads, despite the CXL interface’s seemingly prohibitive latency premium. Second, we will study how CXL’s memory pooling capability can be leveraged to accelerate workloads with little data locality on large-scale multi-socket NUMA systems. Both architectural approaches promise performance gains of up to 3x for their respective workload domain.

Bio:

Alexandros (Alex) Daglis is an Assistant Professor of Computer Science at Georgia Tech. Daglis’ research interests lie in computer architecture, with specific interests in datacenter architectures, network-compute co-design, and memory systems. His research has been supported by the NSF, IARPA, Speculative Technologies, Samsung, and Intel Corporation, and routinely appears at top-tier computer architecture venues such as ISCA, MICRO, ASPLOS, and HPCA. Daglis is a recipient of the NSF CAREER award, a Google Faculty Research Award, and a Georgia Tech Junior Faculty Teaching Award, and his PhD thesis (EPFL, 2018) was recognized with an ACM SIGARCH/IEEE CS TCCA Outstanding Dissertation Honorable Mention.

[Series 05] FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling

Speaker: Dinesh Gaitonde, Abhishek Kumar Jain

Title: FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling

Date: October 29th, 2024 at 3:30 pm

Location: EER 3.646 or Zoom Link

Abstract:

Reconfigurable devices, including AIE CGRA, FPGA fabric, HBM stacks, System-wide NoC, and ARM processing sub-system, offer diverse design options due to their heterogeneous nature. Design tools such as Vitis use a push-button approach, where application RTL is generated via HLS and then undergoes synthesis, placement, and routing in Vivado. This method often yields sub-optimal results (PPA) because high-level design semantics, such as processing element structure, composition, memory hierarchy, and interconnect, are lost during implementation. Therefore, the challenge lies in designing accelerators that fully use the FPGA resources while preserving designer productivity and leveraging design and device characteristics.

This presentation will focus on a few projects in our team (AMD FPGA architecture group) that deal with problems in diverse domains by exploiting the semantics of the problem being mapped and the specifics of the architecture to which it is mapped.

Existing SpMV accelerators on HBM-enabled FPGA platforms do not scale well, which leaves the HBM bandwidth underutilized, and physically unaware system design prevents them from reaching a high operating frequency. To address these issues, we propose a modular and lean SpMV dataflow accelerator and then implement it on FPGA fabric using our “Atoms” methodology. This is the first work that can use the entire bandwidth of commercial HBM-enabled FPGAs and surpass all reported frequencies while doing so. The “Atoms” methodology relies on exploiting design semantics to generate efficient floorplans for the design. We decompose the design into smaller building blocks communicating over latency-insensitive elastic channels. To navigate the heterogeneous canvas that modern FPGAs present, we add the required number of elastic buffers so that communication never becomes the frequency limiter. We expect this pattern to apply to a wide variety of other domains as well.

The second project focuses on streaming neural networks, specifically the FINN framework. FINN takes a high-level description of a network and generates RTL, followed by FPGA implementation. Depending on the resource budget, FINN can generate a range of designs with varying throughput and resource requirements. One of the key building blocks in a FINN-generated network is the streaming Matrix Vector multiplication Unit (MVU). We propose to design the MVU in a structured way so that we can extract maximum performance out of the device resources. DSP blocks can achieve close to 1 GHz on the latest Versal FPGAs, and we are hoping to generate MVUs that also operate close to this limit. We plan to create an overlay MVU (with high fmax) that is instruction-programmable but does not exhibit the overheads associated with usual overlays. All the blocks in our overlay MVU are supposed to be highly customized for FINN, specifically the DSP-based dot-engine ALU, the register files for activations and weights, and the instruction memory. Our approach relies on elastic communication between building blocks so that we can insert pipeline stages even after blocks are placed on the FPGA fabric.

The third project is about packet processing using an FPGA networking overlay, also referred to as the Packet Processing Engine (PPE). The PPE is instruction-programmable, and AMD’s compiler can compile networking workloads (expressed in eBPF) onto it. We are currently exploring ways to customize the overlay once we have compiled a networking workload on top of it. Our hope is to generate workload-specific PPEs mapped on FPGA fabric so that we do not have to pay the “overlay tax”.

Finally, we present how some aspects of the problems faced by verification customers are amenable to structured implementation. We demonstrate how ideas similar to those discussed so far help us significantly improve performance and lower the resources used for those workloads.

Over the long term, we expect to develop a set of domain-specific optimized implementation flows that exploit a handful of basic concepts. Since the entire flow (including implementation) is aware of how the physical architecture of the FPGA interacts with the specific design needs, we expect the proposed flow to result in higher performance implementations compared to simply handing off a design at RTL after synthesis from some general-purpose HLS engine.

Bio:

Dinesh Gaitonde received his Bachelor’s and Master’s from IIT Bombay and his PhD in Electrical Engineering from CMU. He is currently a Senior Fellow at AMD (Xilinx) focusing on FPGA architectures, applications, and implementation algorithms. Before AMD, he worked at Motorola and Synopsys as an EDA researcher. His interests include FPGAs and other reconfigurable fabrics, high-performance computing on reconfigurable fabrics, and EDA for FPGAs.

Abhishek Kumar Jain received his PhD degree in computer engineering from Nanyang Technological University, Singapore, in 2016. After that, he was a postdoc at Lawrence Livermore National Laboratory. Since 2018, he has been an architect with Xilinx USA. His research interests include computer architecture, FPGAs, high-performance accelerators, and domain-specific FPGA overlays.

[Series 04] Characterization of network proxies in micro-service orchestration

Speaker: Prateek Sahu

Title: Characterization of network proxies in micro-service orchestration

Date: October 8th, 2024 at 3:30 pm

Location: EER 3.646 or Zoom Link

Abstract:

Network proxies, aka sidecars, are used by organizations to manage and run hundreds of cloud microservices in a consistent manner. Since sidecars interpose on network traffic to provide telemetry and security features, they can degrade critical service level metrics such as latency and throughput. However, the precise impact of sidecars on such key metrics is unclear. We introduce SCoPE to quantify service-layer overheads as well as the micro-architectural implications of using sidecars in service meshes – and characterize these overheads across a range of sidecar configurations. SCoPE demonstrates that sidecars can degrade latency and throughput by up to 150% and 35%, respectively, across common benchmark applications. We find that the absolute overheads of the sidecars are independent of the microservices being proxied and depend on the proxy configuration and the microservice topology. Our micro-architectural analysis of sidecars indicates no discernible reuse of the instruction caches (i.e., poor misses per kilo instructions/MPKI) despite high-frequency reuse of sidecars. We note that increasing the private caches from 256KB to 1.25MB across processor generations sees only a 10% improvement in the processor frontend – this is due to high indirect branch misses and thrashing from more aggressive prefetchers and predictors that degrade the L1-I cache MPKIs up to 40%. Our analysis also finds that utilizing a few large pages can reduce iTLB misses and page walks by 80% at the cost of modest memory overheads.
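For readers unfamiliar with the metric, the MPKI figure quoted in the abstract is simply cache misses normalized per thousand retired instructions. The sketch below is illustrative only: the function names and the counter values in the usage note are hypothetical and are not taken from SCoPE.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: misses normalized per 1000 instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000 / instructions


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (0.4 means a 40% increase)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before
```

For example, a hypothetical run with 5,000 L1-I misses over 1,000,000 instructions gives `mpki(5000, 1_000_000)`, an MPKI of 5.0; moving from an MPKI of 10 to 14 is a 40% degradation under `relative_change`.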

Bio:

Prateek is a 5th-year PhD student in ACSES, working with Dr. Mohit Tiwari in the SPARK Research Lab. His interests include hardware and systems security. He is currently working on cross-stack system security and orchestration, while his prior work has included hardware side-channel attacks and defenses.

[Series 03] Reliable Processing-in-Memory

Speaker: Jeageun Jung

Title: Reliable Processing-in-Memory

Date: October 1st, 2024 at 3:30 pm

Location: EER 3.646 or Zoom Link

Abstract:

Processing-in-memory (PIM) architectures enhance performance by integrating compute units near memory but introduce reliability challenges. Bank-PIMs maximize performance by placing compute units near memory banks but limit error-checking and correcting (ECC) to local domains, making it insufficient to handle faults and scaling-induced errors.

Bio:

Jeageun Jung’s research addresses this reliability gap by developing a PIM-specific ECC scheme tuned for the fault and error patterns expected in near-bank PIMs. To do this, Jeageun Jung also develops a new DRAM physical fault model based on empirical data that accurately predicts fault behavior across memory types.

[Series 08] Enabling Efficient Memory Systems using Novel Compression Methods

Title: Enabling Efficient Memory Systems using Novel Compression Methods

Speaker: Per Stenström

Chalmers University of Technology / ZeroPoint Technologies
Goteborg, Sweden

Date: Nov 7th, 2024 at 3:30 pm

Location: EER 3.646 or Zoom Link

Abstract:

Using data compression methods in the memory hierarchy can improve the efficiency of memory systems by enabling higher effective cache capacity, more effective use of available memory bandwidth, and higher effective main memory capacity. This can lead to substantially higher performance and lower power consumption. However, realizing these benefits requires highly effective compression algorithms that can be implemented with low latency and high throughput. Research at Chalmers University of Technology and at ZeroPoint Technologies, a fabless startup company, has yielded many new families of compression methods that are now being commercially deployed. This talk will present the major insights of more than a decade of research on compression methods for the memory hierarchy. The talk covers value-aware caches and statistical compression of cache content, compression algorithms that are tuned to the data at hand through data analysis using new clustering algorithms to allow for substantially higher memory bandwidth, and compression infrastructures that expand the capacity of main memory.

Bio:

Per Stenström is a professor at Chalmers University of Technology. His research interests are in parallel computer architecture. He has authored or co-authored four textbooks, about 200 publications, and twenty patents in this area. He has been program chairman of several top-tier IEEE and ACM conferences, including the IEEE/ACM Symposium on Computer Architecture, and acts as Associate Editor of ACM TACO and Topical Editor of IEEE Transactions on Computers. He is a Fellow of the ACM and the IEEE and a member of Academia Europaea and the Royal Swedish Academy of Engineering Sciences.

[Series 02] Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU

Title: Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU

Speaker: Kristof Denolf & Joseph Melber

Date: September 17, 2024 at 3:30 pm

Location: EER 3.646 or Zoom Link

Abstract: Specialized hardware accelerators are abundantly available today including NPUs found in consumer laptops with AMD Ryzen™ AI CPUs. The NPU of AMD Ryzen™ AI devices includes an AI Engine array comprised of a set of VLIW vector processors, data movement accelerators (DMAs) and adaptable interconnect. By providing convenient software tool flows to program these devices, enthusiasts are enabled to productively harness the full capabilities of these powerful NPUs. IRON is a close-to-metal open-source toolkit enabling performance engineers to build fast and efficient, often specialized, designs through a set of Python language bindings around the mlir-aie dialect. The presentation will provide insights into the AI Engine compute and data movement capabilities supported in our tool flow. The speakers will demonstrate performance optimizations of increasingly complex designs by leveraging the unique architectural features of AI Engines.

Bio:

Kristof Denolf is a Fellow in AMD’s Research and Advanced Development group where he is working on energy-efficient computer vision and video processing applications to shape future AMD devices. He earned an M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, an M.Sc. in electronic system design from Leeds Beckett University (2000), and a Ph.D. from the Technical University Eindhoven (2007). He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx, and AMD. His main research interests are all aspects of the cost-efficient and dataflow-oriented design of video, vision, and graphics systems.

Joseph Melber is a Senior Member of Technical Staff in AMD’s Research and Advanced Development group. At AMD, he is working on hardware architectures and compiler technologies for current and future AMD devices. He received a BS in electrical engineering from the University at Buffalo, as well as MS and PhD degrees from the electrical and computer engineering department at Carnegie Mellon University. His research interests include runtime systems, compiler abstractions for data movement, and hardware prototypes for future adaptive heterogeneous computing architectures.

[Series 01a] Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance

Title: Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance

Speaker: Ataberk Olgun

Date: September 10, 2024 at 3:30 pm

Location: EER 3.650 or Zoom Link

Talk abstract: DRAM chips are increasingly vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing DRAM rows causes bitflips in nearby rows due to DRAM density scaling. Even though many prior works develop various RowHammer solutions, these solutions incur non-negligible and increasingly higher system performance, energy, and hardware area overheads as RowHammer vulnerability worsens.

In this talk, we will present our recent works on 1) understanding DRAM read disturbance in modern high bandwidth memory (HBM) chips, along with the open source infrastructure that enables experimental studies on state-of-the-art DRAM chips, and 2) performance-, energy-, and area-efficient system-level solutions to read disturbance. First, we describe the results of a detailed experimental analysis of read disturbance in six real HBM2 chips. We show that (1) the read disturbance vulnerability significantly varies between different HBM2 chips and between different components (e.g., 3D-stacked channels) inside a chip, (2) DRAM rows at the end and in the middle of a bank are more resilient to read disturbance, (3) fewer additional activations are sufficient to induce more read disturbance bitflips in a DRAM row if the row exhibits the first bitflip at a relatively high activation count, and (4) a modern HBM2 chip implements undocumented read disturbance defenses that track potential aggressor rows based on how many times they are activated. We also briefly describe the infrastructure that enabled the discoveries we made in our study on read disturbance in high bandwidth memory chips along with those made in multiple recent works that investigate read disturbance in real DRAM chips (e.g., RowPress).

Second, we introduce ABACuS, a new low-cost hardware-counter-based RowHammer mitigation technique that performance-, energy-, and area-efficiently scales with worsening RowHammer vulnerability. ABACuS’s key idea is to use a single shared row activation counter to track activations to the rows with the same row address in all DRAM banks. Unlike state-of-the-art RowHammer mitigation mechanisms that implement a separate row activation counter for each DRAM bank, ABACuS implements fewer counters (e.g., only one) to track an equal number of aggressor rows. At very low RowHammer thresholds (where only 125 activations cause a bitflip), ABACuS induces small system performance and DRAM energy overhead, and outperforms and takes up smaller chip area than the state-of-the-art mitigation techniques (Hydra and Graphene).
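The shared-counter idea described above can be illustrated with a deliberately simplified sketch. This is not the ABACuS design, whose details the abstract does not give: the class name, the `threshold` parameter, and the policy of counting every activation to a row address regardless of bank (a conservative over-count) are all assumptions made purely for illustration.

```python
from collections import defaultdict


class SharedRowActivationCounter:
    """Toy model: one activation counter per row address, shared by all banks.

    Hypothetical simplification for illustration; it over-counts because
    activations from different banks all increment the same counter.
    """

    def __init__(self, num_banks: int, threshold: int):
        self.num_banks = num_banks
        self.threshold = threshold
        self.counts = defaultdict(int)  # row address -> shared count

    def activate(self, bank: int, row: int) -> list:
        """Record an activation; return (bank, row) victims to refresh."""
        if not 0 <= bank < self.num_banks:
            raise ValueError("bank out of range")
        self.counts[row] += 1
        if self.counts[row] < self.threshold:
            return []
        # Threshold reached: reset the shared counter and refresh the
        # neighbors of this row address in every bank, since the counter
        # cannot tell which bank's row was actually hammered.
        self.counts[row] = 0
        victims = []
        for b in range(self.num_banks):
            for neighbor in (row - 1, row + 1):
                if neighbor >= 0:
                    victims.append((b, neighbor))
        return victims
```

The trade-off this sketch makes visible is storage versus preventive work: a single counter covers the same row address in all banks, at the cost of refreshing neighbors in every bank when the shared count reaches the threshold.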

All data, sources, and paper PDFs for the described works are freely and openly available.
– HBM Read Disturbance: https://github.com/CMU-SAFARI/HBM-Read-Disturbance, Paper PDF: https://arxiv.org/pdf/2310.14665
– DRAM Bender: https://github.com/CMU-SAFARI/DRAM-Bender, Paper PDF: https://arxiv.org/pdf/2211.05838
– ABACuS sources: https://github.com/CMU-SAFARI/ABACuS, Paper PDF: https://arxiv.org/pdf/2310.09977

Bio: Ataberk Olgun is a 3rd year PhD student at ETH Zurich. His broad research interests include designing secure, high-performance, and energy-efficient DRAM architectures. Especially with worsening RowHammer vulnerability, it is increasingly difficult to design new DRAM architectures that satisfy all three characteristics. His current research focuses on i) deeply understanding and ii) efficiently mitigating the RowHammer vulnerability in modern systems.

[Series 01b] Enabling the Adoption of Data-Centric Systems: Hardware/Software Support for Processing-Using-Memory Architectures

Title: Enabling the Adoption of Data-Centric Systems: Hardware/Software Support for Processing-Using-Memory Architectures

Speaker: Geraldo F. Oliveira

Date: September 11, 2024 at 2:00 pm

Location: EER 0.806/0.808 or Zoom Link

Talk abstract: The increasing prevalence and growing size of data in modern applications have led to high performance and energy costs for computation in traditional processor-centric computing systems. To mitigate these costs, the processing-in-memory (PIM) paradigm moves computation closer to where the data resides, reducing (and sometimes eliminating) the need to move data between memory and the processor. There are two main approaches to PIM: (1) processing-near-memory (PNM), where PIM logic is added to the same die as memory or to the logic layer of 3D-stacked memory, and (2) processing-using-memory (PUM), which uses the operational principles of memory cells to perform computation. Due to a push from the application domain and recent developments in memory manufacturing and packaging, memory manufacturers (and startups) have finally introduced the first real-world PNM architectures into the market. However, fully adopting PUM in today’s systems is still very challenging due to the lack of tools and system support for such architectures across the computer architecture stack, which includes (i) frameworks that can facilitate the implementation of complex operations and algorithms using the underlying PUM primitives; (ii) execution models that can take advantage of the available application parallelism to maximize hardware utilization and throughput; (iii) compiler support and compiler optimizations targeting PUM architectures; (iv) operating system support for PUM-aware virtual memory and memory management.

In this talk, we will discuss our major recent research results on different tools and system support for PUM architectures (with a focus on DRAM-based solutions), which aim to ease the adoption of such architectures in current and future systems. Our work builds on prior works ([1, 2]) that show that current DRAM chips can be modified slightly to execute simple data movement and Boolean operations, unleashing the PUM capabilities of current memory technologies. Based on that, we will first describe our efforts to extend the capabilities of PUM solutions further to enable their applicability to various workloads. To do so, we implement complex PUM operations using (i) SIMDRAM [3], an end-to-end framework that composes PUM primitives to implement complex arithmetic operations entirely within DRAM in a single-instruction multiple-data (SIMD) manner; and (ii) pLUTo [4], a PUM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs) instead of relying on complex extra in-DRAM logic. Second, we propose system solutions that expose the newly added PUM capabilities to the application stack, focusing on programmer-friendly approaches. Concretely, we will discuss MIMDRAM [5], a hardware/software co-designed PUM system that introduces the ability to allocate and control only the required amount of computing resources inside the DRAM subarray for PUM computation. MIMDRAM implements compiler passes and system support to guarantee high utilization of the PUM substrate. Third, we extensively analyze current commodity off-the-shelf (COTS) DRAM chips to characterize their capability to perform PUM operations with modifications only to the DRAM controller and not to the DRAM chip or interface [6]. 
We demonstrate that (i) PUM architectures are a promising solution, leading to significant (e.g., more than an order of magnitude) performance and energy gains compared to processor-centric systems for various real-world applications, and (ii) COTS DRAM chips are capable of performing a range of PUM operations with high success rates.
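As background for the Boolean operations mentioned above: Ambit [2] derives bulk AND and OR from a three-input bitwise majority, which is what simultaneously activating three DRAM rows computes. The sketch below models that logic on Python integers standing in for DRAM rows; the function names and the `width` parameter are hypothetical, and nothing here models DRAM timing or reliability.

```python
def maj(a: int, b: int, c: int) -> int:
    """Bitwise three-input majority: each output bit is 1 if at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a: int, b: int) -> int:
    """AND via majority with an all-zeros control row."""
    return maj(a, b, 0)


def bulk_or(a: int, b: int, width: int) -> int:
    """OR via majority with an all-ones control row of `width` bits."""
    all_ones = (1 << width) - 1
    return maj(a, b, all_ones)
```

With 4-bit rows `a = 0b1100` and `b = 0b1010`, `bulk_and(a, b)` gives `0b1000` and `bulk_or(a, b, 4)` gives `0b1110`, matching the ordinary `&` and `|` operators.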

[1] V. Seshadri, Y. Kim et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization,” in MICRO, 2013.
[2] V. Seshadri, D. Lee et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
[3] N. Hajinazar, G. F. Oliveira et al., “SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM,” in ASPLOS, 2021.
[4] J. D. Ferreira, G. Falcao et al., “pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables,” in MICRO, 2022.
[5] G. F. Oliveira, A. Olgun et al., “MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing,” in HPCA, 2024.
[6] I. E. Yuksel, Y. C. Tugrul et al., “Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis,” in HPCA, 2024.

Bio: Geraldo F. Oliveira (https://geraldofojunior.github.io/) is a Ph.D. candidate in the Safari Research Group at ETH Zürich, working with Prof. Onur Mutlu. His current broader research interests are in computer architecture and systems, focusing on memory-centric architectures for high-performance and energy-efficient systems. In particular, his Ph.D. research focuses on taking advantage of new memory technologies to accelerate distinct classes of applications and provide system support for novel memory-centric systems. Geraldo has published several works on this topic in major conferences and journals such as HPCA, ASPLOS, ISCA, MICRO, and IEEE Micro.

Welcome to CompArch 2025 Spring

UT Austin Computer Architecture Seminar Series 2025 Spring

[Series 01] Securing Computer Systems using AI Methods and for AI Applications
