October 7, 2024, Filed Under: 2024 Fall Semester, Current Semester[Series 04] Characterization of network proxies in micro-service orchestration Speaker: Prateek Sahu Title: Characterization of network proxies in micro-service orchestration Date: October 8th, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Network proxies, aka sidecars, are used by organizations to manage and run hundreds of cloud microservices in a consistent manner. Since sidecars interpose on network traffic to provide telemetry and security features, they can degrade critical service level metrics such as latency and throughput. However, the precise impact of sidecars on such key metrics is unclear. We introduce SCoPE to quantify service-layer overheads as well as the micro-architectural implications of using sidecars in service meshes – and characterize these overheads across a range of sidecar configurations. SCoPE demonstrates that sidecars can degrade latency and throughput by up to 150% and 35%, respectively, across common benchmark applications. We find that the absolute overheads of the sidecars are independent of the microservices being proxied and depend on the proxy configuration and the microservice topology. Our micro-architectural analysis of sidecars indicates no discernible reuse of the instruction caches (i.e., poor misses per kilo instructions/MPKI) despite high-frequency reuse of sidecars. We note that increasing the private caches from 256KB to 1.25MB across processor generations sees only a 10% improvement in the processor frontend – this is due to high indirect branch misses and thrashing from more aggressive prefetchers and predictors that degrade the L1-I cache MPKIs up to 40%. Our analysis also finds that utilizing a few large pages can reduce iTLB misses and page walks by 80% at the cost of modest memory overheads. Bio: Prateek is a 5th year PhD student in ACSES, working with Dr. Mohit Tiwari in the SPARK Research Lab. 
His interests include hardware and systems security. He is currently working towards cross-stack system security and orchestration, while his prior work has included hardware side-channel attacks and defenses.
October 1, 2024, Filed Under: 2024 Fall Semester, Current Semester[Series 03] Reliable Processing-in-Memory Speaker: Jeageun Jung Title: Reliable Processing-in-Memory Date: October 1st, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Processing-in-memory (PIM) architectures enhance performance by integrating compute units near memory but introduce reliability challenges. Bank-PIMs maximize performance by placing compute units near memory banks but limit error-checking and correcting (ECC) to local domains, making it insufficient to handle faults and scaling-induced errors. Bio: Jeageun Jung’s research addresses this reliability gap by developing a PIM-specific ECC scheme tuned for the fault and error patterns expected in near-bank PIMs. To do this, Jeageun Jung also develops a new DRAM physical fault model based on empirical data that accurately predicts fault behavior across memory types.
September 24, 2024, Filed Under: 2024 Fall Semester, Current Semester[Series 08] Enabling Efficient Memory Systems using Novel Compression Methods Title: Enabling Efficient Memory Systems using Novel Compression Methods Speaker: Per Stenström, Chalmers University of Technology / ZeroPoint Technologies, Goteborg, Sweden Date: Nov 7th, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Using data compression methods in the memory hierarchy can improve the efficiency of memory systems by enabling higher effective cache capacity, more effective use of available memory bandwidth and by enabling higher effective main memory capacity. This can lead to substantially higher performance and lower power consumption. However, realizing these benefits requires highly effective compression algorithms that can be implemented with low latency and high throughput. Research at Chalmers University of Technology and at ZeroPoint Technologies, a fabless startup company, has yielded many new families of compression methods that are now being commercially deployed. This talk will present the major insights of more than a decade of research on memory compression methods for the memory hierarchy. The talk covers value-aware caches and statistical compression of cache content, compression algorithms that are tuned to the data at hand through data analysis using new clustering algorithms to allow for substantially higher memory bandwidth and compression infrastructures that expand capacity of main memory. Bio: Per Stenström is a professor at Chalmers University of Technology. His research interests are in parallel computer architecture. He has authored or co-authored four textbooks, about 200 publications and twenty patents in this area. He has been program chairman of several top-tier IEEE and ACM conferences including IEEE/ACM Symposium on Computer Architecture and acts as Associate Editor of ACM TACO and Topical Editor of IEEE Transactions on Computers.
He is a Fellow of the ACM and the IEEE and a member of Academia Europaea and the Royal Swedish Academy of Engineering Sciences.
September 10, 2024, Filed Under: 2024 Fall Semester, Current Semester[Series 02] Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU Title: Leveraging the IRON AI Engine API to program the Ryzen™ AI NPU Speaker: Kristof Denolf & Joseph Melber Date: September 17, 2024 at 3:30 pm Location: EER 3.646 or Zoom Link Abstract: Specialized hardware accelerators are abundantly available today including NPUs found in consumer laptops with AMD Ryzen™ AI CPUs. The NPU of AMD Ryzen™ AI devices includes an AI Engine array comprised of a set of VLIW vector processors, data movement accelerators (DMAs) and adaptable interconnect. By providing convenient software tool flows to program these devices, enthusiasts are enabled to productively harness the full capabilities of these powerful NPUs. IRON is a close-to-metal open-source toolkit enabling performance engineers to build fast and efficient, often specialized, designs through a set of Python language bindings around the mlir-aie dialect. The presentation will provide insights into the AI Engine compute and data movement capabilities supported in our tool flow. The speakers will demonstrate performance optimizations of increasingly complex designs by leveraging the unique architectural features of AI Engines. Bio: Kristof Denolf is a Fellow in AMD’s Research and Advanced Development group where he is working on energy-efficient computer vision and video processing applications to shape future AMD devices. He earned an M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, an M.Sc. in electronic system design from Leeds Beckett University (2000), and a Ph.D. from the Technical University Eindhoven (2007). He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx, and AMD. His main research interests are all aspects of the cost-efficient and dataflow-oriented design of video, vision, and graphics systems. 
Joseph Melber is a Senior Member of Technical Staff in AMD’s Research and Advanced Development group. At AMD, he is working on hardware architectures and compiler technologies for current and future AMD devices. He received a BS in electrical engineering from the University at Buffalo, as well as MS and PhD degrees from the electrical and computer engineering department at Carnegie Mellon University. His research interests include runtime systems, compiler abstractions for data movement, and hardware prototypes for future adaptive heterogeneous computing architectures.
September 3, 2024, Filed Under: 2024 Fall Semester, Current Semester[Series 01a] Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance Title: Experimentally Understanding and Efficiently Mitigating DRAM Read Disturbance Speaker: Ataberk Olgun Date: September 10, 2024 at 3:30 pm Location: EER 3.650 or Zoom Link Talk abstract: DRAM chips are increasingly vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing DRAM rows causes bitflips in nearby rows due to DRAM density scaling. Even though many prior works develop various RowHammer solutions, these solutions incur non-negligible and increasingly higher system performance, energy, and hardware area overheads as RowHammer vulnerability worsens. In this talk, we will present our recent works on 1) understanding DRAM read disturbance in modern high bandwidth memory (HBM) chips, along with the open source infrastructure that enables experimental studies on state-of-the-art DRAM chips, and 2) performance-, energy-, and area-efficient system-level solutions to read disturbance. First, we describe the results of a detailed experimental analysis of read disturbance in six real HBM2 chips. We show that (1) the read disturbance vulnerability significantly varies between different HBM2 chips and between different components (e.g., 3D-stacked channels) inside a chip, (2) DRAM rows at the end and in the middle of a bank are more resilient to read disturbance, (3) fewer additional activations are sufficient to induce more read disturbance bitflips in a DRAM row if the row exhibits the first bitflip at a relatively high activation count, and (4) a modern HBM2 chip implements undocumented read disturbance defenses that track potential aggressor rows based on how many times they are activated.
We also briefly describe the infrastructure that enabled the discoveries we made in our study on read disturbance in high bandwidth memory chips along with those made in multiple recent works that investigate read disturbance in real DRAM chips (e.g., RowPress). Second, we introduce ABACuS, a new low-cost hardware-counter-based RowHammer mitigation technique that performance-, energy-, and area-efficiently scales with worsening RowHammer vulnerability. ABACuS’s key idea is to use a single shared row activation counter to track activations to the rows with the same row address in all DRAM banks. Unlike state-of-the-art RowHammer mitigation mechanisms that implement a separate row activation counter for each DRAM bank, ABACuS implements fewer counters (e.g., only one) to track an equal number of aggressor rows. At very low RowHammer thresholds (where only 125 activations cause a bitflip), ABACuS induces small system performance and DRAM energy overhead, and outperforms and takes up smaller chip area than the state-of-the-art mitigation techniques (Hydra and Graphene). All data, sources, and paper PDFs for the described works are freely and openly available.
– HBM Read Disturbance: https://github.com/CMU-SAFARI/HBM-Read-Disturbance, Paper PDF: https://arxiv.org/pdf/2310.14665
– DRAM Bender: https://github.com/CMU-SAFARI/DRAM-Bender, Paper PDF: https://arxiv.org/pdf/2211.05838
– ABACuS sources: https://github.com/CMU-SAFARI/ABACuS, Paper PDF: https://arxiv.org/pdf/2310.09977
Bio: Ataberk Olgun is a 3rd year PhD student at ETH Zurich. His broad research interests include designing secure, high-performance, and energy-efficient DRAM architectures. Especially with worsening RowHammer vulnerability, it is increasingly difficult to design new DRAM architectures that satisfy all three characteristics. His current research focuses on i) deeply understanding and ii) efficiently mitigating the RowHammer vulnerability in modern systems.
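For readers unfamiliar with counter-based RowHammer mitigation, the shared-counter idea in the ABACuS abstract above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the mechanism from the paper: the class `SharedRowCounter`, the helper functions, and the threshold value are all hypothetical, and the model simply counts every activation of a row address regardless of bank, which over-counts conservatively.

```python
from collections import defaultdict


class SharedRowCounter:
    """Toy model (hypothetical): one activation counter per row address,
    shared by all banks, instead of one counter per (bank, row) pair."""

    def __init__(self, threshold):
        self.threshold = threshold      # assumed preventive-refresh threshold
        self.counts = defaultdict(int)  # row address -> shared activation count
        self.refreshes = []             # row addresses that triggered a refresh

    def activate(self, bank, row):
        """Record one row activation; return True if a preventive refresh fires."""
        # The bank id is deliberately ignored: activations of the same row
        # address in any bank feed the same counter (a conservative over-count).
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            self.refreshes.append(row)
            self.counts[row] = 0
            return True
        return False


def per_bank_counter_count(trace):
    """Counters a per-bank scheme would need for this trace."""
    return len({(bank, row) for bank, row in trace})


def shared_counter_count(trace):
    """Counters the shared scheme needs for the same trace."""
    return len({row for _, row in trace})


if __name__ == "__main__":
    # The same row address is activated in four banks, twice each.
    trace = [(bank, 7) for bank in range(4)] * 2
    tracker = SharedRowCounter(threshold=8)
    fired = [tracker.activate(bank, row) for bank, row in trace]
    print("refresh fired on activation:", fired.index(True) + 1)
    print("per-bank counters:", per_bank_counter_count(trace))
    print("shared counters:", shared_counter_count(trace))
```

The comparison at the end shows where the area saving comes from in this toy setting: access patterns that touch the same row address in many banks need one counter here, where a per-bank design needs one per bank.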
September 3, 2024, Filed Under: 2024 Fall Semester, Current Semester[Series 01b] Enabling the Adoption of Data-Centric Systems: Hardware/Software Support for Processing-Using-Memory Architectures Title: Enabling the Adoption of Data-Centric Systems: Hardware/Software Support for Processing-Using-Memory Architectures Speaker: Geraldo F. Oliveira Date: September 11, 2024 at 2:00 pm Location: EER 0.806/808 or Zoom Link Talk abstract: The increasing prevalence and growing size of data in modern applications have led to high performance and energy costs for computation in traditional processor-centric computing systems. To mitigate these costs, the processing-in-memory (PIM) paradigm moves computation closer to where the data resides, reducing (and sometimes eliminating) the need to move data between memory and the processor. There are two main approaches to PIM: (1) processing-near-memory (PNM), where PIM logic is added to the same die as memory or to the logic layer of 3D-stacked memory, and (2) processing-using-memory (PUM), which uses the operational principles of memory cells to perform computation. Due to a push from the application domain and recent developments in memory manufacturing and packaging, memory manufacturers (and startups) have finally introduced the first real-world PNM architectures into the market. However, fully adopting PUM in today’s systems is still very challenging due to the lack of tools and system support for such architectures across the computer architecture stack, which includes (i) frameworks that can facilitate the implementation of complex operations and algorithms using the underlying PUM primitives; (ii) execution models that can take advantage of the available application parallelism to maximize hardware utilization and throughput; (iii) compiler support and compiler optimizations targeting PUM architectures; (iv) operating system support for PUM-aware virtual memory and memory management. 
In this talk, we will discuss our major recent research results on different tools and system support for PUM architectures (with a focus on DRAM-based solutions), which aim to ease the adoption of such architectures in current and future systems. Our work builds on prior works ([1, 2]) that show that current DRAM chips can be modified slightly to execute simple data movement and Boolean operations, unleashing the PUM capabilities of current memory technologies. Based on that, we will first describe our efforts to extend the capabilities of PUM solutions further to enable their applicability to various workloads. To do so, we implement complex PUM operations using (i) SIMDRAM [3], an end-to-end framework that composes PUM primitives to implement complex arithmetic operations entirely within DRAM in a single-instruction multiple-data (SIMD) manner; and (ii) pLUTo [4], a PUM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs) instead of relying on complex extra in-DRAM logic. Second, we propose system solutions that expose the newly added PUM capabilities to the application stack, focusing on programmer-friendly approaches. Concretely, we will discuss MIMDRAM [5], a hardware/software co-designed PUM system that introduces the ability to allocate and control only the required amount of computing resources inside the DRAM subarray for PUM computation. MIMDRAM implements compiler passes and system support to guarantee high utilization of the PUM substrate. Third, we extensively analyze current commodity off-the-shelf (COTS) DRAM chips to characterize their capability to perform PUM operations with modifications only to the DRAM controller and not to the DRAM chip or interface [6]. 
We demonstrate that (i) PUM architectures are a promising solution, leading to significant (e.g., more than an order of magnitude) performance and energy gains compared to processor-centric systems for various real-world applications, and (ii) COTS DRAM chips are capable of performing a range of PUM operations with high success rates.
[1] V. Seshadri, Y. Kim et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization,” in MICRO, 2013.
[2] V. Seshadri, D. Lee et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
[3] N. Hajinazar, G. F. Oliveira et al., “SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM,” in ASPLOS, 2021.
[4] J. D. Ferreira, G. Falcao et al., “pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables,” in MICRO, 2022.
[5] G. F. Oliveira, A. Olgun et al., “MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing,” in HPCA, 2024.
[6] I. E. Yuksel, Y. C. Tugrul et al., “Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis,” in HPCA, 2024.
Bio: Geraldo F. Oliveira (https://geraldofojunior.github.io/) is a Ph.D. candidate in the Safari Research Group at ETH Zürich, working with Prof. Onur Mutlu. His current broader research interests are in computer architecture and systems, focusing on memory-centric architectures for high-performance and energy-efficient systems. In particular, his Ph.D. research focuses on taking advantage of new memory technologies to accelerate distinct classes of applications and provide system support for novel memory-centric systems. Geraldo has published several works on this topic in major conferences and journals such as HPCA, ASPLOS, ISCA, MICRO, and IEEE Micro.
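As background for the abstract's description of composing simple Boolean PUM primitives into arithmetic executed in a bit-serial SIMD manner, the following software sketch may help. It is an illustration only, not SIMDRAM's implementation: Python integers stand in for DRAM rows (one bit per SIMD lane), majority and NOT are assumed to be the only available primitives, and all function names are hypothetical.

```python
def maj(a, b, c):
    """Bitwise three-input majority, applied to every lane at once."""
    return (a & b) | (a & c) | (b & c)


def to_bit_planes(values, width):
    """Vertical layout: plane i holds bit i of every element, one lane per element."""
    planes = []
    for i in range(width):
        plane = 0
        for lane, value in enumerate(values):
            plane |= ((value >> i) & 1) << lane
        planes.append(plane)
    return planes


def from_bit_planes(planes, lanes):
    """Inverse of to_bit_planes: rebuild one integer per lane."""
    return [
        sum(((plane >> lane) & 1) << i for i, plane in enumerate(planes))
        for lane in range(lanes)
    ]


def bit_serial_add(a_planes, b_planes, lanes):
    """Add all lanes in parallel, one bit position per step, using only MAJ and NOT."""
    ones = (1 << lanes) - 1  # mask with one bit set per lane

    def bit_not(x):
        return ones & ~x

    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        carry_out = maj(a, b, carry)
        # Full-adder sum bit expressed with majority and NOT only:
        # S = MAJ(NOT Cout, Cin, MAJ(A, B, NOT Cin))
        out.append(maj(bit_not(carry_out), carry, maj(a, b, bit_not(carry))))
        carry = carry_out
    return out  # the final carry is dropped, so results wrap modulo 2**width


if __name__ == "__main__":
    a, b, width = [3, 5, 250, 0], [1, 10, 10, 255], 8
    planes = bit_serial_add(to_bit_planes(a, width), to_bit_planes(b, width), len(a))
    print(from_bit_planes(planes, len(a)))
```

The number of steps depends on the operand width rather than on the number of elements, which is the property that makes this style of computation attractive when a single row operation covers thousands of lanes.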