Design Automation and Optimization for Memory-Bound Application Accelerators

Abstract

As we witness the breakdown of Dennard scaling, we can no longer build faster computers by shrinking transistors without increasing power density. Yet the amount of data to be processed has never stopped growing. The limited power budget erects a “power wall” between the ever-increasing demand for computation and the available computer hardware, forcing computer scientists to seek not only performant but also power-efficient computing solutions, especially in data centers. Moreover, the wide performance gap between computation units and memory erects a “memory wall” that limits performance along another dimension.
In the past decade, field-programmable gate arrays (FPGAs) have been rapidly adopted in data centers, thanks to their low power consumption and a reprogrammability that allows them to implement highly power-efficient accelerators for memory-bound applications. Meanwhile, C-based high-level synthesis (HLS) has grown alongside the FPGA acceleration market, bringing “hard-to-program” FPGA accelerators to a broader community across many application domains. However, creating efficient customized accelerators still demands FPGA-related expertise from the domain experts who write HLS C. To make matters worse, even for experienced FPGA programmers, C-based HLS is often less productive than higher-level software languages, especially when an application cannot be easily expressed using the compiler directives designed for data-parallel programs.
This dissertation addresses these two issues for domain-specific customizable accelerators targeting memory-bound applications with both regular and irregular memory access patterns. For memory-bound applications with regular memory accesses, we select stencil applications as a representative because of their complex data dependencies, which are challenging to optimize. We present SODA (Stencil with Optimized Dataflow Architecture), a domain-specific compiler framework for FPGA accelerators. We show that, by adopting theoretical analysis, model-driven design-space exploration, and domain-specific languages, programmers without FPGA expertise can build highly efficient stencil accelerators that outperform multi-threaded CPUs by up to 3.3×, with memory bandwidth utilization improved by 1.65× on average. For memory-bound applications with irregular memory accesses, we select graph applications as a representative because of their widespread presence across application domains. We first present TAPA (TAsk-PArallel), a language extension to HLS, showing that convenient programming interfaces, universal software simulation, and hierarchical code generation can greatly improve productivity for task-parallel programs and reduce programmers’ burden. We then extend this effort to support dynamically scheduled memory accesses, covering more applications and further improving productivity. Finally, with two case studies drawn from real-world graph applications, i.e., single-source shortest path for neural image reconstruction and graph convolutional neural networks for learning on graph structures, we show that customizable accelerators can achieve up to 4.9× speedup over state-of-the-art FPGA accelerators and 2.6× speedup over state-of-the-art multi-threaded CPU implementations.