PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs


In recent years, there has been increasing adoption of FPGAs in datacenters as hardware accelerators, where a large population of end users are software developers. While high-level synthesis (HLS) facilitates software programming, it is still challenging to scale large accelerator designs on modern datacenter FPGAs that often consist of multiple dies and memory banks. More specifically, routing congestion and extra delays on these multi-die FPGAs often cause timing closure issues and severe frequency degradation at the physical design level, which are difficult to digest and optimize for high-level programmers using HLS. One promising approach to mitigate such issues is to develop a high-level task-parallel programming model with HLS and physical design co-optimization. Unfortunately, existing studies only support a programming model where tasks communicate with each other via FIFOs, while many applications are not streaming friendly and many existing accelerator designs heavily rely on buffer based communication between tasks.
In this paper, we take a step further to support a task-parallel programming model where tasks can communicate via both FIFOs and buffers. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical design co-optimization. First, we design a decoupled latency-insensitive buffer channel that supports memory partitioning and ping-pong buffering, which is compatible with the vendor Vitis HLS compiler. In the frontend, we develop an easy-to-use programming interface to allow end users to use our buffer channel in their applications. In the backend, we provide automatic coarse-grained floorplanning and pipelining for designs that use our proposed buffer channel. We test PASTA on a set of task-parallel HLS designs that use buffers for task communication and show an average of 36% (up to 54%) frequency improvement for large design configurations.

In International Symposium On Field-Programmable Custom Computing Machines (FCCM), IEEE.