Publications

PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs

In recent years, there has been increasing adoption of FPGAs in datacenters as hardware accelerators, where a large population of end …

RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration

FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall …

SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs

Stencil computation is one of the fundamental computing patterns in many application domains such as scientific computing and image …

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

The continued growth in the processing power of FPGAs, coupled with high bandwidth memories (HBM), makes systems like the Xilinx U280 …

Democratizing Domain-Specific Computing

Creating a programming environment and compilation flow that empowers programmers to create their own DSAs efficiently and affordably …

TARO: Automatic Optimization for Free-Running Kernels in FPGA High-Level Synthesis

Streaming applications have become one of the key application domains for high-level synthesis (HLS) tools. For a streaming …

Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication

Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. SpMV plays a crucial role in many …

PYXIS: An Open-Source Performance Dataset Of Sparse Accelerators

Customized accelerators provide gains of performance and efficiency in specific domains of applications. Sparse data structures and/or …

StreamGCN: Accelerating Graph Convolutional Networks with Streaming Processing

While there have been many studies on hardware acceleration for deep learning on images, there has been a rather limited focus on …

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, …

RapidStream: Parallel Physical Implementation of FPGA HLS Designs

FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall …

Accelerating SSSP for Power-Law Graphs

The single-source shortest path (SSSP) problem is one of the most important and well-studied graph problems widely used in many …

Design Automation and Optimization for Memory-Bound Application Accelerators

As we witness the breakdown of Dennard scaling, we can no longer get faster computers by shrinking transistors without increasing power …

Recut: a Concurrent Framework for Sparse Reconstruction of Neuronal Morphology

Advancement in modern neuroscience is bottlenecked by neural reconstruction, a process that extracts 3D neuron morphology (typically in …

SPA-GCN: Efficient and Flexible GCN Accelerator with an Application for Graph Similarity Computation

While there have been many studies on hardware acceleration for deep learning on images, there has been a rather limited focus on …

Extending High-Level Synthesis for Task-Parallel Programs

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in …

HBM Connect: High-Performance HLS Interconnect for FPGA HBM

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory …

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs

Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in …

When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory …

Exploiting Computation Reuse for Stencil Accelerators

Stencil kernel is an important type of kernel used extensively in many application domains. Over the years, researchers have been …

Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency

Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. In this work, …

HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration

The domain-specific language (DSL) for image processing, Halide, has generated a lot of interest because of its capability of …

FLASH: Fast, ParalleL, and Accurate Simulator for HLS

A large semantic gap between a high-level synthesis (HLS) design and a low-level RTL simulation environment often creates a barrier for …

Rapid Cycle-Accurate Simulator for High-Level Synthesis

A large semantic gap between the high-level synthesis (HLS) design and the low-level (on-board or RTL) simulation environment often …

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

With the pursuit of improving compute performance under strict power constraints, there is an increasing need for deploying …

SODA: Stencil with Optimized Dataflow Architecture

Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial …

GraphH: A Processing-in-Memory Architecture for Large-scale Graph Processing

Large-scale graph processing requires high-bandwidth data access. However, as graph computing continues to scale, it becomes …

An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism

Stencil computation is one of the most important kernels for many applications such as image processing, solving partial differential …

ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture

The performance of large-scale graph processing suffers from challenges including poor locality, lack of scalability, random access …

FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search

Large-scale graph processing is gaining increasing attention in many domains. Meanwhile, FPGA provides a power-efficient and highly …

NXgraph: An Efficient Graph Processing System on a Single Machine

Recent studies show that graph processing systems on a single machine can achieve competitive performance compared with cluster-based …

Test–Retest Reliability of Graph Metrics in High-resolution Functional Connectomics: A Resting-State Functional MRI Study

Background: The combination of resting-state functional MRI (R-fMRI) technique and graph theoretical approaches has emerged as a …