Stencil computation is one of the most important kernels for many applications such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high throughput stencil kernel is not trivial due to its nature of high memory access load and low operational intensity. In this work we adopt data reuse and fine-grained parallelism and present an optimal microarchitecture for stencil computation. The data reuse line buffers not only fully utilize the external memory bandwidth and fully reuse the input data, they also minimize the size of data reuse buffer given the number of fine-grained parallelized and fully pipelined PEs. With the proposed microarchitecture, the number of PEs can be increased to saturate all available off-chip memory bandwidth. We implement this microarchitecture with a high-level synthesis (HLS) based template instead of register transfer level (RTL) specifications, which provides great programmability. To guide the system design, we propose a performance model in addition to detailed model evaluation and optimization analysis. Experimental results from on-board execution show that our design can provide an average of 6.5x speedup over line buffer-only design with only 2.4x resource overhead. Compared with loop transformation-only design, our design can implement a fully pipelined accelerator for applications that cannot be implemented with loop transformation-only due to its high memory conflict and low design flexibility. Furthermore, our FPGA implementation provides 83% throughput of a 14-core CPU with 4x energy-efficiency.