2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but their memory bandwidth does not keep pace, so algorithms such as image convolution remain memory bound on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. To reduce memory communication, we reorganize the convolution algorithm to prefetch image regions into registers, and we do more work per thread with fewer threads. To enable portability to future architectures, we implement a convolution autotuner that sweeps the design space of memory layouts and loop unrolling configurations. We focus on convolution with small filters (2×2 to 7×7), but our techniques can be extended to larger filter sizes. Depending on filter size, our speedups on two NVIDIA architectures range from 1.2x to 4.5x over state-of-the-art GPU libraries.
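As a concrete illustration of the register-prefetching and work-per-thread ideas, the CUDA sketch below has each thread compute a small tile of output pixels from an input patch staged in a thread-local array, which the fully unrolled loops let the compiler promote to registers. This is a minimal sketch under stated assumptions, not the paper's implementation: the 5×5 filter, the 2×2 output tile per thread, the constant-memory filter `d_filter`, and the clamp-to-edge border handling are all illustrative choices.

```cuda
// Illustrative register-blocked 2D convolution (cross-correlation form,
// filter anchored at the top-left output pixel for simplicity).
// Assumptions (not from the paper): K = 5, 2x2 outputs per thread,
// filter in constant memory, clamp-to-edge borders.
#include <cuda_runtime.h>

#define K    5   // filter width/height (paper targets 2x2-7x7)
#define TILE 2   // output pixels per thread in each dimension

__constant__ float d_filter[K * K];

__global__ void conv2d_regtile(const float *in, float *out, int w, int h)
{
    // Top-left output pixel owned by this thread.
    int ox = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
    int oy = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;

    // Prefetch the (TILE+K-1)^2 input patch into a local array; the fully
    // unrolled loops below allow the compiler to keep it in registers.
    float patch[TILE + K - 1][TILE + K - 1];
    #pragma unroll
    for (int y = 0; y < TILE + K - 1; ++y)
        #pragma unroll
        for (int x = 0; x < TILE + K - 1; ++x) {
            int iy = min(max(oy + y, 0), h - 1);  // clamp at image borders
            int ix = min(max(ox + x, 0), w - 1);
            patch[y][x] = in[iy * w + ix];
        }

    // Each thread produces TILE x TILE outputs from the prefetched patch,
    // so the global-memory loads are amortized over TILE*TILE results.
    #pragma unroll
    for (int ty = 0; ty < TILE; ++ty)
        #pragma unroll
        for (int tx = 0; tx < TILE; ++tx) {
            float acc = 0.f;
            #pragma unroll
            for (int fy = 0; fy < K; ++fy)
                #pragma unroll
                for (int fx = 0; fx < K; ++fx)
                    acc += patch[ty + fy][tx + fx] * d_filter[fy * K + fx];
            if (oy + ty < h && ox + tx < w)
                out[(oy + ty) * w + ox + tx] = acc;
        }
}
```

Host setup (copying the filter with cudaMemcpyToSymbol and launching a grid that covers the image at TILE outputs per thread) is omitted. Note that an autotuner like the one described above would sweep parameters such as TILE and the unrolling configuration per architecture rather than fixing them at compile time as this sketch does.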