Synthesis directly from binaries has been proposed as an option to alleviate the design burden. However, in program binaries, loop bounds and loop invariants used for memory index calculation are often compiled into runtime data stored in registers or memory, making static loop dependence analysis infeasible. In this work, a two-phase approach is presented to address this issue with: 1) an offline phase that recovers memory access patterns in the loop for data dependence analysis based on software profiling, and 2) an online phase that dynamically checks parallelization assertions. We use this method to discover and exploit coarse-grained parallelism for accelerating compute-intensive affine loops in binaries. On our target platform, the Zynq-7000 FPGA SoC, we ran and examined four benchmarks with our flow: GemsFDTD, Matrix Multiply, Sobel Edge Detection, and K-Nearest Neighbors. Results show up to 9.5x speedup with our flow compared to the pure software flow on the ARM Cortex-A9 processor.
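The two phases can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the affine model `addr = base + i*stride`, and the disjointness check are assumptions chosen to convey the idea: the offline phase fits an affine pattern to a profiled address trace, and the online phase asserts that no iteration reads an address written by a different iteration before committing to parallel execution.

```python
def profile_accesses(trace):
    """Offline phase (illustrative): fit an affine pattern
    addr = base + i*stride to a profiled address trace.
    Returns (base, stride), or None if the trace is not affine."""
    if len(trace) < 2:
        return None
    base, stride = trace[0], trace[1] - trace[0]
    for i, addr in enumerate(trace):
        if addr != base + i * stride:
            return None  # pattern broken: fall back to sequential execution
    return base, stride

def check_parallel_safe(write_pattern, read_pattern, n):
    """Online phase (illustrative): assert that over n iterations, no
    iteration reads an address written by a *different* iteration,
    i.e. the loop carries no cross-iteration dependence."""
    wb, ws = write_pattern
    rb, rs = read_pattern
    writes = {wb + i * ws: i for i in range(n)}  # writer per address
    for i in range(n):
        j = writes.get(rb + i * rs)
        if j is not None and j != i:
            return False  # iteration i reads iteration j's result
    return True
```

For example, a loop writing `A[i]` while reading `A[i-1]` yields read pattern `(base - 4, 4)` against write pattern `(base, 4)`, and the online check rejects parallelization; disjoint read and write regions pass.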