Synthesis directly from binaries has been proposed as an option to alleviate the design burden. However, in program binaries, loop bounds and loop invariants used for memory index calculation are often compiled into runtime data stored in registers or memories, making static loop dependence analysis infeasible. In this work, a two-phase approach is presented to address this issue: 1) an offline phase that recovers memory access patterns in the loop for data dependence analysis based on software profiling, and 2) an online phase that dynamically checks parallelization assertions. We use this method to discover and exploit coarse-grained parallelism for accelerating compute-intensive affine loops in binaries. On our target platform, the Zynq-7000 FPGA SoC, we evaluated four benchmarks with our flow: GemsFDTD, Matrix Multiply, Sobel Edge Detection, and K-Nearest Neighbors. Results show up to 9.5x speedup with our flow compared to the pure software flow on the ARM Cortex A9 processor.
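The two phases described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the offline phase fits a profiled address trace to an affine pattern `base + stride * i`, and that the parallelization assertion checks that the address ranges touched by two access streams are disjoint (the helper names `fit_affine` and `ranges_disjoint` are hypothetical).

```python
def fit_affine(trace):
    """Offline phase (sketch): fit a profiled address trace to
    base + stride * i; return None if the trace is not affine."""
    if len(trace) < 2:
        return None
    base, stride = trace[0], trace[1] - trace[0]
    for i, addr in enumerate(trace):
        if addr != base + stride * i:
            return None  # accesses do not follow a single affine pattern
    return base, stride

def ranges_disjoint(p, q, n):
    """Online assertion (sketch): True if two affine patterns touch
    disjoint address ranges over n iterations, i.e. no data dependence."""
    lo_p, hi_p = sorted((p[0], p[0] + p[1] * (n - 1)))
    lo_q, hi_q = sorted((q[0], q[0] + q[1] * (n - 1)))
    return hi_p < lo_q or hi_q < lo_p

# Example: a profiled read stream and write stream with 4-byte strides
reads  = [0x1000 + 4 * i for i in range(8)]
writes = [0x2000 + 4 * i for i in range(8)]
r, w = fit_affine(reads), fit_affine(writes)
print(r, w)                        # recovered (base, stride) pairs
print(ranges_disjoint(r, w, 8))   # → True: iterations may run in parallel
```

In the real flow these checks would guard the hardware-accelerated version of the loop, falling back to sequential software execution when an assertion fails at runtime.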