The SEJITS component of ASPIRE is designed to dramatically simplify writing fast, energy-efficient code that can be retargeted to a variety of platforms. Begun as part of the Par Lab, the SEJITS approach uses domain-specific languages embedded in Python to generate efficient code for the underlying hardware platform.
Peter Birsinger, Richard Xia, Shoaib Kamil, and Armando Fox of ASPIRE will present a short paper, “Scalable Bootstrapping for Python”, at the ACM International Conference on Information and Knowledge Management (CIKM 2013). The paper introduces a SEJITS specializer, or DSEL (domain-specific embedded language) compiler, for the Bag of Little Bootstraps (BLB), a recently developed bootstrapping algorithm designed for distributed environments. We already had a BLB specializer that generated OpenMP or Cilk code for multicore CPUs; we’ve now extended it to generate code for Spark, a cluster-based MapReduce-like computing platform. That means a data scientist can write a single, serial Python program and run it three ways: in plain Python for “toy” problems, as OpenMP or Cilk code with good parallel performance for non-toy problems that fit on a single computer, and on a multi-computer Spark installation for much larger problems with large datasets.
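To make the idea concrete, here is a minimal sketch of what a serial, plain-Python BLB computation could look like for the “toy” case. It is not the SEJITS specializer or its API. The names `blb_stderr` and `weighted_mean` and the parameters `gamma`, `num_subsamples`, and `num_bootstraps` are hypothetical, chosen only for illustration. The sketch assumes the usual BLB recipe: draw a few small subsamples without replacement, simulate full-size bootstrap resamples of each subsample by drawing multinomial counts, measure the spread of the estimator within each subsample, and average those spreads.

```python
import random
import statistics
from collections import Counter


def weighted_mean(values, weights):
    """Estimator: mean of `values`, where values[i] is counted weights[i] times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def blb_stderr(data, estimator=weighted_mean, gamma=0.7,
               num_subsamples=10, num_bootstraps=50, seed=0):
    """Estimate the standard error of `estimator` on `data` with a BLB-style procedure.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    # Subsample size b = n**gamma, clamped to a sensible range.
    b = min(n, max(2, int(n ** gamma)))
    per_subsample = []
    for _ in range(num_subsamples):
        # Small subsample drawn without replacement.
        subsample = rng.sample(data, b)
        estimates = []
        for _ in range(num_bootstraps):
            # A size-n resample of the b points, stored as per-point counts
            # (a multinomial draw) instead of n explicit copies.
            counts = Counter(rng.choices(range(b), k=n))
            weights = [counts.get(i, 0) for i in range(b)]
            estimates.append(estimator(subsample, weights))
        # Quality measure for this subsample: spread of the estimates.
        per_subsample.append(statistics.stdev(estimates))
    # Final answer: average the per-subsample quality measures.
    return statistics.mean(per_subsample)
```

Because each subsample is processed independently, the outer loop is the natural unit of parallel work, whether that is a thread on a multicore CPU or a task on a cluster.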
In the paper, we evaluate the performance of the generated Spark BLB code on two applications: an email classifier for the Enron public email corpus, and an estimator of 2-gram word frequency ratios across different decades using data from the Google N-gram dataset (201 GB). The experiments show strong scaling from 4 to 32 Amazon EC2 nodes (32 to 256 cores).
We are currently working with Dr. Gerald Friedland and others at ICSI to apply this specializer to multimedia classification problems.