Trash Day: Coordinating Garbage Collection in Distributed Systems

Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications.

In this paper, we show that distributed applications suffer from each node’s language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes.

We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.