A Center for Sustainable Cloud Computing

Sustainable Data Centers


Hybrid Data Scheduler

What if the best elements of centralized and distributed scheduling could be combined in a hybrid data scheduler?

Data center operators have limited capital and server resources, yet they must increase efficiency to handle enormous volumes of data. Compounding the problem is the heterogeneous nature of jobs, each with different requirements: short jobs are latency sensitive, while long jobs can tolerate higher latency but suffer when scheduled inefficiently. This makes task scheduling a critical problem. There are two approaches: making all scheduling decisions in a single place (centralized scheduling) or spreading them across many schedulers (distributed scheduling). The latter was developed to address the limitations of the former. However, as researchers at EPFL have shown, distributed scheduling handles short jobs poorly under high load. They suggest a way forward: what if the best elements of centralized and distributed scheduling could be combined in a hybrid data scheduler?

Taking this proposition to its logical conclusion, the researchers first developed the hybrid data scheduler “Hawk” and then its successor, “Eagle.” Compared with the random probing and work stealing used in Hawk, Eagle considerably improves job completion times. It uses a centralized scheduler for long jobs (such as graph analytics) and distributed schedulers for short jobs. Eagle can be implemented efficiently because the information provided to the distributed schedulers only needs to be approximate, with care taken to avoid any adverse impact on the performance of the centralized scheduler.
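The split between the two scheduling paths can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Hawk or Eagle implementation: the `Node` and `HybridScheduler` names, the runtime `cutoff` that separates short from long jobs, and the number of `probes` are all hypothetical, and the sketch omits work stealing and the approximate state sharing described above.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Estimated runtimes (seconds) of the tasks queued on this node.
    queue: list = field(default_factory=list)

    def backlog(self) -> float:
        """Total estimated work waiting on this node."""
        return sum(self.queue)


class HybridScheduler:
    """Illustrative hybrid scheduler; all parameters are assumptions."""

    def __init__(self, nodes, cutoff=10.0, probes=2, seed=0):
        self.nodes = nodes
        self.cutoff = cutoff    # hypothetical short/long threshold in seconds
        self.probes = probes    # nodes sampled per short job
        self.rng = random.Random(seed)

    def is_long(self, est_runtime: float) -> bool:
        return est_runtime >= self.cutoff

    def place(self, est_runtime: float) -> Node:
        if self.is_long(est_runtime):
            # Centralized path: use the full cluster view and pick the
            # node with the least queued work.
            node = min(self.nodes, key=Node.backlog)
        else:
            # Distributed path: probe a few random nodes and pick the
            # least loaded of those, without consulting global state.
            k = min(self.probes, len(self.nodes))
            sampled = self.rng.sample(self.nodes, k)
            node = min(sampled, key=Node.backlog)
        node.queue.append(est_runtime)
        return node
```

With four empty nodes, two long jobs placed in a row land on different nodes because the centralized path sees every backlog, whereas a short job only sees the nodes it happened to probe. That limited view is what makes plain random probing fragile for short jobs when the cluster is heavily loaded.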

On production traces from Google, Yahoo, and Cloudera, the study shows that the hybrid data scheduler improves the 50th, 90th, and 99th percentiles of the short-job completion time distribution by a factor of 1.72 on average, with a negligible impact on long-job completion times.
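To make the metric concrete, the following sketch shows one way such an improvement factor could be computed from two lists of short-job completion times. It is an illustration only: the function names are hypothetical, the nearest-rank percentile definition is an assumption, and averaging the per-percentile factors with equal weight is a guess at the method rather than something the study states here.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def improvement_factors(baseline, hybrid, percentiles=(50, 90, 99)):
    """Ratio of baseline to hybrid completion time at each percentile.
    A factor above 1 means the hybrid scheduler is faster there."""
    return {
        p: percentile(baseline, p) / percentile(hybrid, p)
        for p in percentiles
    }


def average_factor(factors):
    """Unweighted mean of the per-percentile factors (an assumption)."""
    return sum(factors.values()) / len(factors)
```

Calling `improvement_factors(baseline_times, hybrid_times)` on measured completion times returns one ratio per percentile, and `average_factor` collapses them into a single number comparable in spirit to the factor quoted above.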

Another aspect of the research is resource management for data-parallel workloads. The study shows that by leveraging memory elasticity, an integral property of data-parallel tasks, tasks can run with much less memory at the cost of only a moderate performance penalty. For example, a task can complete with only a tenth of the memory it would ideally require, a saving that far outweighs the moderate increase in runtime.
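A rough way to see why the memory saving outweighs the slowdown is to compare the product of memory held and time held against the ideal run. The sketch below does that arithmetic. The function name is hypothetical, and the 1.5x runtime factor used in the example is an assumed value chosen for illustration, not a figure from the study.

```python
def memory_time_footprint(memory_fraction, runtime_factor):
    """Memory held multiplied by how long it is held, relative to a run
    with ideal memory (which has a footprint of 1.0)."""
    if memory_fraction <= 0 or runtime_factor <= 0:
        raise ValueError("both inputs must be positive")
    return memory_fraction * runtime_factor


# Assumed example: one tenth of the ideal memory, 1.5x the ideal runtime.
footprint = memory_time_footprint(0.1, 1.5)
print(f"relative memory-time footprint: {footprint:.2f}")
```

Under these assumed numbers the task occupies about 15% of the memory-time it would with its ideal allocation, so the freed memory can host other tasks even though this one finishes later.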

The researchers include (in no particular order) Pamela Delgado, Florin Dinu, Anne-Marie Kermarrec, Calin Iorgulescu, Aunn Raza, and Willy Zwaenepoel. They are currently engaged at various laboratories of EPFL.

Suggested Readings