Enabling Predictable Datacenters

What is the way forward to mitigate key contributors to performance unpredictability like network and storage?

Many facets of today’s digital world rely heavily on online services, be it for networking, e-commerce, live streaming, communication, or web search. These services handle huge volumes of data and a correspondingly large number of highly interactive processes. This creates a high demand for computing and storage, usually met by data centers equipped with thousands of servers. What makes them tick is a network of fault-tolerant systems, including data-analytics stacks, distributed storage systems, cluster schedulers, and microservices for web applications. The dynamic conditions under which these systems operate make it difficult to ensure end-to-end performance predictability. In fact, data centers often suffer from unforeseen performance fluctuations.

So how can data centers mitigate the sources of unpredictability? How can they satisfy service-level objectives (SLOs) such as availability, throughput, frequency, fluid response time, and quality? How can they guarantee performance to reduce customer costs and enhance resource utilization?

To answer these questions, EPFL’s Operating Systems Laboratory has initiated a study with Professor Willy Zwaenepoel as lead researcher. The team has identified the distributed file system as the primary cause of performance fluctuations. With that starting point, the researchers are investigating how distributed systems interact with other layers such as local file systems and data-analytics frameworks.

Designing and implementing a robust fault-management mechanism for distributed systems is notoriously difficult because of their ever-increasing scale and the density and unpredictability of their executions. The current research addresses this problem by proposing adaptive mechanisms that make the performance of distributed systems predictable. It explores how complex interactions between workloads degrade system performance, and proposes a design that boosts performance predictability by increasing resource utilization and reducing latencies.

Suggested readings