A Center for Sustainable Cloud Computing

Facebook and EPFL have initiated a collaborative program to carry out seminal research in areas of common interest to both organizations. Facebook seeks to leverage EPFL’s proven expertise in Computer Science and Engineering, enabling the flow of technology from one of the most renowned research institutions to the leading American social media conglomerate. The collaboration will also help Facebook strengthen its position in Switzerland and gain access to some of the best academic minds in Europe.

The following projects have already been lined up for the Full-System Accelerated and Secure ML Collaborative Research program:

  • Training for Recommendation Models on Heterogeneous Servers
  • Distributed Transformer Benchmarks
  • Full-System API Inference to Enforce Security
  • Communication Stacks for µServices in Datacenters

Each of these projects will be conducted by a member of the expert team from EPFL. The team includes David Atienza, Babak Falsafi, Martin Jaggi, and Mathias Payer. Babak Falsafi will be the point of contact for the engagement.

Training for Recommendation Models on Heterogeneous Servers   

This project aims to develop strategies to automatically select the best accelerator for a given DNN training job. The research by David Atienza and team will develop the software libraries needed to allocate workloads efficiently under performance, power, and accuracy constraints. Meta-learning algorithms will be created to train DL models and configure their hyper-parameters automatically, with the goal of outperforming current state-of-the-art approaches. This approach is expected to yield significant savings in total training time and improved robustness against minimization for designs with smaller memory sizes.
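As a rough illustration of constraint-based accelerator selection, the sketch below picks the fastest device that stays within a power budget and meets an accuracy floor. It is not the project's method: the `AcceleratorProfile` type, the `select_accelerator` function, and all device names and numbers are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AcceleratorProfile:
    """Hypothetical per-device estimates for one training job."""
    name: str
    est_train_hours: float   # estimated time to finish training
    avg_power_watts: float   # estimated average power draw
    est_accuracy: float      # estimated final model accuracy (0..1)


def select_accelerator(
    profiles: Sequence[AcceleratorProfile],
    max_power_watts: float,
    min_accuracy: float,
) -> Optional[AcceleratorProfile]:
    """Return the fastest device that satisfies both constraints, or None."""
    feasible = [
        p for p in profiles
        if p.avg_power_watts <= max_power_watts and p.est_accuracy >= min_accuracy
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.est_train_hours)


# Illustrative (made-up) profiles for three device types.
candidates = [
    AcceleratorProfile("gpu", est_train_hours=10.0, avg_power_watts=300.0, est_accuracy=0.80),
    AcceleratorProfile("cpu", est_train_hours=40.0, avg_power_watts=150.0, est_accuracy=0.80),
    AcceleratorProfile("low_precision_asic", est_train_hours=6.0, avg_power_watts=200.0, est_accuracy=0.74),
]

choice = select_accelerator(candidates, max_power_watts=250.0, min_accuracy=0.78)
print(choice.name if choice else "no feasible accelerator")
```

In this toy setting the fastest device is rejected for missing the accuracy floor and the GPU for exceeding the power budget, which shows how the three constraints named above can pull the choice in different directions.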

Distributed Transformer Benchmarks

MLBench, a framework for distributed machine learning, aims to serve as an easy-to-use and fair benchmarking suite for algorithms as well as for systems (software frameworks and hardware). It will provide reusable and reliable reference implementations of distributed ML training algorithms. MLBench supports a wide range of platforms, ML frameworks, and machine learning tasks. Its goal is to benchmark all or most of the currently relevant distributed execution frameworks. Lead researcher Martin Jaggi and team will soon release the first results and reference code for distributed training (starting with CIFAR-10 and ImageNet, in both PyTorch and TensorFlow).

Full-System API Inference to Enforce Security

Mathias Payer and team aim to build an API flow graph (AFG) that encodes all valid API interactions and their parameters. The proposed algorithm will build the global AFG by analyzing all uses of a function in the system’s source code. The researchers will leverage test projects that provide a large corpus of test cases and input files for a wide variety of programs. This data set will help infer API usage by monitoring state construction through the provided seeds and examples.
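To make the idea of an API flow graph more concrete, here is a heavily simplified sketch. It assumes the graph is inferred from observed call sequences (for example, traces recorded while running test cases) and ignores parameters entirely; the function names `build_afg` and `is_valid_sequence`, the `START` marker, and the example traces are all hypothetical and do not describe the researchers' actual algorithm.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set

START = "<start>"  # hypothetical marker for "no API call made yet"


def build_afg(traces: Iterable[Sequence[str]]) -> Dict[str, Set[str]]:
    """Record every observed transition between consecutive API calls."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for trace in traces:
        prev = START
        for call in trace:
            graph[prev].add(call)
            prev = call
    return dict(graph)


def is_valid_sequence(graph: Dict[str, Set[str]], calls: Sequence[str]) -> bool:
    """Accept a sequence only if every step follows an edge seen in the traces."""
    prev = START
    for call in calls:
        if call not in graph.get(prev, set()):
            return False
        prev = call
    return True


# Made-up traces for an imaginary file API.
observed = [
    ["open", "read", "close"],
    ["open", "write", "close"],
]
afg = build_afg(observed)

print(is_valid_sequence(afg, ["open", "write", "close"]))
print(is_valid_sequence(afg, ["read", "close"]))
```

A graph built this way only knows about interactions that appear in the corpus, which is one reason a large and varied set of test cases and input files matters for inference.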

Communication Stacks for µServices in Datacenters

In this study, Babak Falsafi and others will investigate technologies to support communication in microservices. The research extends their prior work on tighter integration of the network with memory, with support for memory pooling and RPC scheduling. It aims to tackle the software bottleneck in communication for microservices and to address challenges such as memory scalability for RPC, software stacks for high fan-out RPC processing, higher-level object access semantics via RPC to avoid multiple roundtrips, and support for data transformation across diverse language and software ecosystem boundaries. The researchers will investigate co-designed RPC technologies with hardware-terminated protocols that serve packets directly out of the CPU’s SRAM, eliminating the need to provision DRAM capacity and bandwidth and enabling a new class of RPC substrate that is inherently technology-scalable. They also propose to investigate optimizations for data transformation of common-case data formats running on conventional CPUs. They will explore integrating data transformation into the optimized RPC stack described above to identify opportunities for better data placement and for reducing data movement and buffering on commodity hardware. Technologies for hardware/software co-design of data transformers will also be within the scope of the work.

The Facebook-EPFL collaborative engagement has been approved for funding for an initial period of one year, with an expected renewal each year for at least three years. Each project includes a grant of CHF 200,000 per year, which will be used to financially support one student.

For more details of the individual projects, visit:

https://ecocloud.ch/project/training-for-recommendation-models-on-heterogeneous-servers/
https://ecocloud.ch/project/mlbench/
https://ecocloud.ch/project/full-system-api-inference-to-enforce-security/