A Center for Sustainable Cloud Computing

DNN training and inference share similar basic operators but have fundamentally different requirements. Training is throughput-bound and relies on high-precision floating-point arithmetic for convergence, while inference is latency-bound and tolerant of low-precision arithmetic. Both workloads require high computational capabilities and can benefit from hardware accelerators. The disparity in resource requirements forces datacenter operators to choose between deploying separate custom accelerators for training and for inference, or running inference on training accelerators.

However, neither option is optimal. The former results in datacenter heterogeneity and higher management costs, while the latter results in inefficient inference. Moreover, dedicated inference accelerators face load fluctuations, leading to overprovisioning and low average utilization.

The ColTraIn (Co-located DNN Training and Inference) team, drawn from EPFL’s PARSA and MLO, aims to restore datacenter homogeneity by co-locating training and inference without compromising inference efficiency or quality of service (QoS) guarantees. ColTraIn must overcome two key challenges: (1) the difference in the arithmetic representations used by the two workloads, and (2) the scheduling of training tasks on inference-bound accelerators. The recent release of HBFP (Hybrid Block Floating Point) addresses the first challenge.

HBFP trains DNNs with dense, fixed-point-like arithmetic for most operations without sacrificing accuracy, thus facilitating effective co-location. More specifically, HBFP offers the accuracy of 32-bit floating-point with the numeric and silicon density of 8-bit fixed-point for many models (ResNet, WideResNet, DenseNet, AlexNet, LSTM, and BERT).
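To make "fixed-point-like arithmetic" concrete, the following is a minimal sketch of generic block floating-point quantization, not the HBFP code from the project repository. It assumes the usual formulation in which all values in a block share one exponent and each value keeps only a small signed integer mantissa, so a dot product reduces to integer multiply-accumulates plus a single exponent adjustment. The function names and the default 8-bit mantissa width are illustrative assumptions.

```python
import math

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to one shared exponent plus integer mantissas.

    Illustrative sketch only: names and rounding policy are assumptions,
    not the HBFP project's implementation.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1; e is the shared exponent.
    _, shared_exp = math.frexp(max_abs)
    # Value of one mantissa least-significant bit.
    scale = math.ldexp(1.0, shared_exp - (mantissa_bits - 1))
    lo = -(1 << (mantissa_bits - 1))
    hi = (1 << (mantissa_bits - 1)) - 1
    # Round to the nearest representable mantissa and clamp to the signed range.
    mantissas = [min(hi, max(lo, round(x / scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=8):
    """Recover approximate float values from a quantized block."""
    scale = math.ldexp(1.0, shared_exp - (mantissa_bits - 1))
    return [q * scale for q in mantissas]

def bfp_dot(a, b, mantissa_bits=8):
    """Dot product of two blocks using only integer multiply-accumulates."""
    exp_a, q_a = bfp_quantize(a, mantissa_bits)
    exp_b, q_b = bfp_quantize(b, mantissa_bits)
    acc = sum(x * y for x, y in zip(q_a, q_b))  # integer arithmetic
    # Apply both shared exponents once, after the accumulation.
    return math.ldexp(float(acc), exp_a + exp_b - 2 * (mantissa_bits - 1))
```

Because the exponent is handled once per block rather than once per value, the inner loop looks like fixed-point arithmetic, which is where the density advantage described above comes from. Operations outside dot products would remain in floating point in a hybrid scheme; that part is not shown here.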

The open-source project repository is available for ongoing research on training DNNs with HBFP.

The ColTraIn team is now working on the second challenge by developing a co-locating accelerator. The design adds training capabilities to an inference accelerator and pairs it with a scheduler that takes both resource utilization and tasks’ QoS constraints into account to co-locate DNN training and inference.
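As a rough illustration of what such a scheduler has to weigh, the sketch below shows a hypothetical admission policy. It is not ColTraIn's scheduler: the `AcceleratorState` fields, the `training_share` function, and the threshold values are all assumptions chosen for the example. The policy gives training only the compute that inference leaves idle, and gives it nothing when inference tail latency is close to its QoS target.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorState:
    """Hypothetical per-accelerator measurements for one scheduling interval."""
    inference_utilization: float  # fraction of compute busy with inference, 0..1
    p99_latency_ms: float         # observed inference tail latency
    latency_slo_ms: float         # QoS latency target for inference

def training_share(state, slack_margin=0.2, max_utilization=0.9):
    """Return the fraction of the accelerator to lend to training.

    Assumed policy: protect inference QoS first, then fill idle capacity.
    """
    # If tail latency is within slack_margin of the target, back off entirely.
    if state.p99_latency_ms > (1.0 - slack_margin) * state.latency_slo_ms:
        return 0.0
    # Otherwise lend out idle capacity, keeping headroom for load spikes.
    return max(0.0, max_utilization - state.inference_utilization)
```

A real scheduler would also need to account for the cost of switching between tasks and for the QoS constraints of the training jobs themselves, which this sketch leaves out.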