Large-scale Data Analytics
Maximize resource utilization and concurrently achieve high parallelism and load balance.
Big data computing has emerged as a powerful paradigm not only for Internet companies but also for conventional businesses and government agencies. According to a study by International Data Corporation, the market is expected to grow from $130.1 billion in 2016 to more than $203 billion by the end of the decade. That growth is why cash-rich companies are investing heavily in artificial intelligence and its subfields. What has been missing so far is a similar thrust in developing new design principles and software architectures for analyzing large-scale data. That is the core focus of a new study in progress at EPFL’s Operating Systems Laboratory (LABOS).
Lead researcher Willy Zwaenepoel and his team are developing innovative design principles that could provide a good match between big data algorithms and the underlying computing and storage resources. They have identified two main problem areas in big data analysis: graph-structured data and data skew.
Graph analytics involves complex algorithms that consume considerable computational resources when processing peta-scale data in diverse fields such as social networking, medicine, bioinformatics, content analysis, and search engines. Computing power and memory capacity, however, have failed to keep up with the increasing scale of data analytics.
The existing approach to processing large graphs is to store them in the main memory of a single machine or of several machines. In contrast to this in-memory approach, Professor Zwaenepoel’s method processes graphs from secondary storage, using only a fraction of the resources required by the conventional method. The research team has developed two systems to implement this scheme: X-Stream, which processes graphs from secondary storage on a single machine, and SlipStream, which works with distributed storage in a cluster.
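To make the idea of processing a graph from secondary storage concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the actual design of X-Stream or SlipStream: the edge list lives in a binary file on disk and is read sequentially in chunks, while only a small per-vertex state array stays in memory. The function names (`write_edges`, `stream_edges`, `connected_components`) and the on-disk record layout are hypothetical choices made for this example.

```python
import struct

# Assumed on-disk format: one edge = two little-endian unsigned 32-bit vertex ids.
EDGE = struct.Struct("<II")


def write_edges(path, edges):
    """Store the edge list on disk as fixed-size binary records."""
    with open(path, "wb") as f:
        for src, dst in edges:
            f.write(EDGE.pack(src, dst))


def stream_edges(path, chunk_edges=1024):
    """Yield edges in file order, holding only one chunk in memory at a time."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(EDGE.size * chunk_edges)
            if not buf:
                break
            yield from EDGE.iter_unpack(buf)


def connected_components(path, num_vertices):
    """Label propagation over an undirected graph.

    Each pass streams the whole edge file sequentially; passes repeat
    until no vertex label changes. Only `label` is kept in memory.
    """
    label = list(range(num_vertices))
    changed = True
    while changed:
        changed = False
        for src, dst in stream_edges(path):
            low = min(label[src], label[dst])
            if label[src] != low:
                label[src] = low
                changed = True
            if label[dst] != low:
                label[dst] = low
                changed = True
    return label
```

For a file holding the edges (0, 1), (1, 2), and (3, 4) and six vertices, `connected_components` returns `[0, 0, 0, 3, 3, 5]`. The sketch trades extra sequential passes over the file for a much smaller memory footprint, which is the general trade-off behind out-of-core graph processing.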
Prof. Zwaenepoel’s research also proposes a new approach to taming data skew, the non-uniform distribution of values in a dataset. Data skew in complex database queries results in poor load balancing and longer response times. By targeting these problems, the research aims to maximize resource utilization and concurrently achieve high parallelism and load balance.
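The effect of data skew on load balance can be shown with a small sketch. It only demonstrates the problem and does not represent the team's approach to solving it. It assumes records are assigned to workers by key, so that all records sharing a key land on the same worker, as in a typical partitioned join or aggregation. The helper names and the toy datasets are hypothetical.

```python
def partition_loads(keys, num_workers):
    """Count the records each worker receives under key-based partitioning."""
    loads = [0] * num_workers
    for key in keys:
        # All records with the same key are routed to the same worker.
        loads[key % num_workers] += 1
    return loads


def imbalance(loads):
    """Busiest worker's load divided by the mean load; 1.0 is perfect balance."""
    return max(loads) / (sum(loads) / len(loads))


# 80 records over 8 keys, spread evenly: 10 records per key.
uniform = [k for k in range(8) for _ in range(10)]

# 80 records over 8 keys, but key 0 alone accounts for 66 of them.
skewed = [0] * 66 + [k for k in range(1, 8) for _ in range(2)]

uniform_loads = partition_loads(uniform, 4)  # [20, 20, 20, 20]
skewed_loads = partition_loads(skewed, 4)    # [68, 4, 4, 4]
```

With four workers, the uniform dataset gives an imbalance of 1.0, while the skewed one gives 3.4: one worker handles 68 of the 80 records and the other three sit mostly idle. Because a parallel query finishes only when its slowest worker does, this imbalance translates directly into longer response times.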