A Center for Sustainable Cloud Computing

Cloud Data Management

SmartDataLake: Sustainable Data Lakes for Extreme-Scale Analytics

An alternative approach to running queries and performing data analysis.

Big data is an integral part of many businesses because critical decisions are made based on the analysis of diverse and heterogeneous data sources and databases. In the conventional data analysis approach, the core element of business intelligence for storing, reporting on, and analyzing big data is the data warehouse. However, data warehouses cannot handle heterogeneous data formats or schemas. To circumvent this problem, a process called ETL (Extract, Transform, and Load) is employed: data is extracted from the different source systems, converted into a fixed, accepted format, and loaded into the data warehouse, where queries are then run. This traditional approach to data analytics is inefficient because ETL is extremely time-consuming. Moreover, the need to conform to a fixed format and schema hinders the incorporation of new data sources, or of changes to existing ones, over time.
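The ETL steps described above can be sketched in a few lines; the sources, schema, and field names below are illustrative assumptions, not those of any particular warehouse:

```python
import csv, io, json, sqlite3

# Two hypothetical sources with different schemas: a CSV feed and a JSON feed.
csv_source = "id,amount\n1,10.5\n2,7.0\n"
json_source = '[{"order_id": 3, "total": "12.25"}]'

def extract():
    # Extract: pull records from each source system in its native form.
    for row in csv.DictReader(io.StringIO(csv_source)):
        yield row["id"], row["amount"]
    for rec in json.loads(json_source):
        yield rec["order_id"], rec["total"]

def transform(rows):
    # Transform: conform every record to one fixed schema (int id, float amount).
    return [(int(i), float(a)) for i, a in rows]

def load(rows):
    # Load: insert the conformed records into the warehouse (SQLite here).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return db

db = load(transform(extract()))
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 29.75
```

Even this toy pipeline shows the cost: every source needs bespoke extraction and conversion code, and any schema change ripples through all three steps.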

An alternative approach to running queries and performing data analysis is the Data Lake. In contrast to the data warehouse approach, data scientists can tap directly into the Data Lake to analyze new and different data types, reaching decisions quickly and efficiently. This is possible because Data Lakes are raw-data ecosystems in which large amounts of diverse data coexist in their original model and format.

The goal of the SmartDataLake Project (https://smartdatalake.eu/), funded by the Horizon 2020 Framework Programme of the European Union, is to design, develop, and evaluate novel approaches and techniques for extreme-scale analytics over Big Data Lakes, facilitating the journey from raw data to actionable insights.

SmartDataLake aims to achieve the following objectives:

Virtualized and Adaptive Data Access

The project will facilitate efficient direct data access to heterogeneous data and allow cross-format query optimizations. In this context, the scientists have already developed a query optimizer and execution engine that efficiently accesses and runs queries across heterogeneous data formats without revealing the complexity of the underlying data formats to the user.
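One common way to realize such virtualized access (a sketch only, not the project's actual engine) is to wrap each data format behind a uniform scan interface, so the query layer never sees format-specific details:

```python
import csv, io, json

# Each wrapper exposes its records as dicts through the same scan() method,
# hiding the underlying format from the query layer.
class CsvWrapper:
    def __init__(self, text):
        self.text = text
    def scan(self):
        yield from csv.DictReader(io.StringIO(self.text))

class JsonWrapper:
    def __init__(self, text):
        self.text = text
    def scan(self):
        yield from json.loads(self.text)

def query(sources, predicate):
    # A trivial "execution engine": filter the union of all sources.
    return [r for s in sources for r in s.scan() if predicate(r)]

sources = [
    CsvWrapper("city,pop\nAthens,643000\nZurich,415000\n"),
    JsonWrapper('[{"city": "Delft", "pop": 103000}]'),
]
big = query(sources, lambda r: int(r["pop"]) > 200000)
print([r["city"] for r in big])  # ['Athens', 'Zurich']
```

A real cross-format optimizer would additionally push predicates down into each wrapper and choose join orders across sources; the point here is only the uniform interface.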

Automated and Adaptive Data Storage Tiering

To reduce hardware costs and operational expenses, the project will develop techniques that exploit different storage tiers behind a transparent multi-tier data storage layer. Work is in progress to optimize data access by automatically placing data in the storage hierarchy.
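A minimal sketch of such automatic placement, assuming a simple access-frequency policy (the tier names and thresholds are illustrative, not the project's):

```python
# Tiers ordered fastest-first, each with a minimum accesses-per-day
# threshold a dataset must meet to "earn" that tier.
TIERS = [("ssd", 100), ("hdd", 10), ("archive", 0)]

def place(access_counts):
    # Assign each dataset to the fastest tier whose threshold it meets.
    placement = {}
    for dataset, accesses in access_counts.items():
        for tier, threshold in TIERS:
            if accesses >= threshold:
                placement[dataset] = tier
                break
    return placement

placement = place({"clickstream": 500, "invoices": 25, "logs_2014": 1})
print(placement)
# {'clickstream': 'ssd', 'invoices': 'hdd', 'logs_2014': 'archive'}
```

A production policy would also weigh data size, tier capacity, and migration cost, and would re-evaluate placements as access patterns drift.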

Smart Data Discovery, Exploration and Mining

From the user’s perspective, the project will create an entity-centric view and organization of the data, making the Data Lake’s contents easier to access, and will provide a suite of mining operations over Heterogeneous Information Networks.
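A Heterogeneous Information Network links entities of different types via typed relations, and mining operations often traverse "meta-paths" over those relations. The tiny sketch below (illustrative only; the entities and relation names are invented) shows the idea:

```python
# Typed edges of a toy heterogeneous information network:
# a person authors a paper, and papers cite other papers.
edges = {
    ("alice", "authored"): ["p1"],
    ("p1", "cites"): ["p2", "p3"],
}

def follow(nodes, relation):
    # Follow one typed relation from every node in the frontier.
    return [t for n in nodes for t in edges.get((n, relation), [])]

def meta_path(start, relations):
    # Traverse a meta-path, e.g. person -authored-> paper -cites-> paper.
    nodes = [start]
    for rel in relations:
        nodes = follow(nodes, rel)
    return nodes

print(meta_path("alice", ["authored", "cites"]))  # ['p2', 'p3']
```

Operations such as similarity search or ranking over a HIN are typically defined in terms of such meta-path traversals.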

Monitor Changes and Assess their Impact

Changes in the Data Lake will be closely monitored, and the results of analyses will be updated incrementally as the underlying data changes, rather than recomputed from scratch.
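Incremental maintenance of a result can be sketched with a running aggregate (an assumed example, not the project's mechanism): each new batch updates the stored state in time proportional to the batch, with no rescan of the lake:

```python
class RunningMean:
    """Maintains a mean incrementally from (count, total) state."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        # Fold in only the new records; the old data is never revisited.
        self.count += len(batch)
        self.total += sum(batch)

    @property
    def value(self):
        return self.total / self.count

m = RunningMean()
m.update([10.0, 20.0])   # initial load
m.update([30.0])         # new data arrives: O(batch) work, no recompute
print(m.value)  # 20.0
```

More complex analytics (joins, rankings, network measures) need correspondingly richer state, but the principle is the same: keep enough state to absorb a change without a full recomputation.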

Support the Data Scientist in the Loop

To leverage the data scientist’s intuition and domain knowledge, scalable and interactive visualizations will be provided for different types of data. The project will also enable filtering, aggregating, ranking and summarizing information in multiple dimensions.
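The filter, aggregate, and rank operations mentioned above compose naturally; the records and field names in this sketch are assumptions for illustration:

```python
from collections import defaultdict

records = [
    {"sector": "tech", "value": 120.0},
    {"sector": "tech", "value": 80.0},
    {"sector": "energy", "value": 150.0},
    {"sector": "retail", "value": 40.0},
]

# Filter: keep records above a threshold.
kept = [r for r in records if r["value"] >= 50.0]

# Aggregate: total value per sector.
totals = defaultdict(float)
for r in kept:
    totals[r["sector"]] += r["value"]

# Rank: sectors by aggregated value, descending.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('tech', 200.0), ('energy', 150.0)]
```

In an interactive visualization, each step would be driven by the user, with the pipeline re-evaluated as filters and dimensions change.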

The results of the project will be assessed in real-world cases from the business intelligence domain, including situations that call for portfolio management, product planning and pricing, and investment decisions.

Project website: https://smartdatalake.eu