A Center for Sustainable Cloud Computing

Cloud Data Management


Minimizing data-to-query time for the era of data deluge

Prof. Anastasia Ailamaki ~ Project Website

In the NoDB project, we recognize the need to minimize the data-to-query time, a need that has arisen as a direct consequence of the data deluge. We propose a new generation of database management systems (DBMS) that are more friendly and accessible to end users because they eliminate the major bottleneck of current state-of-the-art technology, namely the data loading overhead. We advocate in situ querying to manage data, and we extend traditional query processing architectures in accordance with query demands.

As data collections become ever larger, we find ourselves caught in a data deluge: we have far more data than we can move and store, let alone analyze. In NoDB, data files are considered an integral part of the system, and the main copy of the data remains outside the data management system, in user-chosen formats. In contrast with a conventional DBMS, no time is spent up front loading the data or tuning the database system; instead, this is done progressively and adaptively as a side effect of query execution.
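The idea of doing load work only as a side effect of queries can be illustrated with a small sketch. This is not the NoDB implementation; it is a minimal Python illustration under stated assumptions: the raw data is a CSV source with numeric fields, and the class `InSituTable`, its `select` method, and the `scans` counter are hypothetical names introduced here. The raw source stays the main copy, and a column is parsed and cached only the first time a query touches it.

```python
import csv
import io


class InSituTable:
    """Hypothetical wrapper that answers queries directly from a raw CSV source."""

    def __init__(self, open_source):
        # open_source: zero-argument callable returning a fresh text stream
        # over the raw data (assumption: CSV with a header row, numeric fields).
        self._open_source = open_source
        self._columns = {}  # column name -> parsed values, filled lazily
        self.scans = 0      # number of passes made over the raw data

    def _ensure(self, names):
        # Parse only the columns that a query needs and that are not cached yet.
        missing = [n for n in dict.fromkeys(names) if n not in self._columns]
        if not missing:
            return
        self.scans += 1
        parsed = {n: [] for n in missing}
        with self._open_source() as stream:
            for row in csv.DictReader(stream):
                for n in missing:
                    parsed[n].append(float(row[n]))
        self._columns.update(parsed)

    def select(self, column, where_column, predicate):
        # Return the values of `column` for rows where predicate(where_column) holds.
        self._ensure([column, where_column])
        values = self._columns[column]
        keys = self._columns[where_column]
        return [v for v, k in zip(values, keys) if predicate(k)]


# Toy raw data standing in for a user-owned file.
RAW = "id,temp,pressure\n1,20.5,1.0\n2,31.0,0.9\n3,28.0,1.1\n"
table = InSituTable(lambda: io.StringIO(RAW))

# The first query triggers one scan that parses only "id" and "temp".
hot = table.select("id", "temp", lambda t: t > 25)  # [2.0, 3.0]
```

A later query over `id` and `temp` is answered from the cached columns without touching the raw data again, while the first query that mentions `pressure` triggers one more scan that parses only that column. No work is done for attributes that no query ever asks for.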

By not loading the data in advance into the database, we thus avoid:

  • a) The time- and resource-consuming procedure of loading the entire data set; and
  • b) Making critical decisions on the physical database design before real user queries start hitting the system.

Although DBMS remain the predominant data analysis technology overall, providing unparalleled flexibility and performance in query processing, scalability and accuracy, they are rarely used for emerging applications such as scientific analysis and social networks. This is largely due to the prohibitive initialization cost and complexity (loading the data, configuring the physical design, etc.) and the increased “data-to-query” time (i.e., the time from when the data becomes available until the answer to a query is obtained).

The data-to-query time is of critical importance, as it defines the moment when a database system becomes usable and thus useful. For example, a scientist may need to quickly examine several terabytes of new data in search of certain properties: even though only a few attributes might be relevant for the task, the entire data set must first be loaded into the database.

For large amounts of data, this may mean up to a few hours of delay. Additionally, the optimal storage and access patterns may change as often as daily, depending on the new data, its properties and correlations, the ways scientists navigate through the data, and the ways their understanding and interpretation of the data evolve. In such scenarios, no up-front physical design decision can be optimal in light of a completely dynamic and evolving access pattern. As a result, scientists have often had to compromise on functionality and flexibility, relying on custom solutions for their data management tasks to obtain the bread-and-butter functionality of a DBMS, which delays scientific understanding and progress. NoDB will change this.