Efficient Data-Management for Scientific Applications (BrainDB)

Scientists in disciplines such as biology, chemistry, and physics produce vast amounts of data through experimentation and simulation. The volumes produced are already so large that they can barely be managed, and the problem is certain to get worse as the volume of scientific data doubles every year. In the DIAS laboratory we are working on next-generation data management tools and techniques able to manage tomorrow’s scientific data.

We study large and demanding scientific databases and are particularly interested in:

  • aiding scientists with “systems” work, such as database schema, physical design, and data placement on disks (and automating all related procedures to minimize the need for human intervention)
  • designing and developing computational support for popular scientific data structures (and especially ones not currently supported by cutting-edge database technology, such as tetrahedral meshes or protein structures)
  • understanding and aiding the logical interpretation of data (including data cleaning, validation, and schema mapping)

Indexing the Brain – a Petabyte Challenge

In the context of this project we address the particular problems neuroscientists face in their quest to understand and simulate the rat brain. More specifically, we work with neuroscientists in the Blue Brain Project (bluebrain.epfl.ch) to manage the vast amounts of data they produce. Their research, modeling and simulating a fraction of the rat brain, already produces gigabytes of data. With the recent upgrade of their computing infrastructure (IBM Blue Gene/P), the volume of data will soon be on the order of petabytes.

Current solutions are inadequate for managing this data volume, and we are therefore investigating new methods to index and store it in order to provide efficient access. A particular problem we are currently addressing is the retrieval of objects in space, i.e., accessing neurons based on their position. While it is simple to index several thousand neurons, we will have to do it for millions or even billions of neurons, which requires developing new spatial indexes.
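
To make the problem concrete, the following minimal sketch (in Python) shows the simplest possible spatial index: a uniform grid that maps each neuron position to a cell and answers box queries by scanning only the overlapping cells. The class name, cell size, and neuron representation are our own illustrative assumptions, not the structures actually used in BrainDB; at the billion-neuron scale the real challenge is keeping such an index balanced and largely on disk.

    # Minimal sketch of a uniform-grid spatial index for point data
    # (e.g., neuron soma positions). All names and parameters are
    # illustrative assumptions, not the BrainDB implementation.
    from collections import defaultdict

    class GridIndex:
        def __init__(self, cell_size=10.0):
            self.cell_size = cell_size          # edge length of a grid cell (same unit as coordinates)
            self.cells = defaultdict(list)      # (i, j, k) -> list of (neuron_id, position)

        def _cell(self, pos):
            # Map a 3D position to the integer coordinates of its grid cell.
            return tuple(int(c // self.cell_size) for c in pos)

        def insert(self, neuron_id, pos):
            self.cells[self._cell(pos)].append((neuron_id, pos))

        def range_query(self, lo, hi):
            # Return all neurons whose position lies in the axis-aligned box [lo, hi].
            lo_cell, hi_cell = self._cell(lo), self._cell(hi)
            result = []
            for i in range(lo_cell[0], hi_cell[0] + 1):
                for j in range(lo_cell[1], hi_cell[1] + 1):
                    for k in range(lo_cell[2], hi_cell[2] + 1):
                        for neuron_id, p in self.cells.get((i, j, k), []):
                            if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
                                result.append(neuron_id)
            return result

    # Usage: index two neurons and retrieve those inside a query box.
    idx = GridIndex(cell_size=5.0)
    idx.insert("n1", (1.0, 2.0, 3.0))
    idx.insert("n2", (12.0, 0.5, 7.0))
    print(idx.range_query((0.0, 0.0, 0.0), (10.0, 10.0, 10.0)))  # -> ['n1']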

Automated Physical Design

One of the most difficult tasks of a database administrator is defining the physical design of a database. Given a database and a query workload, the administrator has to choose the proper physical design features, such as horizontal and vertical partitions, indexes, or materialized views, in order to speed up the queries in the workload. These features can be combined arbitrarily, leading to a vast search space: with only 20 candidate indexes, for example, there are already 2^20 (over a million) possible configurations.

Without support from the database management system, the only way the administrator can decide on the optimal physical design is to build candidate features manually and then estimate the query execution time for combinations of them. This task is both cumbersome and expensive, as building design features such as indexes takes a considerable amount of time and planning. Automating physical design selection is therefore crucial. In this project we aim at developing algorithms and tools that support the database administrator in finding optimal combinations of design features to speed up query workloads.
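
As one illustration of what such a tool might do, the sketch below (in Python) implements a simple greedy heuristic: starting from an empty configuration, it repeatedly adds the candidate index that most reduces the estimated workload cost and stops when no candidate helps. The cost model, candidate set, and function names are hypothetical placeholders; a real physical design advisor would rely on the query optimizer's own cost estimates.

    # Greedy physical-design selection sketch. The cost model is a
    # hypothetical placeholder; a real advisor would ask the optimizer
    # to estimate workload cost under each candidate configuration.

    def greedy_index_selection(candidates, workload_cost, budget):
        """Pick a set of candidate indexes that greedily minimizes workload cost.

        candidates    -- list of candidate index descriptors, e.g. (table, column)
        workload_cost -- function(config) -> estimated cost of the workload
                         when exactly the indexes in `config` are built
        budget        -- maximum number of indexes to build
        """
        config = set()
        current_cost = workload_cost(config)
        while len(config) < budget:
            best, best_cost = None, current_cost
            for cand in candidates:
                if cand in config:
                    continue
                cost = workload_cost(config | {cand})
                if cost < best_cost:
                    best, best_cost = cand, cost
            if best is None:          # no remaining candidate improves the workload
                break
            config.add(best)
            current_cost = best_cost
        return config, current_cost

    # Toy usage with a made-up cost function: each index speeds up one query.
    def toy_cost(config):
        savings = {("orders", "date"): 40, ("orders", "customer"): 25, ("items", "price"): 5}
        return 100 - sum(s for idx, s in savings.items() if idx in config)

    candidates = [("orders", "date"), ("orders", "customer"), ("items", "price")]
    print(greedy_index_selection(candidates, toy_cost, budget=2))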

According to a World Health Organization report, nearly 6 million people in the USA suffer from neurological disease, and about 180 billion dollars are spent annually on their treatment. Despite this staggering fact, the human brain remains the most mysterious and least understood biological organ. The Blue Brain Project is the first comprehensive attempt to reverse-engineer the mammalian brain in order to understand brain function through detailed simulations. The virtual brain would serve as an exceptional tool, providing neuroscientists with a virtual experimentation environment and a platform to explore fundamental concepts.

This new hope, however, comes with its fair share of challenges. The human brain contains nearly 86 billion neurons, and at present even simulating 100 thousand cells creates a tsunami of data, much of which goes unused because of inadequate support for managing and processing it. Through the BrainDB project, we are committed to finding creative and novel solutions that can aid neuroscientists on their journey towards simulating the entire human brain. In BrainDB we focus on building efficient and scalable tools for processing neural data. In particular, we observe that spatio-temporal access methods are of prime significance, and we have successfully developed novel techniques that provide fast access to spatial data through indexing and prefetching; a sketch of the prefetching idea follows the list below. Apart from the modeling and simulation field, the BrainDB project also focuses on data management problems arising in large-scale distributed environments such as the Blue Gene/P supercomputer. In short, the project aims to contribute by:

  • publishing novel techniques in top-tier database conference proceedings, and
  • developing tools that are used in the production system by neuroscientists to scale up the brain simulation.
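
The prefetching mentioned above is described only in passing; the sketch below (in Python) shows one plausible way to combine it with a grid-partitioned store: when a query touches a cell, the cells adjacent to it are loaded in the background on the assumption that follow-up queries will land nearby. The loader, cache, and neighborhood policy are our own illustrative assumptions, not the techniques actually published by the project.

    # Sketch of neighbor-cell prefetching on top of a grid-partitioned store.
    # The loader, cache, and neighborhood policy are illustrative assumptions;
    # a production version would also synchronize access to the cache.
    from concurrent.futures import ThreadPoolExecutor

    class PrefetchingStore:
        def __init__(self, load_cell, workers=4):
            self.load_cell = load_cell      # function(cell_key) -> objects in that cell (e.g. read from disk)
            self.cache = {}                 # cell_key -> loaded objects
            self.pool = ThreadPoolExecutor(max_workers=workers)

        def _ensure(self, cell):
            if cell not in self.cache:
                self.cache[cell] = self.load_cell(cell)
            return self.cache[cell]

        def get(self, cell):
            # Serve the requested cell, then warm up its 26 neighbors in the
            # background, betting that the next query will touch a nearby region.
            data = self._ensure(cell)
            i, j, k = cell
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    for dk in (-1, 0, 1):
                        if (di, dj, dk) != (0, 0, 0):
                            self.pool.submit(self._ensure, (i + di, j + dj, k + dk))
            return data

    # Toy usage: the "loader" just fabricates a label per cell.
    store = PrefetchingStore(load_cell=lambda c: [f"neurons in cell {c}"])
    print(store.get((0, 0, 0)))   # loads (0,0,0) now, neighbors in the background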