Design Details

Overview

babudb is an embedded database library providing management of persistent key-value indices.

babudb is inherently designed to support replication by providing full access to the operation log. The Java version comes with all the necessary components to set up a replicated database system; the C++ version is a plain embedded database with full access to the operation log.

babudb's architecture is based on LSM (overlay) trees. Instead of transactional operations on persistent B-trees, babudb maintains a large, immutable on-disk index and applies modifications to this index to a set of volatile overlay tress in memory. These modifications are made persistent by appending them to an operation log. All lookups check the overlay trees first and fall back to the on-disk index when no modifications have been made. The overlay trees are periodically merged with the on-disk index in the background.

babudb has been designed to implement replicated databases for systems in which each node embeds a local babudb replica and accesses it exclusively.

What babudb does not provide:

client-server access
a data model beyond key-value pairs
a query engine
ACID transactions

Organisation of a Database

Blocked and Non-blocking access

Stages

Log Format

Indices

A BabuDB index is composed of a set of in-memory trees and a sorted on-disk index. A user snapshot creates a new in-memory index which has higher priority than the existing indices. Each index is traversed in descending order of priority with the disk-index last for a lookup. This is to guarantee that the latest value for a key is returned.

Checkpoints (materialized snapshots) are triggered periodically by an external thread and writes out a new on-disk index. This index is a combination of the old on-disk index and the in-memory indices. Note that a checkpoint is written out asynchronously to disk and can be re-scheduled depending on the disk usage.

Disk structure

The on-disk index is stored as an array of key/value-pairs sorted on the key. The array is divided into blocks and in order to find keys efficiently a sparse index with pointers to each block is constructed. Searching inside the block index is done using binary search. The block index is written out after the array of key/value-pairs, but a copy is always kept in memory. It is also read back into memory upon recovery. This reduces the disk seeks per key lookup to at most one. The entire on-disk index is mmap:ed to let the OS take care of the disk block buffer management.

The blocks are laid out to improve cache locality (see http://www-db.in.tum.de/teaching/ws0506/MMDBMS/download/pax_VLDB2001.pdf for more details). It starts with a header containing a pointer offset to the values, the number of entries in the block and indication if the keys and/or values are variable or fixed length. After that, the keys are stored ending with a list of pointer offsets to each key if they are variable length. The block ends with the values and a list of pointer offsets to each value.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly