Alan Gutierrez

Alan Gutierrez blogs on software, social networks, and himself.

Subscrive Via RSS Feed

Sheaf

Sheaf Toss

Sheaf Toss by Ian Langworth.

Last Saturday morning, on my way to Fair Grinds for my morning coffee, something occurred to me about persistent object storage. I’m less certain as to why. I’ve not thought about that component of my application development in a long time. I implemented persistent storage using the basics of the malloc algorithm in a project that I called Bento.

Pages of Blocks with Identifiers

When I got to Fair Grinds I began to exchange instant messages with Brian Kerr or Ann Arbor, Michigan. I described the idea as follows.

A fixed page length, 8k or 16k. The address is a file position plus a short. It is a big address, I know. The blocks are stored in a page with a number to identify the blocks followed by the object length followed by the block data. To find a block, to the file position, load the page into memory, skip through the blocks until you find the one with the matching block number.

Within a page the block is sought so that within a page blocks can be relocated within the page. If a block is deleted, the contents following the block are shifted up. The moved blocks can still be found because we are skipping through the page form an offset.

Then Brian started asking questions and sorting out details. It was good to have someone think about the problem.

If you have a block larger than a single page, you break it up among pages. You use however many empty pages necessary to hold the block and put the remainder in a page that contains many blocks.

The file format supports journaling, so there is a concept of isolation. Isolation is modeled with a Mutator object. Changes to the file are called a mutation, rather than a transaction, since transaction implies a lot more hocus-pocus.

While the file format can handle concurrent requests, it cannot handle concurrent requests on the same block. It is up to the application developer to ensure that no two threads mutate the same block at the same time. It is up to the application developer to ensure that no thread reads a mutating block. The only manipulation that is permitted concurrently is read. Block allocations are going to be visible only to the thread that allocated them. Writes and frees are going to require synchronization on the part of the application.

In memory, a sorted map of lists of blocks with free space remaining is kept sorted on the amount remaining rounded down to an alignment. When a block is allocated it is rounded up to the next alignment and we search for the block with the smallest amount remaining that can fit the block. That block is used.

Separately, there is a list of pages that have been entirely emptied through deallocations.

The file format uses ByteBuffer objects kept in memory as buffers that represent the contents on disk. The ByteBuffer objects indicate what the disk will look like when the contents of the page are journaled. They can be out of sync with the file, but that is OK. This discrepancy between our in memory representation of the file and the on disk representation of the file allow us to implement journaling.

Commits to this file format are journaled. There are two journals available, an external file journal and an internal journal that uses pages in the file. For the internal journals the file format keeps a block of pointers in the header of the file. The default number of pointers is 64. To create a journal, one of the pointers is set to the address of the first page in a linked list of pages. This is set in the ByteBuffer that is in memory, however and not on disk. The address of the start of the linked list of journal pages is now written out until the journal commits. It is written out last and the file channel is forced to flush. Thus, on disk the pointers will only ever point to a complete journal.

In addition to a linked list of journal pages of journal actions, like allocate, write and free, the journal needs to allocate temporary blocks where the mutation can write changes without disturbing the actual block in memory or on disk. The temporary block contains the applications intent. It is copied over to the actual block during commit.

Delete Makes the File Grow

I set out to implement the file format. I’d hoped to have tests running on Monday. I begged off engagements. I sat before my computer and worked through each next logical task. I was not finished on Monday, so I forced myself to stop and resume my civic pursuits, while fiddling with the code and the concepts in my free time.

By Friday, the idea of the block identifier has run it’s course, before the code was complete.

An external journal meant that this file format would create a nice and compact file format. An internal journal would fill the file with empty journal pages. What really tripped me up was when I realized deletion could not be implemented as simply shifting the blocks following a deleted block up if I wanted the deletion to be journaled. The destination and the source of the copy were on the same page. If the page were to be corrupted their would be no way to recreate it. I would have to allocate copies of blocks following the deleted block in the journal. They would be safely on disk as part of the journal commit. The delete would then copy the blocks over from the journal.

This meant that a delete would actually cause the file to grow and grow large. If the first block on a page is deleted, a temporary block has to be allocated for all the other pages in the block. I couldn’t imagine myself very happy with a larger file.

Can the file ever shrink? Sure. If the pages at the end of the file empty, then we can truncate the file. We can always be on the lookout for empty pages at the end of a mutation and rewind the file.

But then, imagine a case where an application deletes a very large collection of objects. These deletions consume all the free pages for the journal. Then the deletions begin to create new journal pages at the end of the file. If another thread allocates a new object, the other thread couldn’t make use of all the free space created in isolation so it would have to allocate a block from a new page at the end of the file. When the thread with the deletions completes, it will free all of the journal pages, but the file cannot be rewound because there is a user page following all the freed journal pages.

The file would have a huge swath of empty pages just before the last page in the file. They would be reused eventually, but how counter-intuitive to delete objects and have the database grow in size on disk. That would be so hard to explain to people. It feels wrong.

I started to think about indirection, and a name for a new project.

Bento


Authentic c-store Japanese bento by Ramil Sagum.

Malloc

The idea behind Bento it to take a file and divide it up using the malloc algorithm as described by Doug Lea.

Bento is written in Java for JDK 1.4.2 using NIO.

You can read more about Bento on my Wiki and you can paruse the source in my Subversion repsitory and you can use it under LGPL.

Design Goals

Bento is a file format for writing out discrete blocks of data. It is designed for the storage of objects. Bento is not supposed to be a database in itself.

Bento is a robust file format that provides

  • storage of blocks of arbitrary size,
  • random access to stored blocks,
  • atomic writes of multiple blocks,
  • thread-safe allocation and deallocation of blocks,
  • and thread-safe reads and writes.

Funamentally, Bento strives to be a simple data structure that you can use to persist objects in your applications. You can use Bento as the basis for more complicated data structures or databases.

Usage

Bento can be used to serialize objects using Java serialization, to write out objects as Java primitives, or to store Java primitives.

A single Bento file can be organized into a database of different Java types. For desktop applications, this means that a single file can be the container for many different objects of many different classes.

Multiple Bento files can be organized into a database, spread across different file system, to reduce disk contention.

You can design resuable Bento based data structures and mix them into a single Bento file.

Quick Start

An example program that creates a Bento file with a static header and a single allocated block.

Bento.Creator creator = new Bento.Creator();
creator.addStaticBlock(URI.create("flag://agtrz.com/header"), 512);
Bento bento = creator.create(new File("data.bento"));

Bento.Mutator mutator = bento.mutate();
try {
    Bento.Block block = mutator.allocate(512);

    ByteBuffer bytes = block.toByteBuffer();
    for (int i = 0; i < 512 / 4; i++) {
        bytes.putInt(i);
    }
    block.write();

    Bento.Block header = mutator.load(URI.create("flag://agtrz.com/header"));

    ByteBuffer bytes = header.toByteBuffer();
    block.getAddress().write(bytes);
    header.write();

    mutator.commit();
}
finally {
    mutator.rollback();
}

bento.close();

Bento.Opener opener = new Bento.Opener();
bento = opener.open(new File("bento.data"));

ByteBuffer header = bento.read(URI.create("http://agtrz.com/header"));

ByteBuffer bytes = bento.read(new Bento.Address(header));
for (int i = 0; i < bytes.capacity() / 4; i++) {
    if (i != bytes.getInt()) {
        throw new IllegalStateException();
    }
}

bento.close();

Mutation

A Bento file is written using a Bento.Mutator. All of the changes performed by a mutator are atomic. The changes are not made perminent until the commit method of mutator is called. The changes can be discarded by calling the rollback method of the mutator.

After calling commit or rollback the mutator can be reused. Calling commit or rollback when there are no uncommited changes held by the mutator does nothing. Thus, we use a try/finally block to recover the state of a Bento file when a failure occurs.

public void storeObjects(Bento bento, Bento.Address address) {
    Bento.Mutator mutator = bento.mutate();
    try {
        Bento.Block block = mutator.load(address);
        block.toByteBuffer().putInt(this.magicNumber);
        block.write();

        mutator.commit();
    }
    finally {
        mutator.rollback();
    }
}

Concurrency

Bento synchronizes the list of free blocks, but does not synchronize the blocks themselves. Blocks can be allocated and freed in a thread-safe manner without further consideration by the application developer.

If you read and write blocks without some syncrhonization strategy, you’ll turn your Bento file into garbage in short order.

Multiple threads can read the same blocks, but blocks that are being written should not be written or read by other threads. Reads can be shared. Writes are exclusive.

A simple strategy is to use a read/write lock from the concurrency library to gaurd the file during updates.

Bento strives for liveness, however. It places as few locks as possible on the management of list of free blocks.

A more sophisticated strategy would place locks on the blocks themselves. A composite object, such as a tree, could lock a tree root element. Different threads can then operate on different trees in the same file. They would not run afoul of each other.

Atomic Writes

A Bento file is edited using a mutator the object. The mutator object keeps all block writes, allocations, and deallocations are kept in memory until the user commits or rolls back the changes made by the mutator.

Journaling

All reads, writes, allocations and frees are performed on buffers in memory. Only when a mutation is committed or rollback, is the underlying file written.

Journaling is implemented by writing out the list of operations into a journal prior to applying the changes to the file.

The journal can be implemented using blocks within the file, or it can be implemented using external temporary files. The former is good for desktop applications, where the user may move a corrupted Bento file out of reach of the external journals, or unwittingly delete the external journals. The latter is good for a server application, where multiple Bento files and journals can exist in a single directory, watched over by an administrator.