Serialization Conundrums
June 2nd, 2007Strata is a B+Tree that I’ve written in Java. It implements Java serialization. Not to store the leaves of the tree, but to store the definition of the tree itself.
BentoStorage.Creator newStorage = new BentoStorage.Creator();
newStorage.setSize(8 + 8);
newStorage.setReader(new MyRecordReader());
newStorage.setWriter(new MyRecordWriter());
Strata.Creator newIndex = new Strata.Creator();
newIndex.setStorage(newStorage.create());
newIndex.setExtractor(new MyFieldExtractor());
Strata strata = newStrata.create();
Bento.Creator newBento = new Bento.Creator();
FileOutputStream outFile = new FileOutputStream(new File("index"));
ObjectOutputStream out = new ObjectOutputStream(outFile);
out.writeObject(strata);
out.close();
Here is how MyReader might be defined.
public final class MyReader
implements BentoStorage.Reader, Serializable
{
private final static int serialVersionID = 20070602L;
public Object read(ByteBuffer bytes)
{
return new MyRecord(bytes.getLong(), bytes.getLong());
}
}
What if I felt like using this from Groovy and taking advantage of closures?
BentoStorage.Creator newStorage = new BentoStorage.Creator();
newStorage.size = 8 + 8
newStorage.writer = { object, bytes ->
bytes.putLong(object.key)
bytes.putLong(object.version)
}
newStorage.reader = { bytes ->
return new MyRecord(bytes.getLong(), bytes.getLong())
}
Strata.Creator newIndex = new Strata.Creator()
newIndex.storage = newStorage.create()
newIndex.listExtractor = { txn, object ->
Person person = txn.heap.get(txn.unmarshaller, object.key)
return [ person.lastName, person.firstName ]
}
Strata index = newIndex.create()
The problem is that Groovy closures cannot be serialized. Now I cannot serialize Strata. This struck me as a setback for Groovy support, and a problem specific to Groovy, until I stubbed my toe on annonymous inner classes.
Strata.Creator newIndex = new Strata.Creator();
newIndex.setStorage(newStorage.create());
newIndex.setExtractor(new FieldExtractor()
{
public Comparable[] getFields(Object txn, Object object)
{
MyRecord record = (MyRecord) object;
MyDatabase database = (MyDatabase) txn;
Person person = database.getHeap()
.get(database.getUnmarshaller(), record.getKey());
return new Comparable[] { person.getLastName(),
person.getFirstName() };
}
});
Strata index = newIndex.create()
Even if I make the anonymous inner class Serializable, perhaps by deriving from an interface that mixes FieldExtractor and Serializable the class that defines the method that builds the Strata must also be serializable. I’m serializing the hidden reference to the containing class. Unnecessary code. Pointless code. Not really part of the data structure.
This must be a conundrum faced any library that could be built using anonymous inner classes.
I wanted my creational pattern to follow the notion of creatation of the object through construction using the Creator classes once. Thereafter, the Strata is serialized and deserialized.
I do not have to get rid of this pattern, but in order to implement it, people will have to avoid the use of anonymous inner classes to define their readers, writers, and extactors. The should always create static classes that implement Serializable.
If they do want to use anonyomous inner classes for readers, writers and extractors, they can simply repeat the creation of the object through construction using the Creator classes. Same goes for the use of Groovy closures.
This muddles an assumption that I had higher in my application stack. I am working on an object database that I call Depot.
The object database I’m building based on Strata stores the Strata B+Tree definitions in a heap. I created a bridge interface for the FieldExtractor, where the object database fishes the object out of the heap, and so only has one parameter, the object found in the heap, and does not pass in the application specific transaction context object.
public class Strata
{
public interface FieldExtractor
{
Comparable[] getFields(Object txn, Object object);
}
}
public class Depot
{
public interface FieldExtractor
{
Comparable[] getFields(Object object);
}
}
I’d written example code to imagine how the API would work. With Strata I’d always made Strata.FieldExtractor static, because it was always so complicated, turning a key into an object. With Depot that was taken care of, so the application only had to crack the object and return the array of @Comparable@S upon which an alterate index would sort a set of objects.
Furthermore, I’d decided that for a desktop application, it would be nice to keep the definition, trees and heap in one file. Not thinking, I’d made so that the initial version of Depot demands that the Depot.FieldExtractor be Serializable. My code examples of how clever I am, to define an index in a few short lines of code, did not work out.
It is not so much a conundrum. I merely need to make this one file format optional. The definition could be deserialized from the one file, or it could simply be rebuilt. It’s an application developers choice. If the application developer choses to make her @Depot.FieldExtractor@S static and serializable, then I can provide a static function that will create a file store for use as a heap, that tucks that definition into a longish header field.
I’m sure you could serialize a compound Swing view after you’ve built it, but you could also build it each time and only serialize the user preference settings, like the position of a window splitter.
If anyone out thinks that I misunderstand the trade off, please let me know. It caught me off gaurd. I was going to trouble the Groovy listserv with some questions, but this understanding came to me while I was getting around to that.
comment
Strata Storage Strategies
April 6th, 2007At the outset, Strata was developed for use with in memory and file backed implementations of tiers.
When in memory, branches and the objects can be stored in arrays. The default implementation uses ArrayList classes to store objects.
When file backed, their are two types of objects that can be stored, variable length objects and fixed length objects. Objects can be stored within the tiers or by reference.
When file backed, tiers are softly or weakly referenced, so that they can be collected. Thus, they can act as caches of the objects they reference.
Storing by refernece means storing an address. Whether or not the object is stored in the tier, or by reference outside the tier, the object needs to be in memory for comparisons, for traversal at both the leaf and the inner tier.
By reference storage can keep the object within the in memory tier, or it can perform a look up of the referenced object. The object is an simple object, like a string, it might be easier to keep it in the memory tier, as part of an address, object pair. If it is a complicated object, it can load the object through a caching mechanism.
Strata Objectives
April 4th, 2007These were once the objectives for Strata, reflecting an early understanding of the project. I’m about to rewrite them, so I thought I’d put them somewhere, besides the wiki history.
Thus, the short-term objectives for Strata are as follows.
- Leaf nodes only to store references to file positions, not store the data itself.
- Branches also store only references to file positions, and hold a list of objects read from those file positions to use as keys.
- Thread-safe.
- File backed, paged.
- Supports efficent insertion and deletion, while maintaining the balance of the B-Tree.
- Equality comparison query.
- Compares byte buffer ranges, not objects.
Crazy thoughts included storing partial in the inner tiers on pages, but this presents a strange case where a partial key might not fit on a page, if the length of a key is unbounded.
What changed? For the first two points, only the way I’d state them. Since it’s now a given, I don’t know if I’d state them. There are also a few more devils in those two details, only the fields of interest may be cached for example.
“Equality comparison query”, meant that the only operator supported is the equality operator, which is also given, but that’s all you need to implement between, less than, greater than, etc. Probably could not see that then.
“Compares byte buffer ranges, not objects”, the meaning of this has escaped my memory.
The crazy thoughts are reflections on how to store the objects referenced by the B+Tree. There was a time when I thought that the objects might be stored in the pages of the tree.
B-Tree
September 8th, 2006I’m implementing a B-Tree in Java. The project is called Strata. You can view the source, is is in the strata subdirectory of my subversion repository.
I’ve yet to choose a license.
This tree is the first indexing option in Momento, which is a large file XML database that I wrote last year, when I was in Ann Arbor.
Getting things done in Java is a challenge. There are many ways to get destracted.
Strata will index the normalized text of a specific node, and do so without concern about range queires. The only query supported at the outset is equality.
Then I can use Lucene to do full text indexing. In order for Lucene to be effective, you see, it must be a simple matter to retrieve the document. Indexing unique atom identifers, would give a key to fetch the document resulting from a Lucene query.
The research that I’m doing on B-Tree implemetnation can be tracked with my del.icio.us account, under the tag strata.
Thus, the objectives for Strata are as follows.
- Leaf nodes only to store references to file positions, not store the data itself.
- Branches also store only references to file positions, and hold a list of objects read from those file positions to use as keys.
- Thread-safe.
- File backed, paged.
- Supports efficent insertion and deletion, stays balanced.
- Equality comparison query.
- Compares byte buffer ranges, not objects.
At some point, it may be better to keep partial keys in the tree itself.
Update: I’ve written a description of the tree structure at my user page on the New Orleans Wiki, under a Software Development page.





