We have built an open application framework suitable for those who wish to construct highly modular, decoupled, extensible data management applications.
We chose Java as our implementation language because it offers many improvements to the traditional desktop environment with built-in support for networking, security, multithreading, and advanced interfaces. Further, with no pointers, no multiple inheritance, strong typing, garbage collection, a rooted hierarchy, dynamic linking, component support, and declared exceptions, it is a much cleaner and safer language than C++. Finally, compared to the somewhat cleaner Smalltalk, its licensing and hardware independence make it the cheapest portable language available.
The classes and interfaces in the KnownSpace API provide a framework for creating impressive flexibility. The Helium application has been created to demonstrate this flexibility as well as the ease with which heterogeneous, arbitrarily-structured data can be queried and manipulated in KnownSpace. Helium is an advanced email navigator. Its point is to allow the user to make sense of, and find things within, large quantities of disorganized email: perhaps hundreds of thousands to millions of messages.
Such applications are not well understood, so any solution is likely to be only one more step along the road to better solutions. If each solution is unlikely to be complete, high flexibility is a priority. Further, with ever faster and cheaper machines coming out each year, high efficiency is not as important as it used to be. So we value flexibility and reliability above efficiency and completeness. Consequently, the system needs the functionality described below.
Normally, programmers try this sort of thing with data that can have several attributes, which can help cluster the data better and provide more intelligence about what the data is and how to manage it, as, for example, in the Be operating system. A natural first step toward a design then is to have one class, let's say Entity, to hold data, a second class, Attribute, to hold attributes, and a third class, Link, to establish links between them.
There is, however, no clear distinction between entities and attributes in the general case. An email message, for instance, seems like a natural entity (chunk of data), but it could also be an attribute of another entity, say, a message thread, which could be an attribute of yet another entity, namely, a cluster of related threads, and that cluster could be an attribute of a yet more general entity, a topic or subject.
Each of the above possible entities (a message, a thread, a cluster, a topic) could also be an attribute of any other. For example, a message might have topics which do not apply generally to its thread, or might be in a cluster to which the rest of its thread does not belong. Similarly, an email message has many attributes (subject, sender, recipient, date, body, signature, attachments), but it can also be an attribute of a person, a mailing list, a thread, a cluster, and even a subject.
As for entity linkage, that's normally done with some form of hierarchy: a cluster has threads, a thread has messages, a message has a recipient. This view of reality is, however, inflexible since it cannot capture, for instance, facts like messages having multiple recipients, or a cluster containing single messages but not their respective threads. Hierarchy alone is insufficient to capture the complex nuances of electronic communication.
So for greater flexibility we chose only one class,
Entity, to implement all three concepts:
data, attributes, and links. Entities have values (their data) and links to other
entities (their data relationships). Linkage between any two entities establishes a
relationship between the data they each store. There are no limitations on what data an
entity can store, and no limitations on which entities that entity can be linked to, so
developers may build arbitrary data and arbitrary data relationships.
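Under this design a message, its thread, and its topic are all just entities linked together. A minimal sketch of such a single class might look as follows; the field and method names here are illustrative, not the actual KnownSpace API:

```java
// Illustrative sketch only: one class plays all three roles---data,
// attribute, and link. Method names (linkTo, isLinkedTo) are assumptions.
import java.util.HashSet;
import java.util.Set;

public class Entity {
    private Object value;                               // the entity's data
    private final Set<Entity> links = new HashSet<>();  // its data relationships

    public Entity(Object value) { this.value = value; }

    public Object getValue() { return value; }
    public void setValue(Object value) { this.value = value; }

    // Linking two entities establishes a relationship between the data they store.
    public void linkTo(Entity other) {
        links.add(other);
        other.links.add(this);
    }

    public boolean isLinkedTo(Entity other) { return links.contains(other); }
}
```

Because linkage is symmetric and unconstrained, a message can be linked to a thread, and that same thread can be linked to a topic, with no fixed hierarchy imposed.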
To search for entities we need some way to specify sets of them---that is, conditions on
their values and interconnections to other entities. Normally, at least in relational,
object-relational, and object-oriented databases, that is done with templates, which are
sequences of strings specifying what a particular numbered value should or shouldn't be.
A template, however, is too limiting to accommodate arbitrary data and relationships.
Instead, we use constraints, an arbitrarily expandable way to specify arbitrary conditions on entities.
Constraints describe sets of entities and they can be used both for searches and for subscriptions (about which more later). And entities themselves are flexible enough to store sets of entities as well---either as their value, or as the value of one of their attributes, or as their attributes themselves. Thus we have a way to describe sets of entities (constraints) and a way to represent and store sets of entities (entities themselves). Further, constraints can come in unlimited variety and combination, so developers can accommodate future entities simply by writing new constraints to describe them.
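The key property is that constraints compose: new conditions can be built out of old ones. A minimal sketch, with hypothetical names (the real KnownSpace constraint API may differ), might be:

```java
// Illustrative sketch: a constraint is a predicate over entity values that
// composes with other constraints. Names here are assumptions.
import java.util.List;
import java.util.stream.Collectors;

interface Constraint {
    boolean satisfiedBy(Object entityValue);

    // Composition lets developers describe ever more specific sets.
    default Constraint and(Constraint other) {
        return v -> this.satisfiedBy(v) && other.satisfiedBy(v);
    }
}

class ConstraintDemo {
    // Selecting the subset of a collection that satisfies a constraint.
    static List<Object> select(List<Object> pool, Constraint c) {
        return pool.stream().filter(c::satisfiedBy).collect(Collectors.toList());
    }
}
```

A developer accommodating a new kind of entity writes a new `Constraint` implementation rather than changing the search machinery.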
To use entities in constraints for search we gave each entity a name as well as a value and a set of links since it's customary to refer to chunks of data with names. An entity's name could, of course, be just a value of another entity linked to the first entity, but for such a common use that would double the number of objects to little purpose, plus of course the new entity would also have to be named so that it could be searched for. So we include an entity's name as part of its state, rather than just as the value of one of its attributes. However, other properties that are normally taken as special, like size or last modification date, are just values of attributes of the entity and so don't need special methods to access them.
To search for entities based on their properties (name, value, and linkage) we invented
another class, a
Pool, which stores all the
entities. To delete an entity we remove it from its pool. Pools are also crucial for
subscriptions, as we'll see later.
Pools present many difficult design problems, however. For example, if one pool is
inside another pool, can the second pool be put into the first pool as well? Also, if
entities can belong to multiple pools (which seems reasonable) then which pool, if any,
has jurisdiction when we eventually add permissions to the system? Finally, pools are
similar to entities, since entities can themselves store sets of entities. However,
merging Pool into Entity, or deleting Pool
entirely in favour of
Entity, means making all the above design decisions
right now. So we decided instead to leave
Pool alone for now and only
allow one (default) pool. Later, when we start supporting multiple users we will add
support for multiple pools.
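The single-default-pool decision can be sketched in a few lines; the names below are illustrative, not the actual KnownSpace API:

```java
// Illustrative sketch: one default pool stores all entities, answers
// searches, and deletes an entity by removing it. Names are assumptions.
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class Pool {
    private static final Pool DEFAULT = new Pool();  // only one pool for now
    private final Set<Object> entities = new HashSet<>();

    public static Pool getDefault() { return DEFAULT; }

    public void add(Object entity) { entities.add(entity); }

    // Deleting an entity simply means removing it from its pool.
    public void remove(Object entity) { entities.remove(entity); }

    // Searching means finding every entity that satisfies a constraint.
    public Set<Object> search(Predicate<Object> constraint) {
        return entities.stream().filter(constraint).collect(Collectors.toSet());
    }
}
```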
In sum, so far we have found ways to represent data (entities), represent data relationships (entities again), represent sets of data (entities yet again), describe sets of entities (constraints), store entities (pools), and search for entities (pools again). With economy of design, only four classes accomplish all six of these functions. To that set of functions we next add an ability to transparently store entities and their relationships indefinitely by making them persistent.
The standard way to accomplish persistence is to require that all objects be
Serializable. It's easy enough to make entities Serializable since
Entity is under kernel control, but to ensure that their values are
Serializable we would have to explicitly require that all entity values implement
Serializable. This, however, would not be particularly extensible.
Suppose, for instance, we later decided that all entity values should also implement
Comparable or some other interface. To
accomplish that would require changing a lot of code and there would be no easy way to
make sure that all entity values do in fact also implement the new interface.
To get around such expandability problems, all entity values must implement the
EntityValue interface. If the
requirements for entity values change in the future, all we need do is change the
EntityValue interface and all non-conforming code will automatically break,
forcing correct updating of the entire system.
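The idea is a single funnel interface. A sketch, assuming EntityValue is a marker interface that currently only requires serializability (future requirements would be added to its declaration):

```java
// Illustrative sketch: all requirements on entity values flow through one
// interface. If entity values must later also be Comparable, only this
// declaration changes, and every non-conforming value class fails to compile.
import java.io.Serializable;

interface EntityValue extends Serializable {
    // Future requirements on entity values would be declared here.
}

// A conforming value class picks up every requirement automatically.
class StringValue implements EntityValue {
    private final String s;
    StringValue(String s) { this.s = s; }
    @Override public String toString() { return s; }
}
```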
To keep the system as expandable as possible we also want developers to be able to use
objects of any class as entity values. This presents problems, however, when the
class's source code is unchangeable and the class itself is unsubclassable, such as, for
example, java.net.URL. In such cases, developers should create a proxy class
that implements EntityValue and also implements the same interface
as the class the developer wishes to use for particular entity values, then delegate
from each of the proxy's
public methods to the equivalent
method of the real entity value class.
In this way, eventually all the commonly used classes will have EntityValue
versions and future developers will not have to do any conversion work themselves. To
further simplify the developer's job we provide several classes that already implement the
EntityValue interface. These classes let developers create entity
values for standard values like integers, doubles, booleans, strings, dates, and so on.
This solution isn't particularly elegant since it forces some code duplication for new unmodifiable and unsubclassable classes, but we can see no way around it at present.
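For the java.net.URL example above, the proxy idiom might look like this sketch (EntityValue here is a stand-in marker interface; the class and method selection are illustrative):

```java
// Illustrative sketch: java.net.URL is final and unmodifiable, so a proxy
// class wraps it, implements the (stand-in) EntityValue marker, and delegates
// each needed public method to the real URL underneath.
import java.io.Serializable;
import java.net.MalformedURLException;
import java.net.URL;

interface EntityValue extends Serializable {}  // stand-in for the real interface

class URLValue implements EntityValue {
    private final URL url;  // the real entity value class

    URLValue(String spec) throws MalformedURLException {
        this.url = new URL(spec);
    }

    // Delegate each public method the application needs.
    public String getHost()     { return url.getHost(); }
    public String getProtocol() { return url.getProtocol(); }
    public URL asURL()          { return url; }
}
```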
All this complexly related, arbitrarily searchable, and arbitrarily extensible persistent data is pointless without any way to write code to manipulate it, so we also need support for applications. Normally, programmers view applications as single, monolithic things, but that is inflexible. A highly coupled program is hard to modify---and if it is large, it is also hard to create in the first place. Further, it is hard for teams of independent developers to add or subtract functionality. Also, it is difficult to dynamically add or subtract functionality without a lot of forethought. Finally, it is likely to be unreliable because it is often also brittle since so many parts intimately depend on so many other parts.
To escape these problems, and therefore increase the flexibility and reliability of the system as a whole, we encourage the writing of application code in small, independent, plug-compatible chunks that can better function as mix-and-match components. Such a style of programming lets many developers work on many different parts simultaneously and separately, and they can then combine those parts in many more ways than before with far less developer interaction than today's monolithic style.
Currently, we support two (possibly three) ways to add code to the system. The first is
to write a
Simpleton, which is
essentially a runnable piece of code, except that the kernel manages thread allocation
and control for efficiency (and to leave room later on to add permissions). Most
affordable machines today can only support a few hundred Java threads, and creating a
new thread is expensive, so we will eventually pool threads and hand them out to new
applications as old applications end. Developers can dynamically incorporate simpletons
into the running system without system recompilation. We discuss the other ways of
adding computations next.
Aside from future security limitations (about which more later), there are no limitations on what a simpleton can do.
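The division of labour is that the simpleton supplies only its behaviour while the kernel owns thread allocation. A sketch under those assumptions (class and method names are illustrative, and the fixed-size pool merely stands in for the eventual pooling scheme):

```java
// Illustrative sketch: simpletons are runnable chunks of code, but the
// kernel, not the simpleton, decides when and on which thread they run.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

abstract class Simpleton {
    // A simpleton supplies only its behaviour...
    abstract void run();
}

class Kernel {
    // ...while the kernel owns the (eventually pooled) threads.
    private static final ExecutorService threads = Executors.newFixedThreadPool(4);

    static void start(Simpleton s) { threads.submit(s::run); }
    static void shutdown() { threads.shutdown(); }
}
```

Because the kernel hands out threads, it can defer starting a simpleton, register each thread to its simpleton for accountability, and cap the total thread count on small machines.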
Many application tasks do not need a separate thread because they only need to execute
on certain conditions dependent on state change caused by other applications. So
besides simpletons, we also support an ActiveEntityValue interface.
These pieces of code are stored inside the data itself and are triggered whenever the
entities they are stored in are examined by some actor (that is, a simpleton or another
ActiveEntityValue). They do not have their own threads.
This is particularly useful for triggered computations on a base entity of the entity the active entity value is stored in (that is, an entity that the entity it is stored in is an attribute of). For instance, an active entity value might report how many entities are currently linked to its base entity, or it might periodically save its base entity's value, write its number of accesses to a log file, ping the network, pop up a window for a user to enter a password, and so on. It may do even more general things too since, again aside from future security limitations, there are no limitations on its actions. It may even alter itself.
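The access-counting example above can be sketched as follows; the interface shape (a single callback run in the examiner's thread) is an assumption, not the actual KnownSpace signature:

```java
// Illustrative sketch: an active entity value is code stored as a value and
// triggered when its entity is examined. It borrows the examiner's thread.
interface ActiveEntityValue {
    Object onExamined(Object baseEntityValue);  // hypothetical callback
}

// For example, report how many times the base entity has been examined.
class AccessCounter implements ActiveEntityValue {
    private int accesses = 0;

    @Override
    public Object onExamined(Object baseEntityValue) {
        return ++accesses;  // could instead log, ping, save, or pop up a dialog
    }
}
```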
For the kernel to distinguish between active and passive entity values, and exploit that
knowledge to improve performance, most persistent entity values should implement the
PassiveEntityValue interface.
This separates the methods specific to each kind of entity value---active and
passive---from the supertype, EntityValue.
Finally, it should be possible to make
"active entities", that is, subclasses of
Entity that are also
actors---although none yet exist. We don't know which of the three kinds of
actors is better, or whether all three should continue to work together, or in what
combinations, since those decisions may depend on the application, so we allow all
three: simpletons, active entity values, and, possibly, active entities.
At this point we have an extremely decentralized architecture. Small, independent actors are loosely coupled with the data and with each other so developers can dynamically attach arbitrary computations to arbitrary data. Next we need some way to let actors communicate to accomplish larger tasks.
We want to let all these actors working on all this data communicate to accomplish their
various tasks. One standard way to do this is to have direct method calls between
actors, but that would make the system inflexible since it would force strong coupling
between the actors doing the calling. Instead, we support a mechanism to report a
DataManagerEvent (that is, report state
change) either for data or actors, which other data or actors can then trigger on.
Developers can arbitrarily extend
DataManagerEvents to support publication
of new types of state change as they add new types of data or actors to the system.
Following event delegation implementations like Swing, it might seem reasonable to only allow entities and pools to generate events, but forcing that to always be the case is inflexible because we have no idea what new class of objects someone may want to add to the system in future. So it seems reasonable for any object whatsoever to be able to generate events.
In static implementations of the Observer design pattern, like the Java 1.1 event
delegation model, this sort of thing is handled on an object-to-object basis. Each
object advertises the particular set of events it can generate (via the
addXXXListener() methods it supports) and other objects subscribe directly
to particular objects to be alerted about that object's state change. While this might
be fine for user interfaces with a fixed number of widgets, it is inflexible since it
increases the coupling of the system. All objects must be directly aware of any object
they wish to subscribe to, instead of only the particular state change they wish to
notice. Further, it means that every event-generating object has to have a plethora of
compile-time declared methods to advertise their event-generating capability.
Instead, we have an EventGenerator
interface for objects wishing to generate events, and an
EventHandler interface for those
wishing to catch events. Any piece of code, data or actor, can implement
EventHandler (there is a separate event thread inside the kernel to pass on
the handle requests), and, by tossing events into the pool, any piece of code can also
be a (kind of) proxy
EventGenerator without needing special kernel support.
In this scheme, a pool itself becomes the proxy source of events---as in Linda and its modern descendants, JavaSpaces and TSpaces---and that decouples sender from receiver. The sender doesn't need to know that the receiver exists, the receiver doesn't need to know that the sender exists, and the sender and receiver don't even have to exist at the same time. Senders can generate events for receivers that don't exist yet and receivers can listen to events from senders that no longer exist. (Our present implementation of the event channel, however, requires them to coexist, since we don't yet store events in the pool until needed.)
The only knowledge that sender and receiver need to share is the knowledge of existence
of a pool and the knowledge of the particular type of events to throw in or fish out.
This lets actors notice state change in data or in actors and therefore
"communicate"---but without the flexibility disadvantages of direct coupling or numerous
Listener methods. A pool thus acts as an event channel (an event queue and
router) as in the InfoBus architecture.
In case developers need it, we also support another way to notice state change, and that
is to subscribe to an entity directly (all entities and pools are
EventGenerators), but we discourage its use since it couples sender and receiver.
Finally, the overall best way to communicate when there is no time pressure on the communication act is simply to attach a particular attribute to the entities to be further worked on; future actors can then find them through a search on the pool and work on them at their leisure. Events are only necessary when there is a need for urgent action.
To describe arbitrary sets of events we again use constraints, so any actor can subscribe to a pool to listen for an arbitrary set of events. This gives developers an arbitrarily extensible event notification scheme.
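The pool-as-event-channel scheme can be sketched like this; the class shapes are assumptions, standing in for the real EventGenerator/EventHandler machinery:

```java
// Illustrative sketch: the pool acts as the event channel. Senders toss
// events in; receivers subscribe with a constraint describing the events
// they want. Sender and receiver never reference each other directly.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class Event {
    final String type;
    Event(String type) { this.type = type; }
}

interface EventHandler {
    void handle(Event e);
}

class PoolChannel {
    private static class Subscription {
        final Predicate<Event> constraint;
        final EventHandler handler;
        Subscription(Predicate<Event> c, EventHandler h) { constraint = c; handler = h; }
    }

    private final List<Subscription> subs = new ArrayList<>();

    // A receiver describes, via a constraint, the set of events it wants.
    void subscribe(Predicate<Event> constraint, EventHandler handler) {
        subs.add(new Subscription(constraint, handler));
    }

    // Any code can generate events by tossing them into the pool.
    void generate(Event e) {
        for (Subscription s : subs)
            if (s.constraint.test(e)) s.handler.handle(e);
    }
}
```

Note how the sender's call to `generate` carries no reference to any receiver, and the receiver names only the event set it cares about, not the objects producing it.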
Finally, we need some way to report execution error specific to the system, as opposed
to the Java virtual machine the system is running on. For that we support
DataManagerException. Right now there is
only one, but we expect to add many more as the system is refined over the next year
to report more fine-grained exceptional conditions. As with the rest of the system,
developers are free to arbitrarily extend the DataManagerException
hierarchy of classes and interfaces.
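Extending the hierarchy is ordinary Java subclassing. A sketch (the subclass below is hypothetical, invented only to illustrate the idiom):

```java
// Illustrative sketch: a root exception for system-specific errors, which
// developers subclass for finer-grained conditions. NoSuchEntityException
// is a hypothetical example, not an existing class.
class DataManagerException extends Exception {
    DataManagerException(String message) { super(message); }
}

class NoSuchEntityException extends DataManagerException {
    NoSuchEntityException(String name) { super("no entity named " + name); }
}
```

Code that only knows about DataManagerException still catches every future subclass, so the hierarchy can grow without breaking existing handlers.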
Much has to happen behind the scenes to give the appearance of no limitations while still allowing room for the kernel to implement security protocols, support persistence, enforce data integrity, and manage system resources.
The kernel's first line of defence is to use proxies for entities to limit arbitrary
message passing. There is no way, for instance, for non-kernel applications to acquire
a reference to an actual entity; all an application developer can ever have is a
reference to an entity proxy. This is why, for instance, class Entity has no
public constructor; all entity creation requests must go through
Entity.create(). The lack of direct references is transparent to the
developer because the entity proxy implements the same interface as the (real)
Entity class. Entity proxies can then function as smart links,
tracking or disallowing any method access to any entity from any actor or entity, should
the kernel deem it necessary.
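The proxy defence can be sketched as follows; the class names are illustrative, and only the shape (private constructor, factory method, identical interface, delegation) reflects the design described above:

```java
// Illustrative sketch: applications only ever hold a proxy implementing the
// same interface as the real entity, so the kernel can interpose on every call.
interface EntityInterface {
    Object getValue();
}

class RealEntity implements EntityInterface {
    private final Object value;
    RealEntity(Object value) { this.value = value; }
    @Override public Object getValue() { return value; }
}

class EntityProxy implements EntityInterface {
    private final RealEntity target;  // never handed to applications
    private int accesses = 0;         // a smart link can track or veto calls

    private EntityProxy(RealEntity target) { this.target = target; }

    // No public constructor: all creation goes through a factory,
    // as with Entity.create().
    static EntityInterface create(Object value) {
        return new EntityProxy(new RealEntity(value));
    }

    @Override
    public Object getValue() {
        accesses++;  // audit (or disallow) the access here
        return target.getValue();
    }
}
```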
The kernel's second line of defence is to disallow raw thread creation. Currently, the only legal way to get a new thread is to create a simpleton, and even then there is no guarantee that the new simpleton will be given a thread immediately. All threads will eventually be pooled, and handed out only as other simpletons die. Any unauthorized threads will be killed. Thread pooling will let the kernel control the number of threads and also will let it register each thread to a particular simpleton, so accountability is always enforceable. Finally, it also will let the system run on low-memory or slow machines, although so far we have not developed that capability.
By controlling entities and threads the kernel has the beginnings of both a memory manager and a process manager, so priorities, scheduling, authentication, caching, and logging are possible (although only weakly implemented at present).
Here is the package structure for the kernel and related packages:
org
  datamanager
    kernel (Entity, EntityValue, Simpleton, Pool, EventHandler, and EventGenerator)
    tools (tools like Debug)
    passiveentityvalue (passive entity values)
    activeentityvalue (active entity values)
    constraint (constraints)
    event (datamanager events)
    exception (datamanager exceptions)