Helium Developer Documentation

The KnownSpace API Design

We have built an open application framework suitable for those who wish to construct highly modular, decoupled, extensible data management applications.

We chose Java as our implementation language because it offers many improvements to the traditional desktop environment with built-in support for networking, security, multithreading, and advanced interfaces. Further, with no pointers, no multiple inheritance, strong typing, garbage collection, a rooted hierarchy, dynamic linking, component support, and declared exceptions, it is a much cleaner and safer language than C++. Finally, compared to the somewhat cleaner Smalltalk, its licensing and hardware independence make it the cheapest portable language available.

The Task

The classes and interfaces in the KnownSpace API provide a framework for building highly flexible applications. The Helium application has been created to demonstrate this flexibility as well as the ease with which heterogeneous, arbitrarily-structured data can be queried and manipulated in KnownSpace. Helium is an advanced email navigator. Its purpose is to let the user make sense of, and find things within, large quantities of disorganized email: perhaps hundreds of thousands to millions of messages.

Such applications are not well-understood, so any solution is likely to be only one more step along the road to better solutions. If each solution is unlikely to be complete, high flexibility is a priority. Further, with ever faster and cheaper machines coming out each year, high efficiency is not as important as it used to be. So we value flexibility and reliability above efficiency and completeness. Consequently, the system needs the following functionality:

some way to represent arbitrary data,
some way to represent arbitrary data relationships,
some way to represent arbitrary sets of data,
some way to search for arbitrary data,
some way to store data persistently,
some way to support multiple simultaneous applications,
some way to let applications communicate indirectly,
some way to protect applications from each other,
some way to protect the system from rogue applications,
some way to manage system resources,
some way to allow continuous evolution of the system as a whole.

Entities

Normally, programmers try this sort of thing with data that can have several attributes, which can help cluster the data better and provide more intelligence about what the data is and how to manage it, as, for example, in the Be operating system. A natural first step toward a design then is to have one class, let's say Entity, to hold data, a second class, Attribute, to hold attributes, and a third class, Link, to establish links between them.

There is, however, no clear distinction between entities and attributes in the general case. An email message, for instance, seems like a natural entity (chunk of data), but it could also be an attribute of another entity, say, a message thread, which could be an attribute of yet another entity, namely, a cluster of related threads, and that cluster could be an attribute of a yet more general entity, a topic or subject.

Each of the above possible entities: a message, a thread, a cluster, a topic, could also be an attribute of any other. For example, a message might have topics which do not apply generally to its thread, or might be in a cluster to which the rest of its thread does not belong. Similarly, an email message has many attributes (subject, sender, recipient, date, body, signature, attachments) but it can also be an attribute of a person, a mailing list, a thread, a cluster, and even a subject.

As for entity linkage, that's normally done with some form of hierarchy: a cluster has threads, a thread has messages, a message has a recipient. This view of reality is, however, inflexible since it cannot capture, for instance, facts like messages having multiple recipients, or a cluster containing single messages but not their respective threads. Hierarchy alone is insufficient to capture the complex nuances of electronic communication.

So for greater flexibility we chose only one class, Entity, to implement all three concepts: data, attributes, and links. Entities have values (their data) and links to other entities (their data relationships). Linkage between any two entities establishes a relationship between the data they each store. There are no limitations on what data an entity can store, and no limitations on which entities that entity can be linked to, so developers may build arbitrary data and arbitrary data relationships.
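A minimal sketch of this idea follows. It is not the KnownSpace Entity class: the names EntitySketch, linkTo, and getLinks are hypothetical, and treating links as directed is an assumption made only to keep the example small. It shows a message with two recipients, and a cluster that holds a message without holding that message's thread, both of which a strict hierarchy cannot express.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the KnownSpace Entity class.
class EntitySketch {

    // One class covers data (the value) and relationships (the links).
    static final class Entity {
        private final Object value;
        private final List<Entity> links = new ArrayList<>();

        Entity(Object value) {
            this.value = value;
        }

        Object getValue() {
            return value;
        }

        // Assumption: links are directed; linking twice has no extra effect.
        void linkTo(Entity other) {
            if (!links.contains(other)) {
                links.add(other);
            }
        }

        List<Entity> getLinks() {
            return Collections.unmodifiableList(links);
        }
    }

    public static void main(String[] args) {
        Entity message = new Entity("Lunch on Friday?");
        Entity alice = new Entity("alice@example.org");
        Entity bob = new Entity("bob@example.org");
        Entity thread = new Entity("thread: lunch plans");
        Entity cluster = new Entity("cluster: social");

        // A message may have several recipients.
        message.linkTo(alice);
        message.linkTo(bob);

        // The cluster holds the message directly, without its thread.
        thread.linkTo(message);
        cluster.linkTo(message);

        System.out.println("recipients: " + message.getLinks().size());
        System.out.println("cluster holds thread: " + cluster.getLinks().contains(thread));
    }
}
```

Because any entity can be the value holder, the attribute, or the target of a link, no separate Attribute or Link class is needed in the sketch.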

Constraints

To search for entities we need some way to specify sets of them---that is, conditions on their values and interconnections to other entities. Normally, at least in relational, object-relational, and object-oriented databases, that is done with templates, which are sequences of strings specifying what a particular numbered value should or shouldn't be. A template, however, is too limiting to accommodate arbitrary data and relationships. Instead, we use a Constraint, an arbitrarily expandable way to specify arbitrary conditions on entities.

Constraints describe sets of entities and they can be used both for searches and for subscriptions (about which more later). And entities themselves are flexible enough to store sets of entities as well---either as their value, or as the value of one of their attributes, or as their attributes themselves. Thus we have a way to describe sets of entities (constraints) and a way to represent and store sets of entities (entities themselves). Further, constraints can come in unlimited variety and combination, so developers can accommodate future entities simply by writing new constraints to describe them.

To use entities in constraints for search, we gave each entity a name as well as a value and a set of links, since it's customary to refer to chunks of data by name. An entity's name could, of course, be just the value of another entity linked to the first, but for such a common use that would double the number of objects to little purpose. Worse, the new entity would itself have to be named so that it could be searched for. So we include an entity's name as part of its state, rather than as the value of one of its attributes. However, other properties that are normally treated as special, like size or last modification date, are just values of attributes of the entity and so need no special methods to access them.
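The following is a rough sketch of how composable constraints over name, value, and linkage might look. The method satisfiedBy, the combinators and and not, and the helper constraints named, valueContains, and linkedTo are all hypothetical; the real Constraint classes may be shaped quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and signatures are illustrative only.
class ConstraintSketch {

    // Minimal stand-in for an entity: a name, a value, and links.
    static final class Entity {
        final String name;
        final Object value;
        final List<Entity> links = new ArrayList<>();

        Entity(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    // A constraint describes a set of entities by answering one question.
    interface Constraint {
        boolean satisfiedBy(Entity e);

        default Constraint and(Constraint other) {
            return e -> satisfiedBy(e) && other.satisfiedBy(e);
        }

        default Constraint not() {
            return e -> !satisfiedBy(e);
        }
    }

    static Constraint named(String name) {
        return e -> e.name.equals(name);
    }

    static Constraint valueContains(String text) {
        return e -> e.value instanceof String && ((String) e.value).contains(text);
    }

    // Satisfied when at least one linked entity satisfies the inner constraint.
    static Constraint linkedTo(Constraint inner) {
        return e -> e.links.stream().anyMatch(inner::satisfiedBy);
    }

    static List<Entity> select(List<Entity> candidates, Constraint c) {
        List<Entity> matches = new ArrayList<>();
        for (Entity e : candidates) {
            if (c.satisfiedBy(e)) {
                matches.add(e);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Entity sender = new Entity("sender", "carol@example.org");
        Entity first = new Entity("message", "Budget draft attached");
        Entity second = new Entity("message", "Lunch on Friday?");
        first.links.add(sender);

        // "Messages linked to a sender whose value mentions carol."
        Constraint fromCarol = named("message")
                .and(linkedTo(named("sender").and(valueContains("carol"))));

        List<Entity> all = Arrays.asList(sender, first, second);
        System.out.println(select(all, fromCarol).get(0).value);
    }
}
```

New kinds of condition are added by writing another small constraint, not by changing a fixed template format.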

Pools

To search for entities based on their properties (name, value, and linkage) we invented another class, a Pool, which stores all the entities. To delete an entity we remove it from its pool. Pools are also crucial for subscriptions, as we'll see later.
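As an illustration only, a pool can be pictured as a container with three operations: add, delete, and search. In the sketch below the method names are hypothetical, java.util.function.Predicate stands in for a constraint, and nothing is made thread-safe or persistent.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of a pool; not the KnownSpace Pool class.
class PoolSketch {

    static final class Entity {
        final String name;
        final Object value;

        Entity(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    static final class Pool {
        private final Set<Entity> entities = new LinkedHashSet<>();

        void add(Entity e) {
            entities.add(e);
        }

        // Deleting an entity simply means removing it from its pool.
        boolean delete(Entity e) {
            return entities.remove(e);
        }

        // The predicate plays the role of a constraint in this sketch.
        List<Entity> search(Predicate<Entity> constraint) {
            List<Entity> found = new ArrayList<>();
            for (Entity e : entities) {
                if (constraint.test(e)) {
                    found.add(e);
                }
            }
            return found;
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        Entity message = new Entity("message", "Lunch on Friday?");
        pool.add(message);
        pool.add(new Entity("sender", "carol@example.org"));

        System.out.println(pool.search(e -> e.name.equals("message")).size());
        pool.delete(message);
        System.out.println(pool.search(e -> e.name.equals("message")).size());
    }
}
```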

Pools present many difficult design problems, however. For example, if one pool is inside another pool, can the second pool be put into the first pool as well? Also, if entities can belong to multiple pools (which seems reasonable) then which pool, if any, has jurisdiction when we eventually add permissions to the system? Finally, pools are similar to entities, since entities can themselves store sets of entities. However, making Pool extend Entity, or deleting Pool entirely in favour of Entity, means making all the above design decisions right now. So we decided instead to leave Pool alone for now and only allow one (default) pool. Later, when we start supporting multiple users we will add support for multiple pools.

In sum, so far we have found ways to represent data (entities), represent data relationships (entities again), represent sets of data (entities yet again), describe sets of entities (constraints), store entities (pools), and search for entities (pools again). With economy of design, only four classes accomplish all six of these functions. To that set of functions we next add an ability to transparently store entities and their relationships indefinitely by making them persistent.

Entity Values

The standard way to accomplish persistence is to require that all objects be Serializable. It's easy enough to make entities Serializable since class Entity is under kernel control, but to ensure that their values are also Serializable we have to explicitly require that all entity values be Serializable. This, however, would not be particularly extensible. Suppose, for instance, we later decided that all entity values should also implement Cloneable or Comparable or some other interface. To accomplish that would require changing a lot of code and there would be no easy way to make sure that all entity values do in fact also implement the new interface.

To get around such expandability problems, all entity values must implement the EntityValue interface. If the requirements for entity values change in future, all we need do is change the EntityValue interface and all non-conforming code will automatically break, forcing correct updating of the entire system.
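A small sketch may make the benefit concrete. It assumes that EntityValue extends Serializable, which the text implies but does not state; StringValue and roundTrip are hypothetical names. Any value that implements the interface can be written out and read back by generic code that knows nothing else about it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch; the real EntityValue interface may differ.
class EntityValueSketch {

    // Assumption: the one place where requirements on values are declared.
    // Adding a method or a super-interface here breaks every value class
    // that does not yet conform.
    interface EntityValue extends Serializable {
    }

    static final class StringValue implements EntityValue {
        private static final long serialVersionUID = 1L;
        private final String text;

        StringValue(String text) {
            this.text = text;
        }

        String get() {
            return text;
        }
    }

    // Works for any EntityValue because the interface guarantees Serializable.
    static EntityValue roundTrip(EntityValue value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (EntityValue) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("value could not be persisted", e);
        }
    }

    public static void main(String[] args) {
        StringValue copy = (StringValue) roundTrip(new StringValue("hello"));
        System.out.println(copy.get());
    }
}
```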

To keep the system as expandable as possible we also want developers to be able to use objects of any class as entity values. This presents problems, however, when the class's source code is unchangeable and the class itself is unsubclassable, such as, for example, java.net.URL. In such cases, developers should create a proxy class that implements EntityValue and have it also implement the same interface as the class the developer wishes to use for particular entity values, then delegate from each of the public methods to the equivalent public method of the real entity value class.
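For java.net.URL, such a proxy might look roughly like the sketch below. UrlValue is a hypothetical name, and EntityValue is again assumed to extend Serializable. Since URL implements no richer interface than Serializable, the proxy mirrors a few of URL's public methods by name and delegates each one to the wrapped object.

```java
import java.io.Serializable;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical sketch of a proxy entity value for a final library class.
class UrlValueSketch {

    // Assumption: stand-in for the kernel's EntityValue interface.
    interface EntityValue extends Serializable {
    }

    static final class UrlValue implements EntityValue {
        private static final long serialVersionUID = 1L;

        // java.net.URL is final, so we wrap it instead of subclassing it.
        private final URL url;

        UrlValue(String spec) {
            try {
                this.url = URI.create(spec).toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("not a usable URL: " + spec, e);
            }
        }

        // Each public method delegates to the equivalent method of URL.
        String getProtocol() {
            return url.getProtocol();
        }

        String getHost() {
            return url.getHost();
        }

        String getPath() {
            return url.getPath();
        }

        String toExternalForm() {
            return url.toExternalForm();
        }

        @Override
        public String toString() {
            return url.toString();
        }
    }

    public static void main(String[] args) {
        UrlValue value = new UrlValue("http://example.org/index.html");
        System.out.println(value.getHost() + " " + value.getPath());
    }
}
```

The duplication is visible even in this short sketch: every method the developer needs from URL has to be repeated on the proxy.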

In this way, eventually all the commonly used classes will have EntityValue versions and future developers will not have to do any conversion work themselves. To further simplify the developer's job we provide several classes that already implement the EntityValue interface. These classes let developers create entity values for standard values like integers, doubles, booleans, strings, dates, and vectors.

This solution isn't particularly elegant since it forces some code duplication for new unmodifiable and unsubclassable classes but we can see no way around it at present.

Simpletons

All this complexly related, arbitrarily searchable, and arbitrarily extensible persistent data is pointless without any way to write code to manipulate it, so we also need support for applications. Normally, programmers view applications as single, monolithic things, but that is inflexible. A highly coupled program is hard to modify and, if it is large, hard to create in the first place. Further, it is hard for teams of independent developers to add or subtract functionality, and difficult to do so dynamically without a lot of forethought. Finally, it is likely to be brittle, and therefore unreliable, since so many parts intimately depend on so many other parts.

To escape these problems, and therefore increase the flexibility and reliability of the system as a whole, we encourage the writing of application code in small, independent, plug-compatible chunks that can better function as mix-and-match components. Such a style of programming lets many developers work on many different parts simultaneously and separately, and they can then combine those parts in many more ways than before with far less developer interaction than today's monolithic style.

Currently, we support two (possibly three) ways to add code to the system. The first is to write a Simpleton, which is essentially a runnable piece of code, except that the kernel manages thread allocation and control for efficiency (and to leave room later on to add permissions). Most affordable machines today can only support a few hundred Java threads, and creating a new thread is expensive, so we will eventually pool threads and hand them out to new applications as old applications end. Developers can dynamically incorporate simpletons into the running system without system recompilation. We discuss the other ways of adding computations next.
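The thread-pooling plan described above can be sketched with the standard java.util.concurrent executors. This is an illustration under assumptions, not the kernel's implementation: the Simpleton interface with a single process() method, the Kernel class, and runAll are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of kernel-managed threads for simpletons.
class SimpletonSketch {

    // Assumption: a simpleton is essentially a runnable piece of code.
    interface Simpleton {
        void process();
    }

    static final class Kernel {
        private final ExecutorService threads;

        // The kernel, not the application, decides how many threads exist.
        Kernel(int maxThreads) {
            threads = Executors.newFixedThreadPool(maxThreads);
        }

        // Queues every simpleton and waits; each one starts only when a
        // pooled thread becomes free.
        void runAll(List<Simpleton> simpletons) {
            List<Future<?>> pending = new ArrayList<>();
            for (Simpleton s : simpletons) {
                Runnable task = s::process;
                pending.add(threads.submit(task));
            }
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("simpleton failed", e.getCause());
                }
            }
        }

        void shutdown() {
            threads.shutdown();
        }
    }

    public static void main(String[] args) {
        AtomicInteger processed = new AtomicInteger();
        List<Simpleton> simpletons = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            simpletons.add(processed::incrementAndGet);
        }

        Kernel kernel = new Kernel(3); // ten simpletons share three threads
        kernel.runAll(simpletons);
        kernel.shutdown();
        System.out.println("simpletons run: " + processed.get());
    }
}
```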

Aside from future security limitations (about which more later), there are no limitations on what a simpleton can do.

Active Entity Values

Many application tasks do not need a separate thread because they only need to execute on certain conditions dependent on state change caused by other applications. So besides simpletons, we also support an ActiveEntityValue interface. These pieces of code are stored inside the data itself and are triggered whenever the entities they are stored in are examined by some actor (that is, a simpleton or another triggered ActiveEntityValue). They do not have their own threads.

This is particularly useful for triggered computations on a base entity of the entity the active entity value is stored in (that is, an entity that the entity it is stored in is an attribute of). For instance, an active entity value might report how many entities are currently linked to its base entity, or it might periodically save its base entity's value, write its number of accesses to a log file, ping the network, pop up a window for a user to enter a password, and so on. It may do even more general things too, since, again aside from future security limitations, there are no limitations on its action. It may even alter itself.
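The first of these examples, an active value that reports how many entities are linked to its base entity, might be sketched as follows. The method names examine, trigger, and get are hypothetical, as is the way the active value learns about its base entity (here, through its constructor). The point of the sketch is only that the computation runs on the thread of whichever actor examines the entity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of active and passive entity values.
class ActiveValueSketch {

    interface EntityValue {
    }

    interface PassiveEntityValue extends EntityValue {
        Object get();
    }

    interface ActiveEntityValue extends EntityValue {
        Object trigger();
    }

    static final class Entity {
        final String name;
        final List<Entity> attributes = new ArrayList<>();
        private final EntityValue value;

        Entity(String name, EntityValue value) {
            this.name = name;
            this.value = value;
        }

        // Examining an entity runs its active value on the caller's thread.
        Object examine() {
            if (value instanceof ActiveEntityValue) {
                return ((ActiveEntityValue) value).trigger();
            }
            if (value instanceof PassiveEntityValue) {
                return ((PassiveEntityValue) value).get();
            }
            return null;
        }
    }

    static final class Text implements PassiveEntityValue {
        private final String text;

        Text(String text) {
            this.text = text;
        }

        @Override
        public Object get() {
            return text;
        }
    }

    // Reports how many entities are currently attached to its base entity.
    static final class AttributeCount implements ActiveEntityValue {
        private final Entity base;

        AttributeCount(Entity base) {
            this.base = base;
        }

        @Override
        public Object trigger() {
            return base.attributes.size();
        }
    }

    public static void main(String[] args) {
        Entity message = new Entity("message", new Text("Lunch on Friday?"));
        Entity counter = new Entity("attribute count", new AttributeCount(message));
        message.attributes.add(counter);
        message.attributes.add(new Entity("sender", new Text("carol@example.org")));

        // The count is computed at the moment the counter is examined.
        System.out.println(counter.examine());
    }
}
```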

For the kernel to distinguish between active and passive entity values, and exploit that knowledge to improve performance, most persistent entity values should implement the PassiveEntityValue interface. This separates the methods specific to each kind of entity value (active and passive) from each other and from the supertype, EntityValue.

Finally, although no actors of this type yet exist, it should be possible to make "active entities", that is, subclasses of Entity that are also actors. We don't know which of the three kinds of actors is better, or whether all three should continue to work together, or in what combinations, since those decisions may depend on the application, so we allow all three: simpletons, active entity values, and, possibly, active entities.

At this point we have an extremely decentralized architecture. Small, independent actors are loosely coupled with the data and with each other so developers can dynamically attach arbitrary computations to arbitrary data. Next we need some way to let actors communicate to accomplish larger tasks.

Events

We want to let all these actors working on all this data communicate to accomplish their various tasks. One standard way to do this is to have direct method calls between actors, but that would make the system inflexible since it would force strong coupling between the actors doing the calling. Instead, we support a mechanism to report a DataManagerEvent (that is, report state change) either for data or actors, which other data or actors can then trigger on.

Developers can arbitrarily extend DataManagerEvents to support publication of new types of state change as they add new types of data or actors to the system.

Event Generators and Event Handlers

Following event delegation implementations like Swing, it might seem reasonable to only allow entities and pools to generate events, but forcing that to always be the case is inflexible because we have no idea what new class of objects someone may want to add to the system in future. So it seems reasonable for any object whatsoever to be able to generate events.

In static implementations of the Observer design pattern, like the Java 1.1 event delegation model, this sort of thing is handled on an object-to-object basis. Each object advertises the particular set of events it can generate (via the addXXXListener() methods it supports) and other objects subscribe directly to particular objects to be alerted about that object's state change. While this might be fine for user interfaces with a fixed number of widgets, it is inflexible since it increases the coupling of the system. All objects must be directly aware of any object they wish to subscribe to, instead of only the particular state change they wish to notice. Further, it means that every event-generating object has to have a plethora of compile-time declared methods to advertise its event-generating capability.

Instead, we have an EventGenerator interface for objects wishing to generate events, and an EventHandler interface for those wishing to catch events. Any piece of code, data or actor, can implement EventHandler (there is a separate event thread inside the kernel to pass on the handle requests), and, by tossing events into the pool, any piece of code can also be a (kind of) proxy EventGenerator without needing special kernel support.

In this scheme, a pool itself becomes the proxy source of events (as in Linda and its modern descendants, JavaSpaces and TSpaces), and that decouples sender from receiver. The sender doesn't need to know that the receiver exists, the receiver doesn't need to know that the sender exists, and the sender and receiver don't even have to exist at the same time. Senders can generate events for receivers that don't exist yet and receivers can listen to events from senders that no longer exist. (Our present implementation of the event channel, however, requires them to coexist, since we don't yet store events in the pool until needed.)

The only knowledge that sender and receiver need to share is the knowledge of existence of a pool and the knowledge of the particular type of events to throw in or fish out. This lets actors notice state change in data or in actors and therefore "communicate"---but without the flexibility disadvantages of direct coupling or numerous Listener methods. A pool thus acts as an event channel (an event queue and router) as in the InfoBus architecture.

In case developers need it, we also support another way to notice state change, and that is to subscribe to an entity directly (all entities and pools are EventGenerators), but we discourage its use since it couples sender and receiver.

Finally, the overall best way to communicate when there is no time pressure on the communication act is to simply attach a particular attribute to the entities to be further worked on. Future actors can then find them through a search on the pool and work on them at their leisure. Events are only necessary when there is a need for urgent action.

To describe arbitrary sets of events we again use constraints, so any actor can subscribe to a pool to listen for any arbitrary set of events. This gives developers an arbitrarily extensible event notification scheme.
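The pool-as-event-channel idea can be sketched as below. Only the names DataManagerEvent and EventHandler come from the text; subscribe, publish, handle, and the ValueChangedEvent subtype are hypothetical, Predicate stands in for a constraint on events, and dispatch is synchronous here, whereas the kernel is described as using a separate event thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

// Hypothetical sketch of a pool acting as an event channel.
class EventChannelSketch {

    static class DataManagerEvent {
        final Object subject;

        DataManagerEvent(Object subject) {
            this.subject = subject;
        }
    }

    // Hypothetical subtype, standing for a developer-added kind of state change.
    static final class ValueChangedEvent extends DataManagerEvent {
        final Object newValue;

        ValueChangedEvent(Object subject, Object newValue) {
            super(subject);
            this.newValue = newValue;
        }
    }

    interface EventHandler {
        void handle(DataManagerEvent event);
    }

    static final class Pool {
        private static final class Subscription {
            final Predicate<DataManagerEvent> constraint;
            final EventHandler handler;

            Subscription(Predicate<DataManagerEvent> constraint, EventHandler handler) {
                this.constraint = constraint;
                this.handler = handler;
            }
        }

        private final List<Subscription> subscriptions = new CopyOnWriteArrayList<>();

        // The receiver names a set of events, not a particular sender.
        void subscribe(Predicate<DataManagerEvent> constraint, EventHandler handler) {
            subscriptions.add(new Subscription(constraint, handler));
        }

        // The sender names no receiver; the pool routes by constraint.
        // Returns the number of handlers the event was delivered to.
        int publish(DataManagerEvent event) {
            int delivered = 0;
            for (Subscription s : subscriptions) {
                if (s.constraint.test(event)) {
                    s.handler.handle(event);
                    delivered++;
                }
            }
            return delivered;
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        List<DataManagerEvent> seen = new ArrayList<>();
        pool.subscribe(e -> e instanceof ValueChangedEvent, seen::add);

        pool.publish(new DataManagerEvent("message 1"));
        pool.publish(new ValueChangedEvent("message 1", "edited text"));
        System.out.println("events received: " + seen.size());
    }
}
```

Sender and receiver share only the pool and the event type, which is the decoupling the design is after.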

Exceptions

Finally, we need some way to report execution errors specific to the system, as opposed to the Java virtual machine the system is running on. For that we support DataManagerException. Right now there is only one, but we expect to add many more over the next year, as the system is refined, to report more fine-grained exceptional conditions. As with the rest of the system, developers are free to arbitrarily extend the DataManagerException hierarchy of classes and interfaces.
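Extending the hierarchy might look like the sketch below. Only the name DataManagerException comes from the text. Whether it is a checked exception is an assumption, and EntityNotFoundException, lookup, and describe are hypothetical names used to show a caller handling the fine-grained case before the general one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an extended exception hierarchy.
class ExceptionSketch {

    // Assumption: a checked exception, so callers must declare or handle it.
    static class DataManagerException extends Exception {
        private static final long serialVersionUID = 1L;

        DataManagerException(String message) {
            super(message);
        }
    }

    // A more fine-grained condition added by a developer.
    static final class EntityNotFoundException extends DataManagerException {
        private static final long serialVersionUID = 1L;

        EntityNotFoundException(String name) {
            super("no entity named " + name);
        }
    }

    // A plain map stands in for a pool of named values.
    static String lookup(Map<String, String> pool, String name) throws DataManagerException {
        String value = pool.get(name);
        if (value == null) {
            throw new EntityNotFoundException(name);
        }
        return value;
    }

    // Handles the specific condition first, then any other system error.
    static String describe(Map<String, String> pool, String name) {
        try {
            return lookup(pool, name);
        } catch (EntityNotFoundException e) {
            return "missing: " + e.getMessage();
        } catch (DataManagerException e) {
            return "system error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        Map<String, String> pool = new HashMap<>();
        pool.put("subject", "Lunch on Friday?");
        System.out.println(describe(pool, "subject"));
        System.out.println(describe(pool, "sender"));
    }
}
```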

System Concerns

Much has to happen behind the scenes to give the appearance of no limitations while still allowing room for the kernel to implement security protocols, support persistence, enforce data integrity, and manage system resources.

The kernel's first line of defence is to use proxies for entities to limit arbitrary message passing. There is no way, for instance, for non-kernel applications to acquire a reference to an actual entity; all an application developer can ever have is a reference to an entity proxy. This is why, for instance, class Entity has no public constructor; all entity creation requests must go through Entity.create(). The lack of direct references is transparent to the developer because the entity proxy implements the same interface as the (real) Entity class. Entity proxies can then function as smart links, tracking or disallowing any method access to any entity from any actor or entity, should the kernel deem it necessary.
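A rough sketch of this arrangement follows. The factory method Entity.create() is named in the text, though its real signature is not given; the EntityView interface, the EntityProxy class, and the kernel-side helpers revokeWrite and accessCount are hypothetical. The sketch shows the proxy both tracking and disallowing access while the application sees only the shared interface.

```java
// Hypothetical sketch of entity proxies acting as smart links.
class EntityProxySketch {

    // Assumption: the interface shared by the real entity and its proxy.
    interface EntityView {
        String getName();

        Object getValue();

        void setValue(Object value);
    }

    static final class Entity implements EntityView {
        private final String name;
        private Object value;

        // No public constructor: applications cannot build a raw entity.
        private Entity(String name, Object value) {
            this.name = name;
            this.value = value;
        }

        // The only way to obtain an entity; callers receive a proxy.
        static EntityView create(String name, Object value) {
            return new EntityProxy(new Entity(name, value));
        }

        @Override
        public String getName() {
            return name;
        }

        @Override
        public Object getValue() {
            return value;
        }

        @Override
        public void setValue(Object value) {
            this.value = value;
        }
    }

    static final class EntityProxy implements EntityView {
        private final Entity target;
        private boolean writable = true;
        private int accesses;

        private EntityProxy(Entity target) {
            this.target = target;
        }

        @Override
        public String getName() {
            accesses++;
            return target.getName();
        }

        @Override
        public Object getValue() {
            accesses++;
            return target.getValue();
        }

        @Override
        public void setValue(Object value) {
            accesses++;
            if (!writable) {
                throw new SecurityException("write access denied");
            }
            target.setValue(value);
        }
    }

    // Kernel-side controls; a real system would hide these from applications.
    static void revokeWrite(EntityView view) {
        ((EntityProxy) view).writable = false;
    }

    static int accessCount(EntityView view) {
        return ((EntityProxy) view).accesses;
    }

    public static void main(String[] args) {
        EntityView message = Entity.create("message", "draft");
        message.setValue("final");
        System.out.println(message.getValue() + ", accesses: " + accessCount(message));
    }
}
```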

The kernel's second line of defence is to disallow raw thread creation. Currently, the only legal way to get a new thread is to create a simpleton, and even then there is no guarantee that the new simpleton will be given a thread immediately. All threads will eventually be pooled, and handed out only as other simpletons die. Any unauthorized threads will be killed. Thread pooling will let the kernel control the number of threads and also will let it register each thread to a particular simpleton, so accountability is always enforceable. Finally, it also will let the system run on low-memory or slow machines, although so far we have not developed that capability.

By controlling entities and threads the kernel has the beginnings of both a memory manager and a process manager, so priorities, scheduling, authentication, caching, and logging are possible (although only weakly implemented at present).

Helium Package Structure

Here is the package structure for the kernel and related packages:

        org
           datamanager
              kernel             (Entity, EntityValue, Simpleton, Pool,
                                  EventHandler, and EventGenerator)
              tools              (tools like Debug)
              passiveentityvalue (passive entity values)
              activeentityvalue  (active entity values)
              constraint         (constraints)
              event              (datamanager events)
              exception          (datamanager exceptions)