The object storage subsystem (OSS) of the present invention provides a system for storing objects (small stand-alone software programs containing both data and functional algorithms) in a locally available network. The OSS can operate in two modes: data mirroring mode and data federation mode. Data mirroring mode uses multiple stand-alone computing nodes to store multiple copies of the same data. This affords the data mirroring mode high availability, because the data may be retrieved from multiple sources, and high fault tolerance, because the data is stored in multiple locations.
Referring to
An important aspect of the relationship of the data between nodes is that a request for information from any individual node may simultaneously trigger requests to other data nodes in the system, based on functional algorithms contained in the data of the original node, to retrieve data that is not present in the original node. The OSS allows data from all nodes to be used as one monolithic data representation from all points in the distributed system.
Referring to
The data storage module enables the OSS to store data on the nodes of the system. The OSS allows the system to continue operating in spite of any individual failed operation that may occur. This is possible because the module can rebuild the ID Table using data stored in Storage files. All of the objects in the system are stored in Storage files, which can be compressed if necessary and which reside on the various nodes of the system. A parameter sets the number of objects that can be stored in a single Storage file.
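The recovery of the ID Table from Storage files can be illustrated with a minimal sketch. It assumes, purely for illustration, that each Storage file carries a sequence number reflecting the order in which it was written and lists the IDs of the objects it contains; the names `StorageFile` and `rebuild_id_table` and the file names are hypothetical and do not appear in the specification.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a Storage file:
# (sequence number, file name, IDs of the objects stored in the file).
StorageFile = Tuple[int, str, List[str]]


def rebuild_id_table(files: List[StorageFile]) -> Dict[str, str]:
    """Return a map of object ID -> name of the newest file containing it."""
    id_table: Dict[str, str] = {}
    # Visit files from oldest to newest so that later versions win.
    for _seq, name, object_ids in sorted(files):
        for oid in object_ids:
            id_table[oid] = name
    return id_table


# Assumed example data: object "b" was rewritten by the later file.
files = [
    (2, "tx-0002.dat", ["b", "c"]),
    (1, "tx-0001.dat", ["a", "b"]),
]
table = rebuild_id_table(files)
```

In this sketch the newest file wins for each object ID, which matches the statement elsewhere in the description that the ID Table holds the location of the last version of each object.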
A second type of file, known as an ID Table, is used to store data about the location of objects relative to storage files on computing nodes, along with state information about the objects. By accessing the ID Table, the OSS has fast access to objects, which improves efficiency. The ID Table file also contains information about links to each object. The OSS provides this information every time an object is put into, updated in, or removed from storage. When an object is considered obsolete by the garbage collector module, the object and the data comprising it can be automatically deleted from the system.
The garbage collector module periodically checks the ID Table to locate objects and data that are no longer linked to any other objects, and therefore have fallen out of the transitive closure of the dataset. The transitive closure is the set of all objects that can be reached from the root node of a dataset by traversing the graph of object references. Since objects outside this set will no longer be used by the program, they may be deleted from the database. In addition, any storage files that have a small number of objects may be consolidated.
The frequency of garbage collection and the number of objects within a file are controlled by parameters of the garbage collector.
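The reachability test described above can be sketched as a graph traversal over the object references recorded in the ID Table. The sketch below is an assumption-laden illustration, not the implementation of the garbage collector module: the reference map `refs`, the function names, and the object IDs are all hypothetical.

```python
from typing import Dict, List, Set


def reachable(refs: Dict[str, List[str]], root: str) -> Set[str]:
    """Return every object ID reachable from root by following references."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in refs:
            continue  # already visited, or a dangling reference
        seen.add(oid)
        stack.extend(refs[oid])
    return seen


def collect_garbage(refs: Dict[str, List[str]], root: str) -> Set[str]:
    """Return the IDs that have fallen out of the transitive closure."""
    return set(refs) - reachable(refs, root)


# Assumed example: "x" and "y" reference each other but not the root's graph.
refs = {"root": ["a"], "a": ["b"], "b": [], "x": ["y"], "y": ["x"]}
garbage = collect_garbage(refs, "root")
```

Here `garbage` contains "x" and "y": although they reference each other, neither can be reached from the root, so both are outside the transitive closure.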
The transaction isolation module (provided by an open source database) sets locks against objects involved in a transaction, and the Distributed Lock distributes these locks across the nodes of the network. This allows the OSS to manage transactions. First, the Distributed Lock tries to lock the required object on the node where the object is located. If the object has been locked successfully, the Distributed Lock sends all other nodes a message with information about the locked object. On each node, the Distributed Lock, after receiving this information, sets a lock for the object even if the object does not reside on that node.
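The two-step locking sequence above (lock on the home node, then propagate to all other nodes) can be sketched in a single process. This is a simplified illustration under stated assumptions: the classes `Node` and `DistributedLock` are hypothetical, the broadcast message is replaced by a direct assignment, and no real network transport such as JGroups is involved.

```python
from typing import Dict, List, Optional, Set


class Node:
    """Hypothetical computing node holding objects and a local lock table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Set[str] = set()    # IDs of objects residing on this node
        self.locks: Dict[str, str] = {}   # object ID -> ID of the locking transaction


class DistributedLock:
    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes

    def lock(self, object_id: str, tx_id: str) -> bool:
        # Step 1: try to lock the object on the node where it resides.
        home: Optional[Node] = next(
            (n for n in self.nodes if object_id in n.objects), None
        )
        if home is None:
            return False
        holder = home.locks.get(object_id)
        if holder is not None and holder != tx_id:
            return False  # another transaction already holds the lock
        home.locks[object_id] = tx_id
        # Step 2: inform every other node; each records the lock even
        # though the object does not reside there.
        for node in self.nodes:
            if node is not home:
                node.locks[object_id] = tx_id
        return True


n1, n2 = Node("n1"), Node("n2")
n1.objects.add("obj-1")
dlock = DistributedLock([n1, n2])
first = dlock.lock("obj-1", "t1")    # succeeds on the home node, then propagates
second = dlock.lock("obj-1", "t2")   # refused: t1 still holds the lock
```

After the first call, both nodes record the lock held by transaction "t1", including the node on which the object does not reside.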
The transaction handler module receives messages regarding “commit” and “rollback.” Commit commands indicate that a transaction within the network is to be completed, whereas a rollback command indicates that a transaction should be reversed so that it appears to have never occurred. The transaction handler module distributes these messages between nodes and executes a commit or rollback command by sending the appropriate data to the data storage module. The data storage module then makes changes to the ID Table regarding objects, and makes changes to the files containing those objects.
The internode communication module enables the system to use different communication protocols. In a preferred embodiment, the implementation uses JGroups over TCP/IP. The communication module is used by the transaction isolation module and the transaction management module to allow communication between nodes.
Referring to
The transaction isolation system (using the Distributed Lock) is in contact with the transaction management system, which receives and sends commit and rollback commands between the object storage subsystem and the nodes (by the transaction handler), allowing changes to be made to the objects and data.
The transaction management system communicates with the main object storage subsystem, which maintains the storage files for the objects and the ID Table. The ID Table contains information regarding the storage files, including location information, object state information (which indicates whether the proxy for the object is still used by a client), and object references. The garbage collector module monitors the object references and deletes obsolete objects and data, improving the performance and efficiency of the system.
The present invention clusters objects and conducts transaction management in a manner that increases operational efficiency. In the current art, objects are joined into so-called clusters. Each cluster is stored in one file on the file system of an OS. At run time, the size of a cluster (the quantity of objects that can be contained in one cluster) is fixed and is defined by parameters set in the system configuration file before the system is started. When a client application creates a new object and stores it in a database, the system assigns an ID to the object and stores the object in the first cluster whose current quantity of contained objects is less than the number defined by the configuration parameter. A client application can call the object by its ID or its name. If the object has a name, this name is stored in a special file called the Name Table, where the object's ID is stored under the object's name. When the client application calls for the object by its name, the system finds the object ID in this table.
At run time, if the client application calls an object, the system (using the object ID) finds the appropriate cluster and reads the object from there. After the object has been changed, the client (by executing the appropriate call) puts it in the DB, and the system stores it in the cluster where the object was contained before. If the client application used a number of objects in one transaction, which occurs frequently, each used object will be stored again in its own cluster. When the system loads the called object, it loads the cluster containing this object. Therefore, if the transaction locks the object, all clusters containing the required objects are also locked. This scheme has the following disadvantage: in the event that two different objects stored in one cluster are used in two different transactions, one of the transactions will be blocked as long as the other one has not been committed. This scenario can be represented by the following equation:
T=Tt1+Tt2
where T is the time spent executing the two transactions t1 and t2, Tt1 is the time spent executing transaction t1, and Tt2 is the time spent executing transaction t2. This is true because t2 will be blocked as long as t1 is not committed. This type of storage mechanism is inefficient and results in lower productivity.
By comparison, the present invention provides a more effective method of storing data. In the present invention, objects in storage are also grouped in clusters; however, each cluster contains the objects that were stored in one transaction when that transaction was committed. When a client application calls an object, the system loads the cluster containing the object. However, the system does not lock the loaded cluster. Instead, the system stores all objects contained in the loaded cluster in an object cache, with the required objects locked per transaction. When a transaction is being committed, all the objects used in it are grouped in one cluster, and this cluster is stored in a new file in the file system of the OS. The system then makes the appropriate changes in an ID Table file, where the current location of each object is stored by object ID. Using this scheme, clusters that do not contain actual objects are deleted. Furthermore, clusters that contain a small number of objects are regrouped into clusters containing a more appropriate number of objects. This improved system can be represented by the equation:
T=max(Tt1,Tt2)<Tt1+Tt2
where T, t1, t2, Tt1 and Tt2 are the same as in the first equation. In the present invention, the ID Table is not simply a map where object locations are stored by their respective IDs. Rather, when an object is stored, information regarding all references to other relevant objects is provided. This information is useful for other purposes as well, such as garbage collecting.
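As a worked illustration of the two timing equations, the following sketch compares the serialized case (cluster-level locking) with the concurrent case (object-level locking). The durations of 3 and 5 time units are hypothetical values chosen only for the arithmetic, and the function names are not part of the specification.

```python
def serialized_time(tt1: float, tt2: float) -> float:
    """Cluster-level locking: t2 waits for t1, so T = Tt1 + Tt2."""
    return tt1 + tt2


def concurrent_time(tt1: float, tt2: float) -> float:
    """Object-level locking: t1 and t2 overlap, so T = max(Tt1, Tt2)."""
    return max(tt1, tt2)


# Hypothetical durations for transactions t1 and t2.
tt1, tt2 = 3.0, 5.0
blocked = serialized_time(tt1, tt2)     # 3 + 5 = 8 time units
unblocked = concurrent_time(tt1, tt2)   # max(3, 5) = 5 time units
```

For any two transactions with positive durations, the concurrent total is strictly smaller than the serialized total, which is the inequality stated above.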
The present invention also provides a novel disk storage system and system optimization feature. In the present art, when objects are saved onto disk, processor time is spent not only on writing the object's data, but also on “overhead expenses.” Overhead expenses are of two kinds: expenses for file creation and opening, and expenses for finding old instances of an object within a file in order to overwrite them. If all objects are stored in one file, the expense for file creation and opening is minimal, but the expense for finding the object is increased. By contrast, if each object is stored in a separate file, the expense for finding an object is minimal, but the expense for creating and opening files is greater.
In systems currently available in the marketplace, all objects stored in a database are grouped into clusters, and each cluster is stored in a separate file. The size of a cluster is fixed (by configuration parameters) and objects are added to the cluster until its size limit is reached. This scheme has several shortcomings: In some cases, when a transaction is committed, saved objects can be placed in different clusters (potentially with the number of objects equal to the number of clusters). Furthermore, object locks (used for transaction isolation) are not based on objects but rather on clusters (for instance row-locking and page-locking in RDBMS) which leads to unnecessary transaction blocking.
To minimize overhead expenses and to eliminate shortcomings currently in the art, the present invention groups the objects modified within one transaction in one cluster. They are then stored in one file. In addition to eliminating the above problems, this scheme of object grouping has some additional advantages: it eliminates the need to synchronize data storage from different transactions, and it allows the system to use load-ahead cache population with a high success rate (based on the combined use of objects). When a transaction is being completed, all domain objects modified in the transaction are stored in one file. For each domain object, the system creates a utility object, ContainerLocation. This object contains the name of the file that contains the given object, a list of the other objects to which the given object refers, and other information. The ContainerLocations are put into an ID Table, which contains a pair of object ID and ContainerLocation for each object. Therefore, the ID Table contains the location of the last version of each object. Moreover, the ID Table monitors the number of active objects located in a cluster and deletes the cluster if it no longer contains any active objects.
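The bookkeeping described above can be sketched as follows. Only the name ContainerLocation comes from the description; the class `IdTable`, its fields, the `commit` method, and the file names are hypothetical, and the sketch assumes that every commit writes its cluster to a newly named file and that deleting a cluster can be modeled by recording its file name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ContainerLocation:
    file_name: str                                        # file holding the object
    references: List[str] = field(default_factory=list)   # IDs the object refers to


class IdTable:
    def __init__(self) -> None:
        self.locations: Dict[str, ContainerLocation] = {}
        self.active: Dict[str, Set[str]] = {}   # file name -> IDs whose last version it holds
        self.deleted_files: List[str] = []      # clusters dropped for lack of active objects

    def commit(self, file_name: str, objects: Dict[str, List[str]]) -> None:
        """Record one committed transaction stored as a single cluster file."""
        for oid, refs in objects.items():
            old = self.locations.get(oid)
            if old is not None and old.file_name != file_name:
                # The previous version is now obsolete in its old cluster.
                self.active[old.file_name].discard(oid)
                if not self.active[old.file_name]:
                    del self.active[old.file_name]
                    self.deleted_files.append(old.file_name)
            self.locations[oid] = ContainerLocation(file_name, list(refs))
            self.active.setdefault(file_name, set()).add(oid)


table = IdTable()
table.commit("tx-1.dat", {"a": ["b"], "b": []})
table.commit("tx-2.dat", {"a": []})       # tx-1.dat still holds the last version of "b"
table.commit("tx-3.dat", {"b": ["a"]})    # tx-1.dat now holds no active objects
```

After the third commit, the first cluster file no longer contains the last version of any object, so the sketch records it as deleted, while the ID Table points each object ID at the file written by its most recent commit.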
This application claims the benefit of the priority date of provisional application No. 60/816,024, filed on Jun. 21, 2006.
Number | Date | Country | |
---|---|---|---|
60816024 | Jun 2006 | US |