The present invention generally relates to data stream processing, and more particularly relates to storage for data stream processing systems.
Unstructured information represents the largest, most current and fastest growing source of knowledge available to businesses and governments. This information is typically processed in real time by high-performance data stream processing systems.
The first, payload-free information unit 206 is advanced to analytic processing stages (executed by a plurality of processing units), while the second information unit 208 is sent to storage. Any processing unit may later access data needed to refine content interpretation from the second information unit 208 using the retrieval key. Eventually, unused data from the second information unit 208 is either discarded or transformed into a reporting form (such that the retrieval key is no longer required). Subsequently, all information units are discarded at the time of egress or last access.
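By way of illustration only, the following minimal Python sketch models the split described above: a payload-free lightweight unit and a payload-bearing heavyweight unit linked by a shared retrieval key. The class and function names here are hypothetical assumptions for exposition, not elements of the disclosed system.

```python
import uuid
from dataclasses import dataclass

@dataclass
class LightweightUnit:          # cf. first, payload-free information unit 206
    retrieval_key: str
    annotations: dict

@dataclass
class HeavyweightUnit:          # cf. second, payload-bearing information unit 208
    retrieval_key: str
    payload: bytes

def split_unit(annotations: dict, payload: bytes):
    """Divide an original information unit into its two linked halves."""
    key = uuid.uuid4().hex      # retrieval key shared by both halves
    return LightweightUnit(key, annotations), HeavyweightUnit(key, payload)
```

In this picture, the lightweight half travels through analytics carrying only the key and annotations, while the heavyweight half can be fetched later by key whenever deeper content interpretation is required.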
Typical data stream processing systems employ a server running a sophisticated database to provide scalable archiving of data. However, scalability issues remain for massively expanded data stream processing applications, no matter how robust the database server is. This is due, in part, to the “distance” of the processing units from the database server, which can add network hops and congestion, slowing connectivity for data storage and retrieval. The need to maintain indices and other data storage artifacts that permit rapid data retrieval also adds to the cost of maintaining a repository.
Therefore, there is a need in the art for a method and apparatus for scalable storage for data stream processing systems.
In one embodiment, the invention is a method and apparatus for scalable storage for data stream processing systems. One embodiment of a system for processing a data stream includes a first set of processing units configured for processing of at least a lightweight portion of an information unit and a second set of processing units configured for storage of a heavyweight portion of the information unit.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
It is to be noted, however, that the appended drawings illustrate only exemplary embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
The present invention is a method and apparatus for scalable storage for data stream processing systems. Embodiments of the invention provide many advantages over traditional data stream processing systems. By arranging processing units in a delay ring and allowing them to be raveled through advanced processing units, the “distance” between the advanced processing units and the delay ring storage can be minimized. This reduces network hops and congestion, thereby speeding connectivity for data storage and retrieval. Moreover, the system eliminates or reduces the need for costly disk storage and index table maintenance.
In practice, an incoming data stream 306 is received by a processing unit 302₁, and original information units from the data stream 306 are split into first, lightweight information units (comprising annotations, retrieval keys and other potentially “interesting” data) and second, heavyweight information units comprising bulk data (i.e., the payload and essential annotations), as discussed above with respect to FIG. 2.
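As a further illustrative sketch, and assuming a simple successor-pointer model of the “subscribe” relationship (the StorageUnit class and its methods are assumptions, not the disclosed interface), the delay ring 304 can be pictured as storage processing units connected in a cycle, each briefly holding heavyweight units and forwarding them onward so that they re-circulate:

```python
from collections import deque

class StorageUnit:
    """One storage processing unit in a delay ring (illustrative only)."""
    def __init__(self):
        self.buffer = deque()   # heavyweight units currently held here
        self.successor = None   # the unit subscribing to this unit's output

    def receive(self, unit):
        self.buffer.append(unit)

    def forward(self):
        # Pass the oldest held unit along the ring, so every heavyweight
        # unit keeps re-circulating until it is finally discarded.
        if self.buffer and self.successor is not None:
            self.successor.receive(self.buffer.popleft())

def make_ring(n):
    """Connect n storage units into a closed delay ring."""
    units = [StorageUnit() for _ in range(n)]
    for i, u in enumerate(units):
        u.successor = units[(i + 1) % n]
    return units
```

In this picture, adding units to the ring directly lengthens the interval before a heavyweight unit returns to any given point, i.e., the ring's retention delay.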
If a processing unit 302 in the first set of processing units requires a bulk data item corresponding to a given first information unit, the processing unit 302 uses the retrieval key in the first information unit to set a “flow criteria” for accepting a copy of the second information unit (i.e., the second information unit that corresponds to the first information unit) from a desired point on the delay ring 304, as illustrated in phantom by stream connection 308. The more such tap points that are distributed around the delay ring 304, the lower the latency will be to retrieve the re-circulating second information unit. The original information unit (i.e., comprising the corresponding first information unit and second information unit) is discarded only when some final use or transformation of the data is performed, and that use or transformation is broadcast by a finalizing processing unit 302. In one embodiment, the second information unit is discarded when the corresponding first information unit is discarded.
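The “flow criteria” step might be pictured as follows; the TapPoint class and its callback interface are illustrative assumptions layered on the heavyweight-unit sketch above, not the disclosed mechanism:

```python
class TapPoint:
    """A point on the delay ring where flow criteria can be set (illustrative)."""
    def __init__(self):
        self.criteria = {}      # retrieval_key -> callback awaiting a copy

    def request(self, retrieval_key, callback):
        # An analytic processing unit sets a flow criterion here, using the
        # retrieval key carried in its lightweight (first) information unit.
        self.criteria[retrieval_key] = callback

    def observe(self, unit):
        # Invoked for each heavyweight unit flowing past this ring position;
        # a matching unit is copied out while the original keeps circulating.
        cb = self.criteria.pop(unit.retrieval_key, None)
        if cb is not None:
            cb(unit)
        return unit
```

Placing more such tap points around the ring shortens the worst-case wait for a re-circulating unit to pass one of them, which is the latency effect noted above.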
The system 300 provides many advantages over traditional data stream processing systems. By allowing the processing units (e.g., 302₅-302ₙ) in the delay ring 304 to be raveled through advanced processing units (e.g., 302₂-302₄), the “distance” between the advanced processing units and the delay ring storage can be minimized. Moreover, the system 300 eliminates or reduces the need for costly disk storage and index table maintenance.
For instance, if one wished to expand the storage capacity of a system originally comprising only the first delay ring 400₁, one would construct the second delay ring 400ₙ and then set one of the processing units 402 in the second delay ring 400ₙ to “subscribe” to the output flow of a processing unit 402 in the first delay ring 400₁. This is illustrated in phantom by stream connection 404, by which a “first” processing unit 402₉ of the second delay ring 400ₙ subscribes to the output of a “last” processing unit 402₃ of the first delay ring 400₁. The stream connection between the “last” processing unit 402₃ of the first delay ring 400₁ and a “first” processing unit 402₄ of the first delay ring 400₁, to which the “last” processing unit 402₃ previously forwarded its output, is then terminated, as illustrated by broken stream connection 406. The “first” processing unit 402₄ of the first delay ring 400₁, which is now receiving no data as a result of the broken stream connection 406, is then set to “subscribe” to the output of a “last” processing unit 402ₙ of the second delay ring 400ₙ, as illustrated in phantom by new stream connection 408. The retention capacity of the data stream processing system is thus increased by adding processing units 402 to store and forward information units (payload).
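Using the same illustrative successor-pointer model as the earlier sketches (all names hypothetical), the splice reduces to two pointer updates; establishing connection 404 implicitly breaks connection 406:

```python
def splice_in(ring1_last, ring1_first, ring2_first, ring2_last):
    """Grow retention capacity by inserting the second ring into the first."""
    ring1_last.successor = ring2_first    # new connection 404 (breaks 406)
    ring2_last.successor = ring1_first    # new connection 408; one larger cycle
```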
Conversely, if one wanted to reduce the storage capacity of a system originally comprising both the first delay ring 400₁ and the second delay ring 400ₙ, one would first break the stream connection 404 between the “first” processing unit 402₉ of the second delay ring 400ₙ and the “last” processing unit 402₃ of the first delay ring 400₁. This causes information units to back up in the chain of processing units 402 at the “last” processing unit 402₃ of the first delay ring 400₁ and in those processing units 402 upstream of it. Once the last information unit has left the “last” processing unit 402ₙ of the second delay ring 400ₙ, the “first” processing unit 402₄ of the first delay ring 400₁ is set to “subscribe” to the output of the “last” processing unit 402₃ of the first delay ring 400₁. This completes the first delay ring 400₁. The stream connection 408 between the “first” processing unit 402₄ of the first delay ring 400₁ and the “last” processing unit 402ₙ of the second delay ring 400ₙ is then broken, and the processing units 402 of the removed second delay ring 400ₙ are free for other use. Thus, the present invention enables scalable parallelization of data storage and retrieval by allowing storage to be sectionalized across multiple delay rings (each delay ring having at least one processing unit).
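The reverse operation, again in the same hypothetical model (wait_until_drained is a stand-in for whatever drain-detection mechanism the system actually employs), might look like:

```python
def splice_out(ring1_last, ring1_first, ring2_last, wait_until_drained):
    """Shrink retention capacity by removing the second ring."""
    ring1_last.successor = None           # break connection 404; units back up upstream
    wait_until_drained()                  # block until ring2's last unit has departed
    ring1_last.successor = ring1_first    # re-close the first delay ring
    ring2_last.successor = None           # break connection 408; ring2 is freed
```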
Thus, the present invention represents a significant advancement in the field of data stream processing. Embodiments of the invention provide many advantages over traditional data stream processing systems. By arranging processing units in a delay ring and allowing them to be raveled through advanced processing units, the “distance” between the advanced processing units and the delay ring storage can be minimized. This reduces network hops and congestion, thereby speeding connectivity for data storage and retrieval. Moreover, the system eliminates or reduces the need for costly disk storage and index table maintenance.
While the foregoing is directed to the illustrative embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
This invention was made with Government support under Contract No. H98230-05-3-001, awarded by the Intelligence Agency. The Government has certain rights in this invention.