This invention relates to a stream data processing method and a stream data processing device.
There is an increasing demand for data processing systems that process in real time a large amount of data arriving minute by minute. Automatic buying and selling of stocks, car floating, Web access monitoring, and manufacturing monitoring can be given as examples.
Database management systems (hereinafter abbreviated as DBMSs) have hitherto been positioned at the center of data management in corporate information systems. DBMSs store data to be processed in storage, and perform highly reliable processing, typically transaction processing, on the stored data. With DBMSs, however, search processing is conducted for every piece of data each time new data arrives, which makes it difficult to satisfy the requirements of the real-time processing described above. In the case of a financial application for assisting stock trading, for instance, one of the most important objectives of the system is to react as quickly as possible to fluctuations in stock prices. Data search processing in the conventional DBMSs described above is incapable of keeping up with the speed at which stock prices change, which creates a real risk that corporations will miss business opportunities.
Stream data processing systems are proposed as data processing systems suitable for real-time data processing of this type. For example, a stream data processing system “STREAM” is disclosed in R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein and R. Varma, “Query Processing, Resource Management, and Approximation in a Data Stream Management System”, in Proc. of the 2003 Conf. on Innovative Data Systems Research (CIDR), January 2003 (Non-Patent Literature 1). In stream data processing systems, unlike conventional DBMSs, a query (inquiry) is first registered with the system, and the query is executed continuously as data arrives. Because the queries to be executed are known in advance, only a differential from the past processing result is processed when new data arrives, thereby making high-speed processing possible. Stream data processing thus enables a corporation to analyze, in real time, data that is generated at a high rate as in stock trading, and to monitor for and make use of an event that is useful to the business.
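The differential processing described above can be illustrated with a minimal Python sketch. It assumes a hypothetical continuous query that keeps a running sum over a sliding time window; the class name `SlidingSum`, the window length, and the sample values are illustrative assumptions and are not taken from STREAM or from this invention.

```python
from collections import deque


class SlidingSum:
    """Continuous SUM over the last `window` time units, updated incrementally."""

    def __init__(self, window):
        self.window = window
        self.items = deque()  # (time, value) pairs currently inside the window
        self.total = 0        # past processing result that is reused

    def insert(self, time, value):
        # Apply only the differential: add the contribution of the new datum ...
        self.items.append((time, value))
        self.total += value
        # ... and subtract the data that has fallen out of the window.
        while self.items and self.items[0][0] <= time - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total


query = SlidingSum(window=10)
first = query.insert(0, 5)
second = query.insert(5, 3)
third = query.insert(10, 2)   # the datum at time 0 leaves the window here
```

Each arrival touches only the new datum and the expired data, rather than recomputing the sum over everything held in the window.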
A premise of stream data processing is that input data is in chronological order, which allows a stream data processing system to sequentially process data as soon as the data is input and thus implements real-time processing. In the case where data is input from nodes (computers) installed in dispersed bases, such as stock exchange markets, base stations, or electricity meters, data from one base and data from another base are not input in chronological order. A naive method therefore accomplishes stream data processing by sorting data in chronological order at the time the data is input. However, when there are many nodes at dispersed bases or the geographical distance between bases is great, the chronological sort at data input adds to the memory cost and the processing latency. Countermeasures for input data that is not in chronological order are disclosed in US 2008/0072221 (Patent Literature 1), US 2009/0172058 (Patent Literature 2), US 2009/0172059 (Patent Literature 3), US 2010/0106946 (Patent Literature 4), J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, D. Maier, “Out-of-Order Processing: a New Architecture for High-Performance Stream Systems”, in Proc. of the VLDB Endowment, 2008 (Non-Patent Literature 2), and B. Babcock, S. Babu, M. Datar, R. Motwani, and D. Thomas, “Operator Scheduling in Data Stream Systems”, 2005 (Non-Patent Literature 3). “Memory cost” refers to the amount of memory mounted on a computer that is spent to hold data waiting to be processed or the like. “Processing latency” is a delay, namely, the length of time from the input of stream data to a computer that processes the data until the output of the data.
Patent Literature 2 and Patent Literature 3 keep the memory cost and the latency from increasing in processing that aggregates non-chronological input data by not always waiting for input data that does not arrive in time, and by calculating the result of the processing as an approximate solution instead. However, data consistency or processing consistency cannot be maintained by processing that uses an approximate solution alone, and Patent Literature 2 and Patent Literature 3 are therefore applicable only to a limited range of business operations and the like.
Patent Literature 1 keeps the memory cost and the latency from increasing by sending to the stream processing system a signal for permitting processing to proceed in the case where data arrival does not occur for a given period of time or longer. A problem is that, because data that arrives after a time specified in advance is discarded, processing consistency cannot be maintained.
In Non-Patent Literature 2, non-chronological inputs to a stream are permitted. A control packet that explicitly guarantees the progress of time is sent as an input to an operator. The operator keeps processing data until the time of the control packet is reached. The processing of Non-Patent Literature 2 has a problem in that, when control packets are transmitted frequently, the need to process the control packets degrades the processing performance of a computer. Another problem is that widening the interval of control packet transmission increases the processing latency and the memory cost, because each operator in Non-Patent Literature 2 waits for a control packet before executing processing.
Accordingly, it cannot be said that the known examples described above are satisfactory with regard to processing consistency and performance such as the latency and the memory cost.
An object of this invention is therefore to keep the processing latency and the memory cost from increasing while maintaining processing consistency.
A representative aspect of this invention is as follows. A stream data processing method for receiving stream data that is constituted of input data comprising time and executing processing of the stream data in accordance with a query registered in advance in a stream data processing device which comprises a processor and a memory, the stream data processing device comprising: a data inputting module which receives a plurality of pieces of input data constituting the stream data; a query registering module which receives a first key, a definition of the stream data, and a definition of the query to generate operators for processing the plurality of pieces of input data, the first key specifying, as data sets, items of the plurality of pieces of input data that are used as units by which the plurality of pieces of input data are processed in chronological order; and a data executing module which determines, for each of the data sets, which of the operators is to process the input data of the data set, and outputs a result of processing the input data of the data set with the operator, the stream data processing method comprising: a first step of receiving, by the query registering module, the first key, the definition of the query, and the definition of the stream data, and setting data sets to be processed in chronological order out of items comprised in the plurality of pieces of input data; a second step of receiving, by the data inputting module, the plurality of pieces of input data to generate an input stream; a third step of receiving, by the data executing module, the input stream and processing input data that is comprised in the input stream with the operators on a data set-by-data set basis; and a fourth step of generating results of the processing of the operators as a single output stream.
A reduction in processing latency and memory cost can be accomplished in stream data processing that handles non-chronological input data while maintaining processing consistency.
Embodiments of this invention are described below with reference to the accompanying drawings.
The stream data processing server 108 is a computer in which an I/O interface 115 constituting an interface unit, a central processing unit (CPU) 113 constituting a processing unit, and a memory 109 serving as a storing unit are joined to one another by a bus.
The stream data processing server 108 accesses the networks 104, 107, and 116 via the I/O interface 115. The CPU 113 (or the processor 113) can use storage 114, which is a storing unit, as non-volatile storage for storing a result of stream data processing, an intermediate result of processing, or settings data necessary for system operation. The storage 114 is connected directly with the use of the I/O interface 115, or is coupled via a network with the use of the I/O interface 115. The memory 109 stores, as modules constituting stream data processing, a query registering module 111, a data inputting module 110, and a data executing module 112. The operation of the respective modules is described later.
The first embodiment is described below with reference to the drawings. The first embodiment discusses a method of carrying out this invention based on operator scheduling that is disclosed in Patent Literature 4.
First, settings data which includes a stream and query definition 206 written by a user 204 and a data set key 205 is stored in the registration server 105 and transmitted from the registration server 105 to the stream data processing server 108. After receiving the settings data, the stream data processing server 108 uses a data set key reading module 211 of the query registering module 111 to generate a data set conversion table 210 and an execution area name reference table 215 from the data set key 205. The stream data processing server 108 also uses a compiling module 212 to compile the stream and query definition 206 and generate an operator tree 228.
After the stream data processing server 108 generates the data set conversion table 210, the execution area name reference table 215, and the operator tree 228, the sending servers 101 to 103 keep sending input data 201 to input data 203 to the stream data processing server 108.
The input data 201 to input data 203 are received by an input data receiving module 207 of the data inputting module 110 in the stream data processing server 108. The received input data is stored in an input data storage area 208. The input data storage area 208 is constituted of queues for temporarily holding the received input data 201 to input data 203. A partial chronological partition processing module 209 uses the data set conversion table 210 to partially sort, in chronological order, the input data 201 to input data 203 stored in the input data storage area 208. The sorted input data, which is denoted by 213, is stored in an input stream 214. The data inputting module 110 outputs the input stream to the data executing module 112.
The data executing module 112 uses an execution order determining module 217 to take the stored input data 213 out of the input stream 214. The execution order determining module 217 refers to the execution area name reference table 215 to extract an execution area name 216 when there is at least one of execution areas 218 and 219, and to generate a new execution area when there are no execution areas. The execution areas 218 and 219 are each constituted of a stream data storage queue 221, an ignition time reference table 222, and an execution state 223.
The execution order determining module 217 stores the input data 213 in the stream data storage queue 221 of the execution area 218 specified by the extracted execution area name 216, and uses the ignition time reference table 222 of the execution area 218 to extract execution data 224 and an execution operator 226.
An operator processing module 227 of the data executing module 112 executes given processing using the execution data 224, which is extracted by the execution order determining module 217, and the operator tree 228 of the execution operator 226. When executing the processing, the operator processing module 227 uses the execution state 223 of the execution area 218 specified by the execution area name 216.
Output data 229 which is output as a result of the execution of the operator processing module 227 is stored in an output stream 231 by a data outputting module 230. Data stored in the output stream 231 is received by the receiving server 117. In the case where operator processing is to be further performed on the data stored in the output stream 231, the data stored in the output stream 231 is received by the input data receiving module 207.
Details of the operation in the first embodiment are described next.
The definition 302 of
For these stream definition 301 and query definition 302, the data set key 205 (303 and 304) of
The data set key 205 is constituted of the data set key 304 for an input stream (a first key) and the data set key 303 for input data (a second key). The first embodiment discusses an example in which “meter” out of the columns of input data is set as the data set key 303 for input data, and “installed location” out of the columns of input data is set as the data set key 304 for an input stream.
The data set key 303 for input data indicates that pieces of input data that have the same value in a specified column are organized in chronological order to be input to the data inputting module 110 of the stream data processing server 108. Specifically, the data set key 303 indicates that pieces of input data are input organized in chronological order for each “meter”.
The data set key 304 for an input stream indicates that pieces of input data that have the same value in a specified column are treated as a group processed in a query to be processed in chronological order in the input stream 214 of the data executing module 112. Specifically, the data set key 304 indicates that pieces of input data are sorted by “installed location” to constitute data sets so that input data is processed in chronological order on a data set-by-data set basis. The data set key 304 for an input stream also indicates that the data executing module 112 generates as many execution areas as the number of types of “installed location” of input data. In other words, a data set is constructed for each “installed location” type of input data.
The data set key 303 for input data and the data set key 304 for an input stream may be specified by writing the data set keys in a query, or may be specified via a settings file, or may be specified by other methods.
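As a rough illustration of the two keys, the following Python sketch groups a handful of records first by the data set key for input data and then by the data set key for an input stream. The dictionary layout of `DATA_SET_KEY`, the column identifiers, the `kwh` column, and all sample values are assumptions made for this sketch only; the embodiment does not prescribe a settings format.

```python
# Hypothetical settings mirroring the data set key 205 of the first embodiment.
DATA_SET_KEY = {
    "input_data": "meter",                 # second key (303)
    "input_stream": "installed_location",  # first key (304)
}

# Illustrative records; every value here is made up for the sketch.
records = [
    {"time": "7:58", "meter": "m1", "installed_location": "A", "kwh": 3},
    {"time": "7:59", "meter": "m2", "installed_location": "B", "kwh": 5},
    {"time": "8:00", "meter": "m3", "installed_location": "B", "kwh": 2},
    {"time": "8:01", "meter": "m2", "installed_location": "B", "kwh": 4},
]


def data_sets(rows, column):
    """Group the rows that share the same value in `column`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    return groups


# Arrival order is chronological within each meter (second key).
per_meter = data_sets(records, DATA_SET_KEY["input_data"])
# Query processing is chronological within each installed location (first key).
per_location = data_sets(records, DATA_SET_KEY["input_stream"])
# One execution area is generated per distinct value of the first key.
execution_area_count = len(per_location)
```

With these sample records there are three per-meter groups but only two data sets, so two execution areas would be generated.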
In the example of
The example of
In
In
The stream data storage queues 1002 and 1015 are queues for storing, out of the input data 213 in the input stream 214, the input data that has the corresponding data set key value 402 for an input stream. The stream data storage queues 1002 and 1015 are collectively denoted by reference symbol 221.
The ignition time reference tables 1004 and 1017 respectively hold information about processing of input data stored in the stream data storage queues 1002 and 1015, specifically, operators 1005 and 1018 which execute the processing and ignition times 1006 and 1019 which are times to execute the operators.
For example, in the case where the ignition time 1006 of the operator RANGE in the execution area A (218) has “7:51” as the time of the oldest data in the operator RANGE, the next ignition time is “8:01” (1007), which is calculated by adding ten minutes to the last execution time, “7:51+10 minutes”, because the tally period is a ten-minute window as indicated by the electricity consumption tally query definition 302 of
The execution states 1008 and 1021 respectively indicate states that are used in processing executed by the operators on input data stored in the stream data storage queues 1002 and 1015. For instance, in an execution state A1 (1011) and an execution state B1 (1024) which are execution states of the operator RANGE, input data that contains a time within ten minutes from the current time is stored in the respective execution areas because the tally period is a ten-minute window. In an execution state A2 (1012) and an execution state B2 (1025) which are execution states of the operator GROUP BY, an aggregation value of input data that contains a time within ten minutes from the current time is stored.
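The ignition time arithmetic and the two execution states can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the treatment of the window boundaries, the use of a sum as the aggregation, and the sample data are all hypothetical; only the ten-minute window and the “7:51” to “8:01” calculation come from the description above.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # tally period of the query (ten-minute window)


def next_ignition_time(oldest_time):
    """Time at which the oldest datum held by RANGE leaves the window."""
    return oldest_time + WINDOW


def range_state(data, now):
    """Data whose time lies within ten minutes of `now`.

    Whether the window boundaries are inclusive is an assumption of this sketch.
    """
    return [(t, v) for t, v in data if now - WINDOW < t <= now]


def group_by_state(window_data):
    """Aggregation value over the data held in the window (a sum is assumed)."""
    return sum(v for _, v in window_data)


day = datetime(2000, 1, 1)  # arbitrary date; only the time of day matters here
oldest = day.replace(hour=7, minute=51)
ignition = next_ignition_time(oldest)  # 7:51 + 10 minutes

data = [
    (day.replace(hour=7, minute=45), 1),
    (day.replace(hour=7, minute=51), 2),
    (day.replace(hour=7, minute=59), 4),
]
held = range_state(data, now=day.replace(hour=8, minute=0))
aggregate = group_by_state(held)
```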
After the compiling, the query registering module 111 starts the reading of the data set key 205 in the data set key reading module 211 (1301). The data set key reading module 211 receives from the registration server 105 the data set key 303 for input data and the data set key 304 for an input stream which constitute the data set key 205 (1302).
Next, the data set key reading module 211 generates the execution area name reference table 215 (
In the case where the data set key 303 for input data and the data set key 304 for an input stream match (1304), the query registering module 111 ends the data set key reading module (1306).
In the case where the data set key 303 for input data and the data set key 304 for an input stream do not match in Step 1304, on the other hand, the query registering module 111 generates the data set conversion table 210 (
After generating the data set conversion table 210, the query registering module 111 ends the data set key reading module (1306).
Through the processing described above, the query registering module 111 compiles the stream definition 301 and the query definition 302, which are defined in the registration server 105, to generate an operator and the operator tree 228, and generates the data set conversion table 210 and the execution area name reference table 215 from the data set key 205, which is defined in the registration server 105. This processing can be executed when the query registering module 111 receives the stream and query definition 206 and the data set key 205 from the registration server 105.
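The key reading flow above can be sketched in Python as follows. The function name, the use of dictionaries for the two tables, and the use of `None` to represent the absence of the data set conversion table 210 are assumptions made for illustration.

```python
def read_data_set_key(input_data_key, input_stream_key):
    """Sketch of the data set key reading module 211.

    Returns (execution_area_names, conversion_table):
      execution_area_names: stream key value -> execution area name (table 215)
      conversion_table: input data key value -> stream key value (table 210),
                        or None when the two keys match and no table is needed
    """
    execution_area_names = {}                 # starts empty, filled at run time
    if input_data_key == input_stream_key:    # Step 1304: the keys match
        conversion_table = None               # no conversion table is generated
    else:
        conversion_table = {}                 # generated empty, filled on arrival
    return execution_area_names, conversion_table


names, table = read_data_set_key("meter", "installed_location")
same_names, same_table = read_data_set_key("meter", "meter")
```

When the keys match, input data already arrives in chronological order per data set, which is why no conversion table is needed in that branch.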
The data inputting module 110 then determines whether there is the data set conversion table 210 or not (1403). When there is no data set conversion table, the data inputting module 110 stores the input data 201 to input data 203 in the input stream 214 (1405), outputs the input stream 214 to the data executing module 112, and then ends the processing of the input data receiving module 207 (1406).
When it is determined in Step 1403 that there is the data set conversion table 210, on the other hand, the data inputting module 110 executes the following processing. In the case where the data set conversion table 210 contains the data set key value 401 for input data that is one of the items of the input data 201 to input data 203 (1404), the data inputting module 110 stores the input data in a queue of the input data storage area 208 that is determined by the data set key value 401 for input data (1408), and ends the processing of the input data receiving module 207 (1409).
To give an example, how the data set conversion table 210 and the input data storage area 208 look when the data inputting module 110 receives the input data 708 of
Here, “meter” is specified as a data set key for input data as in 303 of
In the case where the data set key value 401 for input data of the input data 201 to input data 203 is not found in the data set conversion table 210 in Step 1404, the data inputting module 110 stores the data set key value 401 for input data and data set key value 402 for an input stream of the received input data in the data set conversion table 210. The data inputting module 110 also generates a queue that is associated with the data set key value 401 for input data of the received input data in the input data storage area 208 (1407).
The data inputting module 110 then stores the input data in a queue of the input data storage area 208 that is determined by the data set key value 401 for input data (1408), and ends the processing of the input data receiving module 207 (1409).
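Steps 1403 to 1409 can be summarized in a short Python sketch. The function signature, the dictionary and deque containers, the column names, and the integer minutes-since-midnight time encoding are assumptions of this sketch; `None` stands for the case where no data set conversion table exists.

```python
from collections import deque


def receive(record, conversion_table, storage_area, input_stream,
            input_key="meter", stream_key="installed_location"):
    """Sketch of the input data receiving module 207 (Steps 1403 to 1409)."""
    if conversion_table is None:                  # Step 1403: no table 210
        input_stream.append(record)               # Step 1405: pass straight through
        return
    key_value = record[input_key]
    if key_value not in conversion_table:         # Step 1404: unknown key value
        # Step 1407: register the key pair and create the matching queue.
        conversion_table[key_value] = record[stream_key]
        storage_area[key_value] = deque()
    storage_area[key_value].append(record)        # Step 1408


# Times are minutes since midnight (479 = 7:59); the values are illustrative.
table, area, stream = {}, {}, []
receive({"time": 479, "meter": "m2", "installed_location": "B"}, table, area, stream)
receive({"time": 485, "meter": "m2", "installed_location": "B"}, table, area, stream)

direct = []
receive({"time": 479, "meter": "m2", "installed_location": "B"}, None, {}, direct)
```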
Instead of the input data 201 to input data 203 from the sending servers 101 to 103, dummy data that has the data set key value 401 for input data, the data set key value 402 for an input stream, and a time may be input to the data inputting module 110. The input data receiving module 207 in this case stores the data set key value 401 for input data and data set key value 402 for an input stream of the dummy data in the data set conversion table 210, and adds a queue associated with the data set key value 401 for input data of the dummy data to the input data storage area 208, the same way the input data described above is processed.
Alternatively, dummy data that has the data set key value 401 for input data, the data set key value 402 for an input stream, a time, and an end flag may be input so that the input data receiving module 207 can execute the following processing. In an example of this processing, the input data receiving module 207 first reads the end flag. When the end flag has a given value and the data set conversion table 210 contains a data set which is extracted using the data set key value 401 for input data of the dummy data and to which the input data belongs, the input data receiving module 207 deletes the data set key value 401 for input data and data set key value 402 for an input stream of the dummy data from the data set conversion table 210. The input data receiving module 207 also removes the queue associated with the data set key value 401 for input data of the dummy data from the input data storage area 208. An entry of the data set conversion table 210 and a queue in the input data storage area 208 can be removed by using the dummy data described above. This prevents the amount of the memory 109 that is used by the data inputting module 110 from reaching an excessive level.
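The handling of dummy data, with and without an end flag, might look like the following Python sketch. The field name `end`, the choice of `True` as the “given value” of the end flag, and the container types are assumptions for illustration.

```python
from collections import deque


def handle_dummy(dummy, conversion_table, storage_area,
                 input_key="meter", stream_key="installed_location"):
    """Register or remove a data set entry based on dummy data."""
    key_value = dummy[input_key]
    if dummy.get("end") is True:                  # end flag has the given value
        if key_value in conversion_table:
            # Remove the table entry and the queue to release memory.
            del conversion_table[key_value]
            storage_area.pop(key_value, None)
    else:
        # Same registration as for ordinary input data, without storing a datum.
        conversion_table.setdefault(key_value, dummy[stream_key])
        storage_area.setdefault(key_value, deque())


table, area = {}, {}
handle_dummy({"time": 0, "meter": "m9", "installed_location": "C"}, table, area)
registered = dict(table)
handle_dummy({"time": 0, "meter": "m9", "installed_location": "C", "end": True},
             table, area)
```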
Queues in the input data storage area 208 can be generated in an order that data arrives at the data inputting module 110 as is the case for the queues 805 to 807 of
When it is determined in Step 1403 that there is the data set conversion table 210, processing of the partial chronological partition processing module 209 is started next (1410).
The partial chronological partition processing module 209 first compares time between pieces of data at the head of queues in the input data storage area 208 that have the same data set key value 402 for an input stream (1411). In this processing, the partial chronological partition processing module 209 compares the times (values of the time 702 of
Specifically, after the pieces of input data of
The partial chronological partition processing module 209 next determines, based on the result of the comparison described above, whether or not there is data having the earliest time (hereinafter referred to as oldest data) among pieces of input data that have the same data set key value 402 for an input stream in the input data storage area 208 (1412).
In the case where there is the oldest data (1412), the partial chronological partition processing module 209 obtains the oldest data from the queues 805 to 807 of the input data storage area 208 and stores the obtained data in the input stream 214 (1413). Steps 1411 to 1413 are repeated as long as there is the oldest data. When the oldest data is no longer found, the partial chronological partition processing module 209 moves to Step 1402 to receive new input data from the sending servers.
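Steps 1411 to 1413 can be sketched in Python as a merge over the heads of the queues that share a data set key value for an input stream. One point is an assumption of this sketch rather than a statement of the embodiment: the oldest data is treated as determinable only when every queue of the data set has a datum at its head, since an empty queue might still receive an earlier datum. The function name, containers, and sample times (minutes since midnight) are likewise illustrative.

```python
from collections import deque


def drain_oldest(conversion_table, storage_area, input_stream):
    """Move data to the input stream in chronological order per data set."""
    progressed = True
    while progressed:                              # repeat Steps 1411 to 1413
        progressed = False
        # Collect the queue names that belong to each stream key value.
        groups = {}
        for key_value, stream_value in conversion_table.items():
            groups.setdefault(stream_value, []).append(key_value)
        for key_values in groups.values():
            queues = [storage_area[k] for k in key_values]
            # Assumption: oldest data exists only if no queue in the set is empty.
            if all(queues):                        # Step 1412
                oldest = min(queues, key=lambda q: q[0]["time"])  # Step 1411
                input_stream.append(oldest.popleft())             # Step 1413
                progressed = True


table = {"m2": "B", "m3": "B"}
area = {
    "m2": deque([{"time": 479, "meter": "m2"}, {"time": 485, "meter": "m2"}]),
    "m3": deque([{"time": 481, "meter": "m3"}]),
}
stream = []
drain_oldest(table, area, stream)
emitted = [r["time"] for r in stream]
```

In this run the datum at 485 stays queued because the queue for `m3` is empty after 481 is emitted, so a later-arriving earlier datum from `m3` cannot be ruled out.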
To give an example,
Through the processing described above, the partial chronological partition processing module 209 generates the input stream 214 by sorting, in ascending order of the time 702, pieces of input data of
When receiving the input data 708, the data inputting module 110 stores the input data 706 and the input data 708 in the input stream 214 as described above with reference to
First, the execution order determining module 217 starts its processing (1501) and obtains the input data 213 from the input stream 214 which has been output by the data inputting module 110 (1502). The execution order determining module 217 uses the data set key value 402 for an input stream of the input data 213 to extract an execution area name that is associated with the data set key value 402 for an input stream from the execution area name reference table 215, and sets an execution area having the extracted name as the execution area of a data set to which the input data 213 belongs. At this point, the execution order determining module 217 may check whether or not pieces of the input data 213 are arranged in chronological order in the same data set (the input stream 214).
The execution order determining module 217 next determines whether or not the execution area name reference table 215 contains the execution area associated with the data set key value 402 for an input stream of the input data 213 (1503).
In the case where the execution area associated with the data set key value 402 for an input stream is found in the table, the execution order determining module 217 stores the input data in the stream data storage queue 221 of the execution area to which the input data 213 belongs (1505).
When it is determined in Step 1503 that the execution area name reference table 215 does not contain the data set key value 402 for an input stream of the input data 213, on the other hand, the execution order determining module 217 generates an execution area that is associated with this data set key value 402 for an input stream, and adds this data set key value 402 for an input stream and the execution area name of the generated execution area to the execution area name reference table 215 (1504). The execution area generated by the execution order determining module 217 is set as the execution area of the data set to which the input data 213 belongs, and the input data 213 is stored in the stream data storage queue 221 of this execution area (1505). The ignition time reference table 222 of this execution area is an empty table devoid of entries such as the entry 1007. The execution state 223 of this execution area, too, is an empty area devoid of entries such as the entries 1011 and 1012. The ignition time reference table 222 and the execution state 223 are updated by the operator processing module 227.
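Steps 1503 to 1505 amount to a lookup-or-create on the execution area name reference table. A minimal Python sketch follows; the dictionary layout of an execution area, the naming scheme for generated areas, and the column name are assumptions of this sketch.

```python
from collections import deque


def new_execution_area():
    """A freshly generated execution area: empty queue, table, and state."""
    return {"queue": deque(), "ignition_times": {}, "state": {}}


def route(record, area_names, areas, stream_key="installed_location"):
    """Store `record` in the execution area of the data set it belongs to."""
    stream_value = record[stream_key]
    if stream_value not in area_names:             # Step 1503: no area yet
        name = f"execution_area_{stream_value}"    # hypothetical naming scheme
        area_names[stream_value] = name            # Step 1504: register the name
        areas[name] = new_execution_area()
    area = areas[area_names[stream_value]]
    area["queue"].append(record)                   # Step 1505
    return area


area_names, areas = {}, {}
route({"time": 479, "installed_location": "B"}, area_names, areas)
route({"time": 481, "installed_location": "B"}, area_names, areas)
route({"time": 478, "installed_location": "A"}, area_names, areas)
```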
In the case where the input data 706 is obtained from the input stream 214 of
Dummy data may be input instead. The execution order determining module 217 in this case generates an execution area that is associated with the data set key value 402 for an input stream of the dummy data, and adds the data set key value 402 for an input stream of the dummy data and the execution area name of the generated execution area to the execution area name reference table 215. Alternatively, dummy data that has an end flag may be input so that the execution order determining module 217 can execute the following processing. In an example of this processing, the execution order determining module 217 first reads the end flag. When the end flag has a given value and there is an execution area that is associated with the data set key value 402 for an input stream of the dummy data, the execution order determining module 217 removes the execution area and deletes the data set key value 402 for an input stream of the dummy data and the execution area name 502 of the removed execution area from the execution area name reference table 215.
The execution order determining module 217 next refers to the stream data storage queue 221 in the execution area of the data set to which the input data 213 belongs, and compares the time of data at the head of the queue (in the case of a query for processing data of a plurality of input streams, times of pieces of data at the head of a plurality of stream data storage queues) against the ignition time of each operator in the ignition time reference table 222. In the case where the comparison reveals that the stream data storage queue 221 contains data having the earliest time (1506), this data is set as execution data. When the execution data 224 is data at the head of the stream data storage queue 221, the execution order determining module 217 sets the first operator of the operator tree 228 as an execution operator. In the case of data whose operator ignition time is the current time, the execution order determining module 217 sets the operator associated with this ignition time as an execution operator (1507), and ends the processing of the execution order determining module 217 (1508).
In the case where the stream data storage queue 221 does not contain data having the earliest time in Step 1506, the execution order determining module 217 goes back to Step 1502 to obtain the next input data 213 from the input stream 214, and repeats the same processing as above.
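The selection of execution data and an execution operator in Steps 1506 and 1507 can be sketched in Python as follows. Several choices here are assumptions rather than details taken from the embodiment: an operator whose ignition time is not later than the time of the head datum is fired first, ties favor the operator, nothing is considered executable while the queue is empty, and the tagged-tuple return value is purely illustrative.

```python
from collections import deque


def choose_next(area, first_operator):
    """Return the next unit of work for one execution area, or None.

    ("data", record, operator)  : the head datum goes to the first operator
    ("ignition", time, operator): an operator fires at its ignition time
    """
    queue = area["queue"]
    if not queue:
        return None                                # go back and read more input
    head_time = queue[0]["time"]
    # Operators whose ignition time has been reached by the head datum.
    due = [(t, op) for op, t in area["ignition_times"].items() if t <= head_time]
    if due:
        ignition_time, operator = min(due)         # earliest ignition first
        return ("ignition", ignition_time, operator)
    # The head datum is earlier than every ignition time (Step 1506).
    return ("data", queue.popleft(), first_operator)   # Step 1507


# Times are minutes since midnight: 479 = 7:59, 480 = 8:00, 485 = 8:05.
area = {
    "queue": deque([{"time": 479}, {"time": 485}]),
    "ignition_times": {"RANGE": 480},
    "state": {},
}
first = choose_next(area, "RANGE")    # 7:59 is earlier than the 8:00 ignition
second = choose_next(area, "RANGE")   # head is now 8:05, so RANGE fires at 8:00
empty = choose_next({"queue": deque(), "ignition_times": {}, "state": {}}, "RANGE")
```

In this sketch the ignition entry is left in place after it is selected; updating the ignition time table is left to the operator processing, as in the description above.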
For example, when the execution order determining module 217 stores the input data 706 in the stream data storage queue 1015 of the execution area B 219, the execution order determining module 217 compares a time “7:59” of the input data 706 against an ignition time “8:00” of the operator RANGE (1020) which is stored in the ignition time reference table 222. Because the time “7:59” of the input data 706 is earlier than the ignition time, the input data 706 is set as the execution data 224 and the first operator 601 of the operator tree 228 (
In the operator processing module 227, the execution operator 226 processes the execution data 224 with the use of the execution state 223 in an execution area that is associated with the data set key value 402 for an input stream of the execution data 224 (1509).
The data executing module 112 next performs processing of the data outputting module 230 for outputting the processing result of the execution operator 226 (1510). The data outputting module 230 first determines whether or not there is output data with respect to the processing result of the execution operator 226 (1511). When there is no output data, the data outputting module 230 proceeds to Step 1513 and ends the processing of the data outputting module 230. When there is output data, on the other hand, the data outputting module 230 executes Step 1512.
In the case where the processing result of the execution operator 226 is received as output data by the receiving server 117 or the input data receiving module 207, the data outputting module 230 merges the output data 229 which is the processing result of the execution operator 226 into a single stream instead of merging on an execution area-by-execution area basis (1512). The data outputting module 230 merges a plurality of pieces of output data 229 and outputs the merged data as the output stream 231. After the processing of the data outputting module 230 is finished (1513), the data executing module 112 resumes the processing of the execution order determining module 217 (1514).
The execution order determining module 217 then determines whether or not the operator tree 228 has a next operator (1515). The execution order determining module 217 proceeds to Step 1516 when the operator tree 228 has a next operator and, when the operator tree 228 does not have a next operator, returns to Step 1506 to repeat the processing described above. In Step 1516, the execution order determining module 217 sets the next operator of the operator tree 228 as the execution operator 226, and then returns to Step 1508 to repeat the processing described above.
In short, as long as there is a next operator in the operator tree 228 (1515), the data executing module 112 continues the processing by setting the next operator as the execution operator 226, setting as the execution data 224 the next data that belongs to the same data set (that has the same data set key value 402 for an input stream) and having the same time as the processed execution data 224, and using the execution state 223 in an execution area that is associated with the data set key value 402 for an input stream of the execution data 224.
When there is no more next operator in the operator tree 228 in Step 1515, the data executing module 112 extracts executable data in Step 1506, and processes the extracted data in the manner described above. In the case where executable data cannot be extracted in 1506, the data executing module 112 obtains input data from the input stream in 1502, and operates in the manner described above.
For instance, after operator processing is conducted with the input data 706 set as the execution data 224 and the operator RANGE (601) set as the execution operator 226 as illustrated in
The first embodiment described above is one way to carry out this invention, based on the operator scheduling disclosed in Patent Literature 4; this invention can also be carried out by various other methods. For instance, instead of separating the execution areas 218 and 219 by the data set key value 402 for an input stream, the ignition time reference tables 1004 and 1017 and the stream data storage queues 1002 and 1015 may be separated by the data set key value 402 for an input stream. In that case, the execution states 1008 and 1021 are not separated by the data set key value for an input stream; instead, a separate execution state is set for each operator by defining the execution state in a query.
The first embodiment shares a commonality with Patent Literature 4 in that the execution order determining module 217 refers to the ignition time reference table 222 and the stream data storage queue 221. However, the first embodiment differs from Patent Literature 4 in that the execution area name reference table 215, the execution area name 216, and the execution areas 218 and 219 are referred to, which is a configuration unique to this invention. Another unique configuration of this invention is that the data set key reading module 211, the data set key 205, the partial chronological partition processing module 209, the data set conversion table 210, and the data outputting module 230 are included in the first embodiment unlike Patent Literature 4.
As described above, the data set conversion table 210 is generated in this invention from the data set key 205 received by the stream data processing server 108, which processes the input data 201 to input data 203 each containing a time (hereinafter simply referred to as input data). The data set conversion table 210 has two keys, the data set key value 402 for an input stream (the first key) which defines the type (group) of input data to be processed in the same execution area, and the data set key value 401 for input data (the second key) which defines the item of input data to be input organized in chronological order. The query registering module 111 receives the stream and query definition 206 to generate the operator tree 228, and outputs the operator tree 228 to the data executing module 112.
The data inputting module 110 of the stream data processing server 108 generates the input stream 214 by grouping together pieces of input data by the data set key value 401 for input data and then sorting the grouped input data in chronological order for each data set key value 402 for an input stream.
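The grouping and per-data-set sorting described above can be sketched as follows. This is a simplified illustration, not the module's actual implementation: the `InputData` layout, the function name `build_input_stream`, and the representation of the data set conversion table 210 as a plain dictionary from the second key to the first key are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class InputData(NamedTuple):
    time: int      # time contained in the input data
    key: str       # data set key value for input data (second key)
    payload: str


def build_input_stream(
    items: Iterable[InputData],
    conversion_table: Dict[str, str],
) -> Dict[str, List[InputData]]:
    """Group input data by the first key and sort each group by time.

    conversion_table maps the second key (per input data) to the first
    key (per input stream); this dictionary form is an assumption.
    """
    groups: Dict[str, List[InputData]] = defaultdict(list)
    for item in items:
        groups[conversion_table[item.key]].append(item)
    for group in groups.values():
        # Chronological order is enforced only within one data set.
        group.sort(key=lambda data: data.time)
    return dict(groups)


# Example: three senders whose data arrives out of order.
items = [
    InputData(5, "car1", "a"),
    InputData(2, "car2", "b"),
    InputData(3, "car1", "c"),
    InputData(1, "car3", "d"),
]
table = {"car1": "east", "car2": "east", "car3": "west"}
streams = build_input_stream(items, table)
```

No ordering is imposed between the "east" and "west" groups, which is the point of sorting only partially: a late arrival in one data set does not hold back the other.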
The data executing module 112 sets the execution areas 218 and 219 in the memory 109 as areas for processing input data for each data set key value 402 for an input stream in the data set conversion table 210. In other words, an execution area is generated for each group (data set) of input data classified by the item of the data set key value 402 for an input stream.
The data executing module 112 determines an operator associated with input data so that the operator executes given query processing in the execution area, which is provided for each type (data set) of input data included in the input stream 214. The data executing module 112 then outputs the output stream 231.
Through the processing described above, the input data 213 can be processed by operators in the execution areas 218 and 219 provided in the memory 109 separately for different values of the data set key value 402 for an input stream which is the first key. This means that processing by an operator can be conducted while maintaining chronological order only for pieces of input data in one piece of stream data that have the same data set key value 402 for an input stream, which is the first key. Processing by an operator for pieces of input data that have different values as the data set key value 402 for an input stream, which is the first key, can thus be executed in each of the execution areas 218 and 219 of the data executing module 112 without waiting for input data that has not arrived in time. The processing latency is accordingly prevented from increasing while maintaining processing consistency.
In short, because the data inputting module 110 sorts input data in chronological order for each data set before the data executing module 112 executes a query, the execution order of a plurality of pieces of input data received from the plurality of sending servers 101 to 103 can be determined in the same way that the execution order is determined for data sets within one server.
This eliminates the need to use a control packet for the progress of time and the need to perform data discarding or similar processing as in the examples of the related art even when non-chronological input data is received, and prevents the processing latency from increasing while maintaining consistency in data processing. This also eliminates the need to secure a memory area while waiting for a control packet as in the examples of the related art, and therefore prevents the memory cost from increasing.
In addition, the data executing module 112 can dynamically generate execution areas in the memory 109 during stream data processing. In other words, even after stream data processing has started, the stream data processing server 108 does not generate an execution area until input data that corresponds to the data set key value 402 for an input stream is actually received. The stream data processing server 108 of this invention thus secures in the memory 109 only the execution areas necessary for stream data processing, and the increase in memory cost seen in the examples of the related art is therefore prevented.
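The on-demand generation of execution areas can be illustrated with the short sketch below. The class names `ExecutionArea` and `DataExecutor`, the area naming scheme, and the method names are hypothetical; the sketch only shows the idea that an area is allocated the first time data for a given data set key value arrives.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class ExecutionArea:
    """Hypothetical stand-in for an execution area: a queue plus a state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.queue: Deque[Tuple[int, str]] = deque()  # stream data storage queue
        self.state: Dict[str, object] = {}            # execution state


class DataExecutor:
    def __init__(self) -> None:
        # No execution area exists when processing starts.
        self.areas: Dict[str, ExecutionArea] = {}

    def area_for(self, stream_key: str) -> ExecutionArea:
        """Return the area for stream_key, creating it on first use."""
        area = self.areas.get(stream_key)
        if area is None:
            area = ExecutionArea(f"area_{len(self.areas) + 1}")
            self.areas[stream_key] = area
        return area

    def enqueue(self, stream_key: str, time: int, payload: str) -> None:
        self.area_for(stream_key).queue.append((time, payload))


executor = DataExecutor()
areas_at_start = len(executor.areas)
executor.enqueue("A", 1, "x")
executor.enqueue("A", 2, "y")
executor.enqueue("B", 1, "z")
```

Memory is consumed only for the two key values that actually appeared, regardless of how many key values are possible.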
A second embodiment of this invention is described next with reference to the drawings. The second embodiment discusses a method of carrying out this invention based on round robin operator scheduling that is disclosed in Non-Patent Literature 3.
The second embodiment differs from the first embodiment in that the data set key reading module 211 of the query registering module 111 generates an executable data reference table 1701 from the data set key 205 and from the operator tree 228 generated by the compiling module 212. The data inputting module 110 executes the same processing that is executed in the first embodiment. The execution order determining module 217 of the data executing module 112 uses the executable data reference table 1701 to extract the execution data 224 and the execution operator 226. As illustrated in
In processing of the query registering module 111 of the second embodiment, unlike the first embodiment, the executable data reference table 1701 is generated from the data set key 205 for an input stream received from the registration server 105, which transmits a user's input, and from operators included in an operator tree (which is generated by the compiling module) (2001). The generated executable data reference table 1701 may contain data such as data of the entries 1805 and 1806 of
Processing of the data inputting module 110 is the same as in the first embodiment.
In the processing of the data executing module 112, the execution order determining module (2107) keeps storing input data in a stream data storage queue following the procedures of 1502 to 1505 (which are the same as those in the first embodiment), as long as there is input data in the input stream 214 (2101).
In Steps 1502 to 1505, the execution order determining module 217 stores the input data 213 of the input stream 214 in the stream data storage queues 221 of the execution areas 218A and 219A, which are provided separately for each different data set key value 402 for an input stream, in the same manner as in
When there is no more input data 213 in the input stream 214 (2101 of
The data executing module 112 then compares the time of data at the head of the stream data storage queue 221 in the execution area of a data set to which the obtained input data 213 belongs (in the case of a query for processing data of a plurality of input streams, times of pieces of data at the head of a plurality of stream data storage queues, here, queues 1002 and 1015), and selects a stream data storage queue that contains data having the earliest time. The data executing module 112 updates the executable data reference table 1701 (
Specifically, the execution order determining module 217 extracts the input data 213 whose head data has the earliest time out of pieces of the input data 213 in the stream data storage queues 1002 and 1015, and updates the executable data reference table 1701 by writing a value that indicates that data is executable (for example, “∘”) in an entry of the executable data reference table 1701 that is associated with the extracted input data 213 and the operator tree 228.
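As a rough sketch of this update, the following code picks the queue whose head data has the earliest time and marks the matching table entry as executable. The layout of the executable data reference table 1701 is not reproduced here; modeling it as a dictionary keyed by (data set key value, operator name) with a boolean in place of the "∘" mark is an assumption, as are the function and queue names.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

Queue = Deque[Tuple[int, str]]  # (time, payload) pairs, oldest first


def mark_executable(
    table: Dict[Tuple[str, str], bool],
    data_set: str,
    operator: str,
    queues: Dict[str, Queue],
) -> Optional[str]:
    """Select the queue with the earliest head time and flag the entry.

    Returns the name of the selected queue, or None when every queue
    of the data set is empty (in which case the table is left as is).
    """
    head_times = {name: q[0][0] for name, q in queues.items() if q}
    if not head_times:
        return None
    earliest = min(head_times, key=head_times.get)
    table[(data_set, operator)] = True  # stands in for writing the mark
    return earliest


table: Dict[Tuple[str, str], bool] = {}
queues = {
    "queue_1002": deque([(5, "a")]),
    "queue_1015": deque([(3, "b")]),
}
selected = mark_executable(table, "A", "RANGE", queues)
```

In this example `selected` names the queue whose head carries time 3, and the entry for data set "A" and the operator is flagged.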
The execution order determining module 217 refers to the updated executable data reference table 1701 and, in the case where executable data is found in one of the data sets (2103 of
When there is no more data that can be executed by the execution operator 226 in this data set, the execution order determining module 217 resets the associated entry of the executable data reference table (
In the case where the result of the processing of the execution operator 226 is to be obtained as output data by the receiving server 117 or the input data receiving module 207 (1511), the execution order determining module 217 merges the processing result into a single stream as in the first embodiment (1512).
The execution order determining module 217 repeats the processing of Steps 2103 to 1512 described above. In the case where the execution data 224 cannot be found in Step 2103, the execution order determining module 217 executes the processing of Step 2103 with the next operator of the operator tree 228 set as the execution operator 226 (2104). When there is no more next operator in the operator tree 228 in Step 2104, the execution order determining module 217 returns to Step 1502 to obtain the input data 213 from the input stream 214, and processes the obtained input data 213 in the manner described above.
For example, the stream data storage queues 221 of
As illustrated in
After the execution of the operator RANGE 601, there is no more data that can be executed by the operator RANGE 601. The execution order determining module 217 therefore sets the next operator in the operator tree 228 (
Lastly, the operator ISTREAM 603 processes execution data 1604, execution data 1606, and execution data 1602, and the execution order determining module 217 stores the results of the processing in the output stream as output data 1203, output data 1202, and output data 1204 as in the first embodiment.
The processing described above, which is performed by the data executing module 112, is one of the round robin operator scheduling methods disclosed in Non-Patent Literature 3: the same operator is used as the execution operator 226 to process data as long as the execution data 224 is found for that operator. Scheduling methods other than the one described above may also be implemented in this invention, such as a scheduling method in which the same operator is used as the execution operator 226 for a given period of time or for a given number of times, and a scheduling method in which the execution operator is selected at random from among operators for which executable data can be found.
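The scheduling policy described above, in which one operator keeps running until it has no more executable data before the next operator takes over, can be sketched as follows. This is a simplified model for a single data set with a linear chain of operators; the function name `run_round_robin` and the operators `op_a` and `op_b` are hypothetical and are not the operators of the embodiment.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Operator = Tuple[str, Callable[[int], int]]


def run_round_robin(
    operators: Sequence[Operator], inputs: Sequence[int]
) -> Tuple[List[int], List[str]]:
    """Run each operator until its input queue is drained, then move on.

    queues[i] feeds operators[i]; queues[-1] collects the final output.
    The returned trace lists the operator used for each execution.
    """
    queues: List[Deque[int]] = [deque(inputs)] + [deque() for _ in operators]
    trace: List[str] = []
    progressed = True
    while progressed:
        progressed = False
        for index, (name, function) in enumerate(operators):
            # Keep the same execution operator while executable data exists.
            while queues[index]:
                item = queues[index].popleft()
                queues[index + 1].append(function(item))
                trace.append(name)
                progressed = True
    return list(queues[-1]), trace


outputs, trace = run_round_robin(
    [("op_a", lambda x: x + 1), ("op_b", lambda x: x * 10)],
    [1, 2, 3],
)
```

The trace shows three consecutive runs of `op_a` followed by three of `op_b`. Bounding the inner loop by a count or a time budget would give the alternative policies mentioned above.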
The second embodiment shares a commonality with Non-Patent Literature 3 in that the execution order determining module 217 refers to the stream data storage queue 221, but differs from Non-Patent Literature 3 in that the execution area name reference table 215, the execution area name 216, the execution areas 218 and 219, and the executable data reference table 1701 are referred to. A feature of the second embodiment in terms of configuration is that, unlike Non-Patent Literature 3, the stream data processing server 108 includes the data set key reading module 211, the data set key 205, the partial chronological partition processing module 209, the data set conversion table 210, and the data outputting module 230.
A third embodiment is described next with reference to the drawings. Unlike the first embodiment and the second embodiment, the third embodiment does not change the stream data processing engine of the examples of the related art (which corresponds to the data executing module 112 in this invention) and yet attains the same effect.
In the third embodiment, unlike the first embodiment and the second embodiment, the user 204 specifies a maximum data set count 2301, which is the maximum number of data sets to be processed, for the query registering module 111, and the specified maximum data set count 2301 is transmitted from the registration server 105 to the stream data processing server 108.
The query registering module 111 uses a stream and query duplicating module 2302 to receive the data set key 205, the stream and query definition 206, and the maximum data set count 2301 from the registration server 105, and to generate duplicate stream and query definitions 2303. The query registering module 111 uses the compiling module 212 to generate a plurality of operator trees 228 from the duplicate stream and query definitions 2303, and the operator trees 228 are respectively transferred to a plurality of data executing modules 112. While
Unlike the first embodiment and the second embodiment, the data inputting module 110 stores input data partially sorted in chronological order by the partial chronological partition processing module 209 in the input streams 214 of the plurality of data executing modules 112 that are indicated by a data set-stream association table 2305.
The input streams 214 are processed by the plurality of data executing modules (#1 to #3) 112 in the same manner that is used in conventional stream data processing systems, and pieces of the output data 229 which are the results of the processing are output to be stored in the output streams 231.
Lastly, the pieces of the output data 229 are obtained by a stream merging module 2306 from the output streams 231 of the plurality of data executing modules 112, and are stored in an output queue 2307.
The pieces of the output data 229 stored in the output queue 2307 are transmitted to the receiving server 117 or the input data receiving module 207.
The query registering module 111 then generates the data set conversion table 210 in Steps 1304 and 1305 as in
The compiling module 212 of the query registering module 111 compiles the duplicate stream and query definitions 2303 generated by the duplication, to thereby generate a plurality of operator trees 228 (2805). The query registering module 111 stores the generated plurality of operator trees 228 in the operator processing modules 227 in different data executing modules #1 to #3. For example, in the case where the stream definition 301 and the query definition 302 are set as illustrated in
The stream partition processing module 2304 uses the data set key value 2501 for an input stream of input data that has the earliest time to extract, from the data set-stream association table 2305, the input stream 2502 to which this data belongs (2905), and stores the data in the input stream 214 (2907). In the case where the input stream 2502 to which the data in question belongs is not extracted in Step 2905, the stream partition processing module 2304 adds to the data set-stream association table 2305 the data set key value 2501 for an input stream of this data and the name of an input stream 2502 that has not been allocated a data set (2906), and stores the data in the added input stream (2907). Once the data is stored, the input stream 2502 provided for each data set key value 2501 for an input stream is processed by the one of the plurality of data executing modules #1 to #3 that is associated with the input stream 2502, as in conventional stream data processing systems, and the result of the processing is stored in the output stream 231.
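The lookup-or-allocate behavior of Steps 2905 to 2907 can be sketched as follows. The class name `StreamPartitioner`, the stream names "S1", "S2", and so on, and the decision to raise an error when more data sets arrive than the maximum data set count allows are all assumptions made for the example; the embodiment does not state what happens in that case.

```python
from typing import Dict, List


class StreamPartitioner:
    """Hypothetical model of the data set-stream association table."""

    def __init__(self, max_data_sets: int) -> None:
        names = [f"S{i}" for i in range(1, max_data_sets + 1)]
        self.unallocated: List[str] = list(names)   # streams without a data set
        self.table: Dict[str, str] = {}             # data set key -> stream name
        self.streams: Dict[str, List[object]] = {name: [] for name in names}

    def store(self, stream_key: str, data: object) -> str:
        """Store data in the stream of its data set, allocating if needed."""
        name = self.table.get(stream_key)            # lookup (cf. Step 2905)
        if name is None:
            if not self.unallocated:
                # Assumed behavior: the source does not describe this case.
                raise RuntimeError("maximum data set count exceeded")
            name = self.unallocated.pop(0)           # allocate (cf. Step 2906)
            self.table[stream_key] = name
        self.streams[name].append(data)              # store (cf. Step 2907)
        return name


partitioner = StreamPartitioner(max_data_sets=2)
first = partitioner.store("A", "a1")
second = partitioner.store("B", "b1")
third = partitioner.store("A", "a2")
```

Each stream then holds the data of exactly one data set, so an unmodified engine instance attached to that stream sees only chronologically ordered input.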
The third embodiment described above can attain the same effect as that of the first embodiment without changing the conventional stream data processing engine (the data executing modules #1 to #3). Alternatively, instead of specifying the maximum data set count 2301, a stream definition and a query definition may be duplicated dynamically to deal with an increase in the number of data set key values 2501 for an input stream when the data executing modules #1 to #3 execute processing. Streams and queries obtained by the duplication are compiled to be registered in the generated operator trees 228.
The first to third embodiments have discussed an example in which the data set key 205, the stream and query definition 206, and other types of settings information are received from the registration server 105. Alternatively, the stream data processing server 108 may be provided with an input device to receive the settings information from the input device.
A detailed description of this invention has now been given with reference to the accompanying drawings. However, this invention is not limited to the concrete configuration given above, and encompasses various modifications and equivalent configurations that fall within the scope of claims set forth below.
This invention is applicable to a computer system that performs stream data processing on input data containing time.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2010/067587 | 10/6/2010 | WO | 00 | 4/30/2013 |