This invention relates to a technology for creating a template for a query that processes stream data.
Stream data processing is known as a technology for processing data from a multitude of sensors, as well as data related to settlement and trading at financial organizations or other similar entities. In stream data processing, a query is first registered in a system and, when data arrives, the query is executed continuously. Continuous Query Language (CQL) is a typical example of a language in which the query is written.
There has been known a technology of creating a template for a stream data processing query that is written in CQL in order to expand the range of use of stream data processing (for example, US 2011/0093490 A1).
In the technology of US 2011/0093490 A1, however, the schema of input stream data that is defined in the template is fixed. The schema of the template therefore needs to be modified depending on the type of the data source when a large quantity of information as in a social networking service (SNS), a blog, or the like is used for input stream data. Specifically, the schema of a template that has information of one SNS as input stream data differs from a schema for information of other SNSs and, accordingly, it is necessary to redefine the template in the language in which the query is written, or to prepare numerous templates in advance.
Redefining a template in the language in which the query is written requires a person capable of programming a query, and not all users who use stream data processing possess that ability. Preparing numerous templates in advance has a problem of increasing the work and cost of software engineers and the like.
This invention has been made in view of the problems described above, and an object of this invention is therefore to cut the cost of developing a template for a query by enabling a template to receive a plurality of types of input, without preparing numerous templates.
A representative aspect of this invention is as follows. A query generating method for generating a query for processing input stream data, the query generating method being performed by a computer comprising a processor and a memory, the query generating method comprising: a first step of reading, by the computer, a template in which the input stream data is divided into an essential column and an option column, and processing to be executed for the essential column is defined; and a second step of generating, by the computer, a query for dividing the input stream data into the essential column and the option column, for processing the essential column by using the template, and for outputting a result of the processing of the template and the option column as one piece of data.
According to this invention, input stream data is divided into an essential column and an option column, and the essential column on which processing of a template has been performed is combined with the option column. Receiving inputs of a plurality of types with the use of a single template is thus accomplished, and the cost of developing a template can be reduced by keeping the number of template types small.
Embodiments of this invention are described below with reference to the accompanying drawings.
The stream processing executing server 101 includes a CPU 104, which executes computing processing, a memory 102, which holds data and programs, storage 105, which stores programs and data, and an I/O interface 106, which is coupled to the network 110. A stream data processing engine 103 in the form of a program is loaded onto the memory 102 and executed by the CPU 104. The stream data processing engine 103 can be stored in the storage 105.
The stream data processing engine 103 processes stream data received from the data source 140 by, as described later, continuously executing the relevant stream processing query 700 generated by the query generating server 107. Continuous Query Language (CQL) described above, for example, can be used for the stream processing queries 700. The following description takes as an example a case in which the stream processing queries 700 are written in CQL.
The query generating server 107 includes a CPU 121, which executes computing processing, a memory 122, which holds data and programs, storage 123, which stores programs and data, and an I/O interface 124, which is coupled to the network 110. A template registering module 108 and a query generating module 109 in the form of a program are loaded onto the memory 122 and executed by the CPU 121. The storage 123 stores templates 111, pieces of template configuration information 112, stream processing definitions 500, and the stream processing queries 700. The template registering module 108 and the query generating module 109 in the form of a program can be stored in the storage 123.
The template configuration information 112 and function modules of the query generating module 109 are loaded in the form of a program onto the memory 122. The CPU 121 executes processing as programmed by the respective programs of the function modules, to thereby operate as function modules that provide given functions. For example, the CPU 121 executes processing as programmed by a template registering program, to thereby function as the template registering module 108. The same applies to other programs. The CPU 121 further operates as function modules that provide functions of a plurality of processing procedures executed by each program. A computer and a computer system are an apparatus and a system that include those function modules.
Programs, tables, and other types of information for implementing the functions of the query generating server 107 can be stored in the storage 123, or in a non-volatile semiconductor memory, or in a storage device such as a hard disk drive or a solid state drive (SSD), or in a computer-readable, non-transitory data storage medium such as an IC card, an SD card, or a DVD.
In main processing of the query generating server 107, the template registering module 108 sets the templates 111 and stores the templates 111 and the template configuration information 112 in the storage 123. When a stream processing definition is input, the templates 111 and the template configuration information 112 are used by the query generating module 109 to generate the stream processing queries 700.
The terminal 130 is a computer that includes a CPU, a memory, storage, an I/O interface, and an input/output apparatus (not shown), and is operated by a user or an administrator.
The stream processing query 700 divides input stream data into two types of data by extracting an essential column, which includes text, from the stream data and extracting an option column from the stream data. The stream processing query 700 at this point assigns an identifier that associates the essential column and the option column with each other (701). In the example of
The stream processing query 700 then executes template processing to check whether the essential column partially matches a letter string that is a given keyword (“keyword”), and outputs the essential column that includes the given keyword (702). The stream processing query 700 uses a given window operator to combine the output of the letter string partial matching processing with option column data whose text ID matches the text ID of the output (703). In the example of
In this invention, an essential column, which includes essential text, is extracted from input stream data, and the portions of the input stream data other than the text of the essential column are separated as an option column. The essential column is processed by given processing (702) with the use of one of the templates 111, and the output of the template 111 is then combined with the option column.
In this manner, only the essential column needs to be defined in each template 111 in order to apply the template 111 to stream data that has a different schema. In addition, the option column can be handled as metadata. The option column may be input stream data itself, or may be data that is obtained by subtracting the essential column from input stream data.
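The division, template processing, and combining described above can be sketched in Python. This is a minimal illustration rather than the CQL that the query generating module would emit: the function names `keyword_template` and `split_process_combine` are hypothetical, the identifier is assumed to be a simple running counter, and the option column is assumed to be every column other than the essential column.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Record = Dict[str, object]


def keyword_template(essential: Record, key: str) -> Optional[Record]:
    """Hypothetical template: emit the keyword when the text contains it."""
    if key in str(essential["str"]):
        return {"strID": essential["strID"], "keyword": key}
    return None


def split_process_combine(
    records: Iterable[Record],
    essential_column: str,
    template: Callable[[Record], Optional[Record]],
) -> Iterator[Record]:
    for text_id, record in enumerate(records):
        # (701) divide the record and tag both halves with the same identifier
        essential = {"strID": text_id, "str": record[essential_column]}
        option = {"textID": text_id}
        option.update({k: v for k, v in record.items() if k != essential_column})

        # (702) the template sees only the essential column
        output = template(essential)
        if output is None or output["strID"] != option["textID"]:
            continue

        # (703) combine the template output with the matching option column
        combined = {k: v for k, v in option.items() if k != "textID"}
        combined.update({k: v for k, v in output.items() if k != "strID"})
        yield combined
```

Because the template reads only `str` and `strID`, the same function can be applied to records with any other set of columns, which is the point of treating those columns as metadata.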
The template calling information generating module 202 obtains configuration information of the templates 111 (the template configuration information 112) written in the stream processing definitions 500, and generates the template calling information 203, which indicates for each template 111 the relation between input stream data and output stream data.
The combining processing inserting module 204 generates the stream processing query 700 by determining the output column to be combined and the window size based on the stream processing definitions 500 and the template configuration information 112.
An example of the templates 111 used in this embodiment is shown in
The template 111-1 defines a query that combines inquiry results of two SELECT statements. The query defined by the template 111-1 combines an inquiry in which, when the value of an essential column “str” includes a letter string specified by “$key”, the value of “extracted” is the letter string specified by “$key” with an inquiry in which, when the value of the essential column “str” does not include the letter string specified by “$key”, the value of “extracted” is an empty letter string.
The template 111-2 defines a query that combines inquiry results of two SELECT statements. The query defined by the template 111-2 combines an inquiry in which, when the value of the essential column “str” matches the letter string specified by “$key”, the value of “extracted” is the letter string specified by “$key” with an inquiry in which, when the value of the essential column “str” does not match the letter string specified by “$key”, the value of “extracted” is an empty letter string.
The templates 111-1 and 111-2 are collectively denoted by a symbol 111 in the following description.
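As a rough Python analogue of the two templates, each function below returns the value that the combined query would place in "extracted" for one input row. The function names mirror the template types "string_part_match" and "string_match" named in this description; representing the union of two SELECT statements as a single conditional expression is a simplification made for this sketch.

```python
def string_part_match(value: str, key: str) -> str:
    """Partial match: `key` when `value` contains it, otherwise an empty string."""
    return key if key in value else ""


def string_match(value: str, key: str) -> str:
    """Exact match: `key` when `value` equals it, otherwise an empty string."""
    return key if value == key else ""
```

In both cases every input row produces exactly one output row, because the two SELECT statements cover complementary conditions.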
The template configuration information 112-1 includes a field for a name 1121 in which the name (or function name) of the template 111-1 is stored, a field for an input schema 1122, which corresponds to the essential column, a field for an output schema 1123, which indicates an output from the template 111-1, a field for an ID 1124 in which an identifier is stored, and a field for a combining window size 1125 in which the window size in combining processing is stored.
The input schema 1122 corresponds to an essential column 2034 of the template calling information 203 which is described later, and the output schema 1123 corresponds to an output column 2036 of the template calling information 203.
The template configuration information 112-1 and 112-2 are collectively denoted by a symbol 112 in the following description.
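The fields of the template configuration information can be pictured as a small record type. The sketch below is only an illustration: the class name `TemplateConfig` and the Python attribute names are hypothetical, and the example values are assumptions pieced together from the column names used elsewhere in this description.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TemplateConfig:
    name: str                       # name 1121
    input_schema: Dict[str, str]    # input schema 1122 (essential column)
    output_schema: Dict[str, str]   # output schema 1123
    id_column: str                  # ID 1124
    combining_window_size: str      # combining window size 1125


# Illustrative values only; the actual contents are those of the figure.
part_match_config = TemplateConfig(
    name="string_part_match",
    input_schema={"str": "STRING"},
    output_schema={"extracted": "STRING"},
    id_column="strID",
    combining_window_size="NOW",
)
```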
As described, the templates 111 and template configuration information 112 of this invention define only a letter string (STRING) as the essential column of the input schema 1122, which allows the system to handle data of various SNSs and a diversity of blogs as input stream data.
Each stream processing definition 500 defines the name and configuration of stream data that is input in a stream definition 501. In the example of
The stream processing definition 500 further defines that two templates 111 are to be called in template calls 502 and 503. The template call 502 indicates that a template whose call name is “twitter_keyword” and whose type (or function) is “string_part_match” (letter string partial matching processing) is called in “CALL TEMPLATE”. The template call 502 indicates that, in the template having the call name “twitter_keyword” (111-1 of
In the template call 503, the template (111-2 of
The option column of the template having the call name “twitter_keyword” includes the columns other than the essential column “text” out of the columns in the stream definition 501, namely, the columns “msgID”, “time”, and “userID”. The option column of the template having the call name “twitter_keyword_influencer” includes the columns other than the essential column “userID”, namely, the columns “msgID”, “time”, “text”, and “keyword”. The columns “msgID”, “time”, “text”, and “userID” constitute the input schema of the template having the call name “twitter_keyword_influencer”.
The stream processing definitions 500 thus define for each template 111 stream data that is input and stream data that is output.
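The way an option column follows from an input schema and an essential column can be sketched as a one-line filter. The helper name `option_columns` is hypothetical, and the schemas passed to it in the usage note are taken from the column names listed above.

```python
from typing import List, Sequence


def option_columns(input_schema: Sequence[str], essential: str) -> List[str]:
    """Return every column of the input schema except the essential column."""
    if essential not in input_schema:
        raise ValueError(f"essential column {essential!r} is not in the input schema")
    return [column for column in input_schema if column != essential]
```

For instance, `option_columns(["msgID", "time", "text", "userID"], "text")` yields the three columns listed for "twitter_keyword".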
Each single record of the template calling information 203 includes a field for a template call name 2031 which stores the call name of one of the templates 111 in the stream processing definition 500 of
The values of those fields 2031 to 2036 may be extracted from the stream definition 501 and definitions of the template calls 502 and 503 of
The query generating module 109 defines in 712 of
The query generating module 109 next reads the template call 502 of the read stream processing definition 500 and the template 111-1 to deploy the specifics of “string_part_match” of the template 111-1 in the stream processing query 700 (713). The query generating module 109 inserts a combining query definition that combines the output column of the template “string_part_match” with the option column (714). The insertion of the combining query definition is executed by the combining processing inserting module 204 of
In 715 to 717 of
The query generating module 109 thus generates the stream processing query 700 from the two templates 111-1 and 111-2 that are included in the read stream processing definition 500.
Details of the processing that is executed by the query generating module 109 of
The template calling information generating module 202 of the query generating module 109 reads the stream processing definitions 500 specified in the query generation request out of the storage 123 (902). The template calling information generating module 202 next extracts the templates 111 that are included in the stream processing definitions 500. The template calling information generating module 202 reads configuration information of the extracted templates 111 (the template configuration information 112) out of the storage 123 (903). The templates 111 extracted from the stream processing definitions 500 may be the templates 111 that are written in “CALL TEMPLATE” as in the template calls 502 and 503 of
The template calling information generating module 202 determines for each read piece of the template configuration information 112 whether or not the template calling information 203 is registered in the memory 122 (904).
In the case where the template calling information 203 is already registered for every read piece of the template configuration information 112, the template calling information generating module 202 ends the processing (907).
In the case where the template configuration information 112 for which the template calling information 203 has not been registered is found, the template calling information generating module 202 generates the template calling information 203 for each found piece of the template configuration information 112, and stores the generated information in the memory 122 in Steps 905 and 906.
First, in Step 905, the template calling information generating module 202 obtains from the stream processing definitions 500 information about a template for which the schema of input stream data has been established. With the schema of input stream data established, input schemata and output schemata are tracked starting from the template 111 that has the stream definition 501 in the stream processing definitions 500 of
In Step 906, the template calling information generating module 202 sets the group of columns included in the output column 2036 and the option column 2035 as the schema of input stream data of the next template, which has the output stream data of the current template 111 as an input. In other words, the output schema of the preceding template 111 is established and the template 111 that has the established output schema as an input is set as the next processing target. The template calling information generating module 202 then returns to Step 904 to repeat the processing described above for every read piece of the template configuration information 112.
Through the processing described above, the template calling information 203 is generated for the template configuration information 112 of each template written in the stream processing definitions 500 while establishing input schemata and output schemata. In other words, the processing is executed sequentially from the template 111 for which the output schema of its preceding template has been established. The template calling information 203 may be stored in the storage 123.
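The loop of Steps 904 to 906 can be sketched as follows. This is a simplified model under several assumptions: the class and function names are hypothetical, each template call names the stream it reads through a hypothetical `input_stream` field, and the schema handed to the next template is taken literally from Step 906 as the option column followed by the output column.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TemplateCall:
    call_name: str
    input_stream: str          # stream or earlier call this call reads (assumed field)
    essential: str
    output_columns: List[str]


@dataclass
class CallingInfo:
    call_name: str
    essential: str
    option: List[str]
    output: List[str]


def build_calling_info(stream_name: str, stream_schema: List[str],
                       calls: List[TemplateCall]) -> List[CallingInfo]:
    established: Dict[str, List[str]] = {stream_name: list(stream_schema)}
    pending = list(calls)
    result: List[CallingInfo] = []
    while pending:
        # Step 905: pick the calls whose input schema is already established
        ready = [c for c in pending if c.input_stream in established]
        if not ready:
            names = ", ".join(c.call_name for c in pending)
            raise ValueError(f"input schema cannot be established for: {names}")
        for call in ready:
            schema = established[call.input_stream]
            if call.essential not in schema:
                raise ValueError(f"{call.call_name}: missing column {call.essential}")
            option = [c for c in schema if c != call.essential]
            result.append(CallingInfo(call.call_name, call.essential,
                                      option, list(call.output_columns)))
            # Step 906: option + output columns become the next input schema
            established[call.call_name] = option + list(call.output_columns)
            pending.remove(call)
    return result
```

Calls may be listed in any order; a call is processed only once the schema of the stream it reads has been established, which matches the sequential behavior described above.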
The combining processing inserting module 204 first reads the stream processing definitions 500, the template configuration information 112, and the template calling information 203 (1001 and 1002). The combining processing inserting module 204 determines whether or not the generation of the ID assigning query, the in-template query, and the combining query has been completed for every template 111 written in the stream processing definitions 500 (1003). In the case where the generation processing has been completed for every written template 111, the combining processing inserting module 204 ends this combining processing (1008). In the case where the template 111 for which the generation processing has not been completed is found, the combining processing inserting module 204 repeatedly executes Steps 1004 to 1006 until every written template 111 has been processed.
The combining processing inserting module 204 extracts the template 111 for which the ID assigning query, the in-template query, and the combining query have not been generated (1004). The combining processing inserting module 204 executes ID assigning query definition generating processing (an ID assigning query definition generating module) shown in
Details of processing of generating the ID assigning query, the in-template query, and the combining query for each template 111 are described below. The combining processing inserting module 204 includes the ID assigning query definition generating module, the in-template query definition generating module, and the combining query definition generating module, and is at the center of the execution of the following processing.
The combining processing inserting module 204 generates the definition of a query for assigning the input stream data an identifier that uniquely associates the input stream data with the output of the template 111 (for example, textID of
Through the processing described above, the combining processing inserting module 204 generates a query for assigning the input stream data an identifier that uniquely associates the input stream data with the output of the template 111 as the ID assigning query definition of the called template 111. The ID assigning query definitions in 712 of
The combining processing inserting module 204 reads a query written in the called template 111 (2602). The combining processing inserting module 204 defines input stream data of the called template 111 which is included in the read query as the output of the ID assigning query generated in
The combining processing inserting module 204 generates the definition of the in-template query through the processing described above, and then ends the processing (2605). The query definitions in 713 of
The combining processing inserting module 204 determines the window size of the combining query. The NOW window is set as the window size for the combining of the template 111 with output stream data of the template 111. The window for data that is simply input stream data to which an ID has been assigned (the option column) as illustrated in
The combining processing inserting module 204 determines the output column of the combining query. The combining processing inserting module 204 determines, as the output column of the combining query, other columns of the output stream data of the template 111 than the ID column and the option column out of input stream data of the template 111 (1203). Columns to be combined as illustrated in
The combining processing inserting module 204 next determines a combining condition of the combining query. For example, such a combining condition is determined that an ID assigned to input stream data (option column) of the template 111 (strID=textID) matches an ID included in output stream data of the template 111 (strID=textID) as illustrated in Step 703 of
The combining processing inserting module 204 uses the determined window size, output column, and combining condition to determine a SELECT clause, a FROM clause, and a WHERE clause, and thus generates the combining query (1205).
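How the three determined pieces might be assembled into query text is sketched below. The emitted string is CQL-like and only illustrative: the exact syntax of the stream data processing engine is not reproduced here, the function name, aliases `t` and `o`, and stream names are hypothetical, and defaulting the option-column window to NOW is an assumption made for this sketch.

```python
from typing import Sequence


def build_combining_query(
    output_stream: str,
    option_stream: str,
    output_columns: Sequence[str],
    option_columns: Sequence[str],
    output_id: str = "strID",
    option_id: str = "textID",
    output_window: str = "NOW",
    option_window: str = "NOW",   # assumed default, not taken from the source
) -> str:
    # Output column: option columns plus template output columns, IDs excluded
    selected = [f"o.{c}" for c in option_columns if c != option_id]
    selected += [f"t.{c}" for c in output_columns if c != output_id]
    select_clause = "SELECT " + ", ".join(selected)
    # Window size: one window per side of the combination
    from_clause = (f"FROM {output_stream}[{output_window}] AS t, "
                   f"{option_stream}[{option_window}] AS o")
    # Combining condition: the two identifiers must match
    where_clause = f"WHERE t.{output_id} = o.{option_id}"
    return " ".join([select_clause, from_clause, where_clause])
```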
Through the processing described above, the definition of the combining query for combining input stream data and output stream data of the template 111 is generated, and the processing is ended (1206). The query definitions in 714 of
By executing the processing described above of
The terminal 130 transmits a stream processing request in which one of the stream processing queries 700 is specified to the stream processing executing server 101. The stream processing executing server 101 obtains the specified stream processing query 700 from the query generating server 107, and executes the stream processing query 700 with the use of the stream data processing engine 103. The stream processing executing server 101 receives stream data from the data source 140 and uses the stream processing query 700 to execute given processing.
In this invention, where each template 111 and template configuration information 112 define only a letter string (STRING) as the essential column of the input schema 1122 as shown in
A single template 111 can thus receive a plurality of types of input, which removes the need to prepare numerous templates, and the cost of developing a template for a query is accordingly reduced.
In addition, when text data of a new service is used, the existing template 111 can be applied instead of creating a new template 111. This enables a user with little programming skill to use stream data easily.
Second Embodiment
The template registering module 108 of the second embodiment receives as an input the ID-unassigned template 111A and the partial template configuration information 112A in which the ID and the window size are undetermined, and generates the template 111 and the template configuration information 112, which include an ID (strID) and a window size as in the first embodiment, in a manner described later.
For that purpose, the template registering module 108 has an automatic ID assigning module 1081, a parser (parsing module) 1082 of the stream data processing engine 103 of the stream processing executing server 101, and a window size calculating module 1083 as shown in
The automatic ID assigning module 1081 of the template registering module 108 of
For example, “id” is not defined in the SELECT statements in the ID-unassigned template 111A of
When assigning an ID to the template 111A is possible, the automatic ID assigning module 1081 of the template registering module 108 also assigns “id” as the ID 1124 in the partial template configuration information 112A, in a manner described later. The window size 1125 is set to “NOW” by the window size calculating module 1083 of the template registering module 108 when the template 111-3 (111A) fulfills a given condition, thereby generating the template configuration information 112-3.
The automatic ID assigning module 1081 of the template registering module 108 analyzes the read ID-unassigned template 111A to determine whether or not an ID can be assigned as described later. When assigning an ID is possible, the automatic ID assigning module 1081 assigns an ID to the ID-unassigned template 111A and the partial template configuration information 112A (1903). When assigning an ID is not possible, the automatic ID assigning module 1081 notifies the terminal 130 of the fact that no ID can be assigned.
The window size calculating module 1083 of the template registering module 108 analyzes the read ID-unassigned template 111A to determine a window size that is used when the option column and the output stream data are combined (1904). In the case where determining the window size is not possible, the window size calculating module 1083 notifies the terminal 130 of the fact that the window size cannot be determined.
The template registering module 108 stores in the storage 123 the template 111-3 to which an ID has been assigned and the template configuration information 112-3 in which an ID and a window size have been set (1905).
Through the processing described above, the ID-unassigned template 111A and the partial template configuration information 112A are received and, when the ID-unassigned template 111A fulfills a given condition, the template 111-3 and the template configuration information 112-3 are generated and stored in the storage 123 (1906).
The automatic ID assigning module 1081 uses the parser 1082 of the stream data processing engine 103 to parse the ID-unassigned template 111A, and generates an operator tree (2002).
In Step 2003 of
In Step 2005, the automatic ID assigning module 1081 adds an Id column to the SELECT statement of every query definition in the ID-unassigned template 111A to generate the template 111-3. The template 111-3 of
In Step 2006, the automatic ID assigning module 1081 generates the template configuration information 112-3 by registering an Id in the field for the ID 1124 of the partial template configuration information 112A.
The automatic ID assigning module 1081 generates the template 111-3 and the template configuration information 112-3 through the processing described above, and then ends the processing.
The window size calculating module 1083 determines whether or not the query definitions of the template 111-3 include a query definition in which the SELECT statement includes a column corresponding to an ID and the stream operations include RSTREAM or DSTREAM (2102). In other words, the window size calculating module 1083 excludes query definitions that use RSTREAM or DSTREAM, which lead to a delay in output stream data, so that the ID assigned in the template 111-3 can be traced accurately. The window size calculating module 1083 proceeds to Step 2104 when the operations of the template 111-3 cause a delay, and to Step 2103 when a delay is not caused.
The window size calculating module 1083 next analyzes the template 111-3 to determine whether or not the template 111-3 has a query definition in which the SELECT statement includes a column corresponding to an ID and JOIN is included (2103). In other words, the window size calculating module 1083 excludes a query definition that includes JOIN, because such a query definition raises the problem of which ID to select from among the IDs of the pieces of data to be joined. The window size calculating module 1083 proceeds to Step 2104 when a query definition that includes JOIN is found, and otherwise proceeds to Step 2105.
In Step 2105, the window size calculating module 1083 sets the combining window size 1125 in the template configuration information 112-3 to “NOW”.
In Step 2104, the window size calculating module 1083 sends to the terminal 130 an error message to the effect that the window size to be used in the combining cannot be determined, and terminates the processing.
Through the processing described above, the window size in the combining is set to “NOW” when the template 111-3 fulfills a given condition, and the determined window size is set in the template configuration information 112-3 (2106).
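The decision of Steps 2102 to 2105 can be sketched as a small function. The embodiment analyzes an operator tree produced by the parser 1082; the sketch below substitutes a plain token scan over the query text, which is a deliberate simplification, and the function name and the default ID column name `id` are assumptions.

```python
import re
from typing import Iterable

DELAYING_OPERATORS = {"RSTREAM", "DSTREAM"}


def combining_window_size(query_definitions: Iterable[str], id_column: str = "id") -> str:
    """Return "NOW" when the ID can be traced, otherwise raise ValueError."""
    for query in query_definitions:
        # Crude stand-in for the operator tree: the set of identifiers in the text
        tokens = set(re.findall(r"[A-Za-z_]\w*", query.upper()))
        if id_column.upper() not in tokens:
            continue
        # Step 2102: RSTREAM or DSTREAM delays the output stream data
        if tokens & DELAYING_OPERATORS:
            raise ValueError("window size cannot be determined: delayed output")
        # Step 2103: JOIN makes it ambiguous which ID to carry forward
        if "JOIN" in tokens:
            raise ValueError("window size cannot be determined: JOIN on ID")
    # Step 2105: no obstacle found
    return "NOW"
```

Raising an exception here plays the role of the error message sent to the terminal 130 in Step 2104.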
As described above, the template 111-3 and the template configuration information 112-3 can be generated automatically from the ID-unassigned template 111A and the partial template configuration information 112A in which the window size is undetermined in the second embodiment and, accordingly, the work of a user or an administrator who operates the terminal 130 can be further reduced.
The query generating module 109 receives one of the stream processing definitions 500 and uses the template calling information generating module 202 to generate the template calling information 203 in the same manner as in the first embodiment. The query generating module 109 next uses the option column inserting module 205 to define a query for inserting the option column in the result of the processing of the template 111, and generates a stream processing query 700A. In the third embodiment, an ID(=strID) assigned to the essential column and the option column is used to determine a place where the option column is inserted.
The stream processing query 700A is similar to the stream processing query of
In 713A of
The template 111-2 of
The option column inserting module 205 first reads the stream processing definition 500, the template configuration information 112, and the template calling information 203 (2501 and 2502). The option column inserting module 205 determines whether or not the option column has been added to every template 111 written in the stream processing definition 500 (2503). The option column inserting module 205 ends the processing of
The option column inserting module 205 extracts the template 111 to which the option column has not been added (2504). The option column inserting module 205 executes the ID assigning query definition generating processing (ID assigning query definition generating module) described in the first embodiment with reference to
The option column inserting module 205 next executes the in-template query definition generating processing described in the first embodiment with reference to
The option column inserting module 205 next generates the definition of a query that has output stream data of the template 111 as an input and that removes, from the input stream data, an ID that is uniquely associated with the input stream data (an ID removing query). The column name of the ID is the ID in the template configuration information 112.
Through the processing described above, output stream data can be obtained in which the option column has been added to the essential column processed by the template 111.
In the fourth embodiment, the query generating module 109 can generate the definition of a query for keeping the option column for the duration of a given time window by taking into account a delay due to the processing of the template 111, and for sequentially combining output stream data that has undergone the processing of the template 111 with the option column.
In this embodiment, the query generating module 109 executes the functions and processing described in the first embodiment with reference to
In
Similarly, stream data processing and the window size are changed to DSTREAM and five minutes, respectively, in a query definition 716B of
The stream processing query 700B described above combines the output stream and option column of the processing of the template 111-4, which is a template “string_part_match_2m_delay”, in a two-minute window, combines the output stream and option column of the processing of the template 111-5, which is a template “string_match_5m_delay”, in a five-minute window, and outputs the resultant output streams.
Through the processing described above, a time required for processing in each template 111 is taken into account so that a delay in output stream can be tolerated.
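The effect of keeping the option column for the duration of a time window can be sketched with a small buffer. This is an illustration of the idea rather than the generated query: the class name `WindowedCombiner` is hypothetical, timestamps are assumed to be non-decreasing seconds supplied by the caller, and each identifier is assumed to be matched at most once.

```python
from collections import OrderedDict
from typing import Dict, Optional

Record = Dict[str, object]


class WindowedCombiner:
    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        # text ID -> (arrival time, option column), kept in arrival order
        self.pending: "OrderedDict[int, tuple]" = OrderedDict()

    def _expire(self, now: float) -> None:
        # Drop option columns that have been waiting longer than the window
        while self.pending:
            _, (arrived, _) = next(iter(self.pending.items()))
            if now - arrived <= self.window:
                break
            self.pending.popitem(last=False)

    def add_option(self, text_id: int, option: Record, now: float) -> None:
        self._expire(now)
        self.pending[text_id] = (now, option)

    def add_output(self, str_id: int, output: Record, now: float) -> Optional[Record]:
        """Combine a possibly delayed template output with its option column."""
        self._expire(now)
        entry = self.pending.pop(str_id, None)
        if entry is None:
            return None  # the option column has already left the window
        return {**entry[1], **output}
```

With a 120-second window, a template output that arrives 100 seconds after its option column is still combined, whereas one that arrives 200 seconds later finds nothing to combine with.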
The computers, processing units, and processing means described in relation to this invention may be implemented, in part or in whole, by dedicated hardware.
The variety of software exemplified in the embodiments can be stored in various media (for example, non-transitory storage media), such as electromagnetic media, electronic media, and optical media, and can be downloaded to a computer through a communication network such as the Internet.
This invention is not limited to the foregoing embodiments but includes various modifications. For example, the foregoing embodiments have been described in detail so that this invention can be understood easily, and this invention is not necessarily limited to configurations that include all the described elements.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2013/084690 | 12/25/2013 | WO | 00 |