This non-provisional application claims priority under 35 U.S.C. §119(a) on Patent Application No. 103142111 filed in Taiwan, R.O.C. on Dec. 4, 2014, the entire contents of which are hereby incorporated by reference.
The technical field of the invention relates to big data, and more particularly to a system and method for providing an instant query.
The advancement of electronic technologies has led to explosive growth of data, which makes it increasingly difficult for users to perform instant queries of the data across a wide range of applications. Conventional information technology (IT) systems are usually unable to provide instant query services, owing to the multi-tier design of the system architecture or the limited access speed of a storage device such as a hard disk.
As shown in
More and more applications, e.g. connection confirmations or alerts of factory problems, require instant queries of the data streams. Given the circumstances, if the data are stored in a database or storage system that uses a hard disk as its medium 107, the access speed becomes a troublesome bottleneck. Take the semiconductor industry as an example: any warning or alert triggered by the analysis of the data streams collected during a manufacturing process should be handled in a real-time manner. It is therefore highly desirable for users to perform an instant query of the data streams through a server for query 101 and quickly take a corresponding action. Most IT systems nowadays fail to provide such an instant query of the data streams.
The aforementioned approach involves multiple tiers of a physical structuring mechanism for the system infrastructure, so the data transmission between tiers is time-consuming. More specifically, the continuously-generated data streams 231, 233, 235 and 237 are stored in the equipment 211, 213, 215 and 217, respectively. Subsequently, the data streams 231, 233, 235 and 237 are transmitted to the network storage device 25, the cluster 27 and the data warehouse 28. This multi-tier infrastructure design and the use of hard disks as the storage medium lead to ineffective queries: users fail to receive the latest responses to their queries of the data streams within tolerable windows, e.g. from one second up to a few seconds.
In view of the problems stated above, a primary objective of the invention is to provide a system and method with an infrastructure design that enables an application to perform instant queries of continuously-generated data streams.
In an exemplary embodiment of this disclosure, a system for an instant query is disclosed. The system comprises a dispatcher, a data processor and a storage system. The dispatcher receives data streams from multiple machines and transmits the data streams to a network storage device which creates a backup of the data streams. On the other hand, the dispatcher creates a replica of the data streams and transmits the replica to the data processor. The data processor processes the replica according to predetermined rules and an output is generated in consequence. The output is transmitted to and stored in the storage system and is provided for an application to perform the instant query via an interface of the information technology system.
The data streams mentioned above can be logs that are continuously generated by the machines. Instead of storing the logs, the machines transmit the logs to the dispatcher through a communication protocol. The dispatcher creates two copies of the data streams, of which one copy is transmitted to the storage system and the other copy is transmitted to the data processor. The data processor, which comprises at least one data refiner, obtains a subset of the copy of the data streams according to a predetermined rule. The data processor processes the received copy of the data streams, and an output is derived therefrom. Thereafter, the output is transmitted back to the dispatcher. The dispatcher filters specific attributes of the data streams and, based thereon, forwards the output to a second data processor, which processes the output to generate a second output. The second output is transmitted to and stored in the storage system.
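As a minimal sketch of the duplicate-and-forward behavior described above, the dispatcher can be modeled as follows. The class name, sink names and record format are illustrative assumptions, not part of the disclosure.

```python
# Sketch of the dispatcher's two-copy behavior: one copy goes to the
# storage system for archiving, one replica to the data processor for
# instant queries. Names here are illustrative assumptions.
import copy

class Dispatcher:
    def __init__(self, storage_sink, processor_sink):
        self.storage_sink = storage_sink      # backup/archival path
        self.processor_sink = processor_sink  # instant-query path

    def dispatch(self, record):
        # First copy: archived in the storage system.
        self.storage_sink.append(copy.deepcopy(record))
        # Second copy: forwarded to the data processor.
        self.processor_sink.append(copy.deepcopy(record))

storage, processor = [], []
d = Dispatcher(storage, processor)
d.dispatch({"machine": "M1", "temp": 71})
```

Deep copies are used so that downstream processing of one copy cannot mutate the other.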
The storage system can be a non-relational database. The dispatcher and the data processor can be program modules installed in dedicated hardware components such as a master node and a worker node of a cluster, respectively.
The present invention also discloses a method for an instant query of continuously-generated data streams with the properties of high volume, high variety and high velocity. With the disclosed invention, instant queries that open new application possibilities can be carried out while the traditional procedures of data storage, archiving and querying remain in place.
The communication protocol mentioned above can be FTP, Syslog or any other protocol capable of transmitting the data streams generated by the machines 311, 312, 313 and 314 to the dispatcher 32. To facilitate the data transmission, some courses of action can be carried out beforehand, including but not limited to the setting of IP addresses, accounts and passwords for the machines 311, 312, 313 and 314, and the installation of one or more software programs on the machines 311, 312, 313 and 314. The machines 311, 312, 313 and 314 can either actively and continuously transmit the data streams to the dispatcher 32, or transmit the data streams after receiving a request from the dispatcher 32.
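In the spirit of the Syslog option above, a machine pushing a log line to the dispatcher over UDP might look like the following sketch. The port number and message format are assumptions for illustration; production Syslog follows its own message conventions.

```python
# Hedged sketch: a machine actively pushing one log line to the
# dispatcher over a UDP datagram, loosely in the style of Syslog.
# Host, port and payload format are illustrative assumptions.
import socket

def send_log(line, host="127.0.0.1", port=5514):
    """Send one log line as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(line.encode("utf-8"), (host, port))

# Dispatcher side: bind a socket and receive one datagram.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 5514))
recv.settimeout(2)
send_log("machine=311 status=OK")
data, _ = recv.recvfrom(4096)
recv.close()
print(data.decode("utf-8"))
```

UDP is shown because fire-and-forget delivery matches machines that do not store their own logs; FTP-based transfer would instead batch files.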
The dispatcher 32 provides filtering and forwarding functions, and is executed on a master node of a cluster (not shown). In an embodiment of the present invention, the cluster comprises the master node and at least two worker nodes, wherein the dispatcher 32 is loaded into a memory of the master node and then executed. The dispatcher 32 can either allocate a buffer capacity for storing the data streams or forward the received data streams in real time. To ensure the availability and stability of the dispatcher 32, the physical structuring of the master node can be designed to provide fault tolerance, redundancy and load balancing.
After receiving the aforementioned data streams, the dispatcher 32 replicates a first copy of the data streams and transmits the first copy to a network storage device 36 for data pre-processing, including data extraction, transformation and loading. After the pre-processing, the first copy of the data streams is transmitted to a data warehouse 391 and then accessed by an application server 392.
The dispatcher 32 can create a second copy of the data streams and transmit the second copy to a data processor 34. In an embodiment of the present invention, the data processor 34 is loaded into a memory of the worker nodes and executed.
The data processor 34 processes the second copy of the data streams based on a predetermined rule. For example, the predetermined rule may require the data processor 34 to obtain a subset of the second copy of the data streams, such as 5 specific columns selected out of 20 columns. According to the predetermined rule, the data processor 34 processes the second copy of the data streams and thus acquires a processed output (not shown). More plugin functions can be added to the data processor 34 to fulfill user needs.
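The column-subset rule in the example above can be sketched as follows. The column names are illustrative assumptions; the disclosure only specifies that some 5 of 20 columns are kept.

```python
# Sketch of a predetermined rule: keep 5 named columns out of a wider
# record. The column names are illustrative assumptions.
RULE_COLUMNS = ["machine_id", "timestamp", "sensor", "value", "status"]

def refine(record, columns=RULE_COLUMNS):
    """Return the subset of the record selected by the rule."""
    return {k: record[k] for k in columns if k in record}

# A 20-column record: 15 filler columns plus the 5 the rule keeps.
raw = {f"col{i}": i for i in range(15)}
raw.update({"machine_id": "M311", "timestamp": 1417651200,
            "sensor": "temp", "value": 71.5, "status": "OK"})
print(refine(raw))
```

Additional "plugin functions" in the sense of the paragraph above would simply be further callables applied to the record after `refine`.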
The processed output is transmitted to a non-relational database 35 such as a NoSQL database. In an embodiment, the non-relational database 35 is designed to run in the memory of the aforementioned worker node. The processed output in the non-relational database 35 is provided for a third-party application server 37 to carry out an instant query. It is worth noting that the dispatcher 32 and the data processor 34 can handle the aforementioned data streams in real time and the processed output is directly transmitted to the non-relational database 35, none of which involves any time-consuming step such as writing the data streams to a hard disk. In addition, the architecture disclosed herein is flatter than the prior art described in
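The instant-query path, in which the processed output stays in memory and is never written to a hard disk, can be sketched with a toy in-memory key-value store standing in for the non-relational database 35. The store and query APIs are illustrative assumptions, not the API of any particular NoSQL product.

```python
# Toy in-memory store standing in for the non-relational database 35:
# puts and queries operate purely on memory, so the query path never
# touches a hard disk. The API is an illustrative assumption.
class InMemoryStore:
    def __init__(self):
        self._rows = {}          # all data lives in memory

    def put(self, key, row):
        self._rows[key] = row

    def query(self, predicate):
        """Instant query: scan in-memory rows with a predicate."""
        return [r for r in self._rows.values() if predicate(r)]

db = InMemoryStore()
db.put("M311:1", {"machine": "M311", "status": "ALERT"})
db.put("M312:1", {"machine": "M312", "status": "OK"})
alerts = db.query(lambda r: r["status"] == "ALERT")
print(alerts)
```

A third-party application server in the sense of element 37 would issue such predicate queries over an interface of the IT system.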
In one embodiment, the first refiner 451 processes a part or the whole of the fourth copy of the data stream 42 received from the dispatcher 44 according to predetermined rules and consequently generates a first data output (not shown), which is subsequently transmitted to a non-relational database 46 such as a NoSQL database. In another embodiment, the first data output is transmitted back to the dispatcher 44, which filters a second set of specific attributes of the first data output and, based thereon, forwards a part or the whole of the first data output to the second refiner 452 for further processing according to the rules. After being further processed by the second refiner 452, the first data output becomes a second data output (not shown) and is transmitted to the non-relational database 46. In still another embodiment, the second data output is transmitted back to the dispatcher 44, which filters a third set of specific attributes of the second data output and, based thereon, forwards a part or the whole of the second data output to the third refiner 453 for further processing, which turns the second data output into a third data output. The third data output is transmitted to the non-relational database 46. The processes described herein are exemplary and are not meant to limit the scope of the present invention. Any addition, deletion, combination or change of the data processors or data dissemination is within the scope of the invention. In one embodiment the third refiner 453 executes a function different from that of the first refiner 451, while in another embodiment the third refiner 453 serves as a redundant element of the first refiner 451 so as to provide availability.
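The refiner chain above, in which each stage's output returns to the dispatcher and is forwarded onward only if an attribute filter admits it, can be sketched as follows. The refiner functions, attribute names and predicates are illustrative assumptions.

```python
# Sketch of the refiner chain: the dispatcher applies an attribute
# filter before forwarding a record to each successive refiner.
# Refiner logic and attribute names are illustrative assumptions.

def first_refiner(rec):
    # e.g. normalize a numeric field
    return {**rec, "value": round(rec["value"], 1)}

def second_refiner(rec):
    # e.g. tag records that should raise an alert
    return {**rec, "alert": rec["value"] > 70.0}

def run_chain(rec, stages, predicates):
    """Forward rec through each stage whose predicate admits it."""
    for stage, admit in zip(stages, predicates):
        if admit(rec):
            rec = stage(rec)
    return rec

out = run_chain(
    {"machine": "M1", "value": 71.456},
    stages=[first_refiner, second_refiner],
    predicates=[lambda r: "value" in r,           # second attribute set
                lambda r: r["machine"] == "M1"],  # third attribute set
)
print(out)
```

A redundant refiner, as in the last embodiment above, would simply be a second stage registered with the same function as an earlier one.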
The dispatcher 44 is executed on a master node of a cluster (not shown). In an embodiment, the cluster comprises the master node and at least two worker nodes. The dispatcher 44 is loaded into a memory of the master node and executed, while the data processors, i.e. the first refiner 451, the second refiner 452 and the third refiner 453, are loaded into memories of the worker nodes and then executed.
The system for storing and processing the data streams 51 can be replaced with a big data platform, such as a platform that runs the Hadoop framework, which stores and processes large data sets in a fully distributed mode. In an embodiment, the data streams are stored in an HDFS file system 531 that is highly fault-tolerant, pre-processed by MapReduce 532 and then stored in HIVE or Impala 533, which acts as a data warehouse. The data streams stored in HIVE or Impala 533 are provided for queries (not shown) and/or presented in charts, in tables, on dashboards 534 or on websites, for example.
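The MapReduce pre-processing step can be illustrated with a toy, in-process version of the programming model. This is a sketch of the map/shuffle/reduce idea only, not the Hadoop framework itself; the per-machine line count is an assumed example workload.

```python
# Toy in-process illustration of the MapReduce programming model:
# count log lines per machine. Not the Hadoop framework itself.
from collections import defaultdict

def map_phase(lines):
    # Map: emit (machine, 1) for each log line.
    for line in lines:
        machine, _, _ = line.partition(" ")
        yield machine, 1

def reduce_phase(pairs):
    # Shuffle + reduce: sum counts per machine key.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

logs = ["M311 ok", "M312 warn", "M311 ok", "M311 alert"]
print(reduce_phase(map_phase(logs)))
```

On an actual Hadoop cluster the map and reduce phases would run distributed over HDFS blocks, with the framework performing the shuffle between them.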
Steps [S604] to [S606] are disclosed in detail. After receiving the second replica from the master node, the worker node processes the second replica according to the predetermined rules and generates the first output, which is provided for the external application to carry out the instant query. The processing of the second replica is executed in a memory of the worker node. For the purpose of further processing, the first output is transmitted back to the master node. The master node then filters a fifth set of specific attributes of the first output and, based thereon, forwards a part or the whole of the first output to the worker node. The worker node then further processes the part or the whole of the first output, and a second output is generated in consequence. For the purpose of still further processing, the second output is transmitted back to the master node. The master node then filters a sixth set of specific attributes of the second output and, based thereon, forwards a part or the whole of the second output to the worker node. The worker node carries out the further processing of the part or the whole of the second output, and as a consequence a third output is generated. The steps described herein are exemplary and are not intended to limit the present invention.
Although a variety of examples and other information was used to explain aspects within the scope of the appended claims, no limitation of the claims should be implied based on particular features or arrangements in such examples, as one of ordinary skill would be able to use these examples to derive a wide variety of implementations. Further and although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, such functionality can be distributed differently or performed in components other than those identified herein. Rather, the described features and steps are disclosed as examples of components of systems and methods within the scope of the appended claims.
Number | Date | Country | Kind
---|---|---|---
103142111 | Dec 2014 | TW | national