1. Field of the Invention
The present invention relates to a monitoring system and a monitoring method, and in particular to a monitoring system and a monitoring method that prevent the monitoring mechanism from failing when a single server or a single database in a cloud data center fails.
2. Description of Related Art
Generally speaking, a cloud data center is equipped with various kinds of hosts, such as Physical Machines (PMs), Virtual Machines (VMs), switches, routers, uninterruptible power supplies (UPSs), firewalls etc., each processing different data.
In order to manage and monitor the status of the data center with ease, administrators typically install hardware or software sensors in each host for monitoring all kinds of host data, for example, temperature, humidity, fan rate, CPU, memory, network status, hardware capacity etc. The detected data is periodically reported and saved in a database of the data center. The administrators then access the database to monitor all kinds of host data in the data center.
Currently, a data center typically connects to every host via a single monitoring server and a single database. Each host detects its own host data, the single monitoring server monitors the host data, and the single database saves the host data. However, each host is required to detect its own data continuously, periodically report the data to the monitoring server, and save the data in the database. Accordingly, when the quantity of hosts in the cloud data center is large, the report frequency is too high, or the data traffic reported at the same time is too large, the monitoring server or the database may be overloaded, which results in data loss. Moreover, because a cloud data center typically installs only a single monitoring server and a single database, the monitoring mechanism of the cloud data center fails when that monitoring server or database is damaged.
Further, when the quantity of hosts in the cloud data center is large, the storage space of the database may become insufficient, and administrators have to add database capacity on an ad hoc basis, which is inconvenient.
The objective of the present invention is to provide a system for managing and monitoring cloud hosts and a method thereof. A plurality of distributed monitoring servers are used for respectively monitoring, saving and processing corresponding data, so as to ensure that when a single server or a single database is damaged, the monitoring mechanism of the cloud data center does not fail.
In order to achieve the above objective, the present invention provides a monitoring system comprising cloud hosts and a plurality of monitoring servers. Each monitoring server is used for processing data of a different category. Each cloud host detects its own host status to generate a plurality of status data, each recording data of a different category. Next, the cloud hosts transfer the status data of each category to the corresponding monitoring server. The plurality of monitoring servers save the status data of the cloud hosts by category, and respectively execute subsequent processing steps.
Compared with the related art, the present invention offers the advantage that a plurality of monitoring servers are allocated according to a predetermined rule of a cloud data center. Each monitoring server respectively monitors, saves and processes the data of a different category of the cloud hosts, such as CPU, hard drive, memory, traffic etc. Traditionally, a single server has to monitor and process all data of all cloud hosts, which places too heavy a load on the server. The present invention thereby solves this problem of the traditional single server.
Further, traditional cloud data centers save all data of all cloud hosts via a single database. Accordingly, when there are too many cloud hosts, the storage space of the database may be insufficient, and the capacity has to be upgraded. The present invention allows each monitoring server to play the role of a database. In other words, the quantity of databases equals the quantity of monitoring servers, which effectively resolves the insufficient storage space problem of a single database.
The system of the present invention respectively monitors, saves and processes data of corresponding categories via multiple monitoring servers. As a result, when a monitoring server is damaged, operation of the other monitoring servers is not affected. The system only needs to establish a new monitoring server, or direct the cloud hosts to back-up monitoring servers. With this technical solution, the impact on the cloud data center when a monitoring server fails is reduced. Also, each monitoring server is informed of which data categories are assigned to the other monitoring servers. Therefore, when a user inquires about specific data of the cloud hosts, the inquiry can be served effectively even though the monitoring servers are distributed.
The features of the invention believed to be novel are set forth with particularity in the appended claims. The invention itself, however, may be best understood by reference to the following detailed description of the invention, which describes an exemplary embodiment of the invention, taken in conjunction with the accompanying drawings, in which:
Embodiments are provided in the following in order to further detail the implementations of the present invention described in the summary. It should be noted that the proportions, dimensions, deformations, displacements and details of the objects shown in the diagrams of the embodiments are examples, and the present invention is not limited thereto. Identical components in the embodiments are given the same component numbers.
The host 1 and the monitoring server 2 are each regarded as a node in the cloud data center, and may be implemented with a Physical Machine (PM) or a Virtual Machine (VM), but are not limited thereto. Further, the monitoring system may assign the role of the monitoring server 2 to any one or more nodes. Accordingly, when VMs are used, the same PM may act as both the host 1 and the monitoring server 2. In other words, the host 1 and the monitoring server 2 are not required to be separate PMs, and neither needs to exist alone on a PM. A PM may act in multiple roles, and accordingly the system is flexible in operation.
For example, the first monitoring server 201 is used for monitoring the CPU data of the host 1, the second monitoring server 202 is used for monitoring the hard drive data of the host 1, the third monitoring server 203 is used for monitoring the network traffic of the host 1, and so on. Thus, if the cloud data center has a thousand hosts, the CPU data of the thousand hosts is monitored by the first monitoring server 201, the hard drive data is monitored by the second monitoring server 202, and the network traffic data is monitored by the third monitoring server 203.
In addition, the monitoring system can further subdivide the categories of the data of the host 1 across a larger quantity of monitoring servers 2. For example, the first monitoring server 201 monitors CPU usage, the second monitoring server 202 monitors CPU temperature, and the third monitoring server 203 monitors CPU fan rates etc. The three monitoring servers 201-203 collectively monitor the CPU data of the host 1. Nonetheless, the above is merely a preferred example of the present invention, and the present invention should not be limited thereto.
As shown in
The allocation data received by the host 1 comprises a distributed hash table (the distributed hash table T1 as shown in
As shown in
The first transferring unit 13 is used for connecting to the plurality of monitoring servers 2, and transferring the status data I1, by category, to the corresponding monitoring servers 2. The host storage pool 14 is used for temporarily saving the status data I1 detected by the sensor unit 12. As mentioned above, the host 1 further saves the distributed hash table T1 internally. The distributed hash table T1 records which category of the status data I1 each of the plurality of monitoring servers 2 corresponds to. Thus, when the host 1 transfers the status data I1, the host 1 references the distributed hash table T1 and correctly transfers the status data I1 to the corresponding monitoring servers 2, which enables the plurality of monitoring servers 2 to save the status data I1 by category. In addition, the host 1 processes the status data I1 according to the predetermined rule.
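As a non-limiting illustration, the lookup performed by the host 1 when transferring the status data I1 may be sketched as follows. The sketch models the distributed hash table T1 as a plain mapping from category to monitoring server. The category names, the server identifiers and the function `route_status_data` are hypothetical and are not defined by the embodiment.

```python
from collections import defaultdict

# Hypothetical contents of the distributed hash table T1:
# each data category maps to the monitoring server responsible for it.
T1 = {
    "cpu": "monitoring-server-201",
    "hard_drive": "monitoring-server-202",
    "network": "monitoring-server-203",
}


def route_status_data(table, status_data):
    """Group status data by the monitoring server responsible for its category.

    `status_data` is a list of (category, value) pairs. Items whose category
    is absent from the table are returned separately rather than dropped.
    """
    outbound = defaultdict(list)
    unrouted = []
    for category, value in status_data:
        server = table.get(category)
        if server is None:
            unrouted.append((category, value))
        else:
            outbound[server].append((category, value))
    return dict(outbound), unrouted


readings = [("cpu", 42.0), ("hard_drive", 0.81), ("network", 120.5), ("cpu", 43.5)]
outbound, unrouted = route_status_data(T1, readings)
```

In this sketch, both CPU readings are grouped under the same server entry, so one host transfer per monitoring server would suffice for each reporting period.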
Specifically, when one of the plurality of monitoring servers 2 fails, the host 1 temporarily saves the status data I1 of the categories corresponding to the failed monitoring server 2 in the local database 142. For example, if the first monitoring server 201 is used for saving CPU related data, when the first monitoring server 201 fails, the host 1 transfers the status data I1 not related to CPU, by category, to the corresponding monitoring servers 2, while the CPU data is temporarily saved in the local database 142. When the first monitoring server 201 is restored, the host 1 transfers the data temporarily saved in the local database 142 to the first monitoring server 201. Thus, when any of the plurality of monitoring servers 2 fails, loss of the status data I1 of the host 1 is avoided.
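As a non-limiting illustration, the temporary saving and later transfer described above may be sketched as follows. The class `HostReporter`, the `send` callable, its boolean return value used to signal a failed monitoring server, and the server names are all hypothetical. The embodiment does not specify how the host 1 detects the failure.

```python
class HostReporter:
    """Hypothetical model of the host 1 buffering data for a failed server."""

    def __init__(self, table, send):
        self.table = table          # category -> monitoring server (stands in for T1)
        self.send = send            # callable(server, category, value) -> bool
        self.local_database = []    # stands in for the local database 142

    def report(self, category, value):
        server = self.table[category]
        if not self.send(server, category, value):
            # Delivery failed: keep the item locally instead of losing it.
            self.local_database.append((server, category, value))

    def flush(self):
        """Retry buffered items; keep those that still cannot be delivered."""
        pending, self.local_database = self.local_database, []
        for server, category, value in pending:
            if not self.send(server, category, value):
                self.local_database.append((server, category, value))


# Usage with an in-memory stand-in for the network.
down = {"server-201"}
delivered = []


def fake_send(server, category, value):
    if server in down:
        return False
    delivered.append((server, category, value))
    return True


host = HostReporter({"cpu": "server-201", "network": "server-203"}, fake_send)
host.report("cpu", 55)       # server-201 is down, so this item is buffered
host.report("network", 120)  # delivered immediately
down.clear()                 # server-201 is restored
host.flush()                 # the buffered CPU data is now transferred
```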
The second control unit 21 is used for processing the internal data of the monitoring server 2. The second transferring unit 23 is used for connecting to the host 1, and receiving the status data I1 of the corresponding categories transferred by the host 1. The database 22 is used for saving the status data I1 received by the second transferring unit 23. Thus, the monitoring system does not require additional databases for saving the data of the host 1, because the plurality of monitoring servers 2 serve as multiple databases.
It should be noted that each of the plurality of monitoring servers 2 has a distributed hash table T2, which has the same content as the distributed hash table T1 in the host 1. As mentioned above, the distributed hash table T2 records the categories to which the plurality of monitoring servers 2 respectively correspond, so each monitoring server 2 learns the data categories of the other monitoring servers 2 by consulting the distributed hash table T2. Thus, when any monitoring server 2 receives an external inquiring request, the monitoring server 2 determines which monitoring server 2 has the data targeted by the request by consulting the distributed hash table T2. Accordingly, although the present invention monitors, saves and processes the multiple status data I1 of the host 1 via a distributed architecture, data-not-found issues are avoided.
The analyzing unit 24 is used for analyzing the status data I1 saved in the database 22 to determine whether the host 1 has abnormal events. Specifically, the analyzing unit 24 determines whether an abnormal event of the corresponding categories has occurred at the host 1. For example, if the second monitoring server 202 is used for monitoring data related to the hard drive, the analyzing unit 24 of the second monitoring server 202 analyzes the hard drive data of the host 1, and determines whether the host 1 has issues such as insufficient hard drive capacity, hard drive sector failure or data loss.
In an embodiment, each monitoring server 2 sets a predetermined threshold value according to categories. In addition, the analyzing unit 24 determines that an abnormal event occurred to the host 1 when the status data I1 exceeds the predetermined threshold value. For example, the first monitoring server 201 monitors the CPU data, and sets the temperature threshold value of the CPU as 60° C. In the embodiment, when the status data I1 indicates that the CPU temperature of the host 1 exceeds 60° C., the first monitoring server 201 determines that an abnormal event occurred to the host 1. The above example is one of the preferred embodiments of the present invention and is not limited thereto.
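As a non-limiting illustration, the threshold comparison of this embodiment may be sketched as follows. The metric name `cpu_temperature_c` and the function `is_abnormal` are hypothetical; only the 60° C. value is taken from the example above.

```python
# Hypothetical threshold table for one monitoring server; the 60 degree
# CPU temperature value is the example given in the text.
THRESHOLDS = {"cpu_temperature_c": 60.0}


def is_abnormal(metric, value, thresholds=THRESHOLDS):
    """Return True when the value exceeds the threshold set for the metric.

    Metrics without a configured threshold are treated as normal; that
    default is an assumption of this sketch.
    """
    limit = thresholds.get(metric)
    return limit is not None and value > limit
```

Because the text states that the value must exceed the threshold, a reading of exactly 60° C. is treated as normal in this sketch.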
The informing unit 25 is used for executing an outbound informing procedure when the host 1 is determined to have an abnormal event. Specifically, each monitoring server 2 presets a predetermined rule which sets the informing procedures to execute in corresponding situations. For example, the predetermined rule sets that when the CPU temperature of the host 1 exceeds 60° C., an informing message is sent to the host 1 to instruct the host 1 to increase the fan rate. In addition, the predetermined rule sets that when the CPU temperature of the host 1 exceeds 70° C., another informing message is sent to the administrators of the monitoring system so that they visit the site and resolve the abnormal issue. Nonetheless, the above examples are preferred embodiments of the present invention, and the present invention is not limited thereto.
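As a non-limiting illustration, the predetermined rule of the above example may be sketched as a list of temperature limits and informing procedures. The action names and the function `select_actions` are hypothetical. The text does not state whether both procedures run above 70° C. or only the second one; the sketch assumes both.

```python
# Limits taken from the example in the text; the action names are hypothetical.
CPU_TEMPERATURE_RULES = [
    (60.0, "instruct_host_increase_fan_rate"),
    (70.0, "inform_administrators"),
]


def select_actions(temperature_c, rules=CPU_TEMPERATURE_RULES):
    """Return every informing procedure whose limit is exceeded, in rule order."""
    return [action for limit, action in rules if temperature_c > limit]
```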
When the first monitoring server 201 accepts the registration of the host 1, the host 1 receives related allocation data from the first monitoring server 201 (step S24). The allocation data includes the distributed hash table T1. After the step S24, the host 1 learns from the distributed hash table T1 which categories the plurality of monitoring servers 2 respectively correspond to. Accordingly, the host 1 does not need to register separately at the other monitoring servers 2.
Next, the host 1 detects the host status via the internal sensor unit 12 and generates a plurality of the status data I1 according to the detecting results. The plurality of the status data I1 respectively record the data of different categories (step S26). Lastly, the host 1 references the distributed hash table T1 and transfers the status data I1, by category, to the corresponding monitoring servers 2 (step S28). It should be noted that until the host 1 is powered off (when operating as a PM) or deleted (when operating as a VM), the host 1 continues to detect its own status, generate the status data I1, and transfer the status data I1, by category, to the corresponding monitoring servers 2.
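As a non-limiting illustration, the flow of the steps S24, S26 and S28 may be sketched as a single loop. The function `run_host` and its four callable arguments are hypothetical stand-ins supplied by the caller; an actual host would also wait for a reporting period between cycles, which the sketch omits.

```python
def run_host(register, detect, transfer, is_running):
    """Register once, then detect and transfer until the host stops.

    All four arguments are hypothetical callables supplied by the caller:
    register() returns the category-to-server table (step S24),
    detect() returns a dict of category -> value (step S26),
    transfer(server, category, value) delivers one item (step S28),
    is_running() returns False once the host is powered off or deleted.
    """
    table = register()
    cycles = 0
    while is_running():
        for category, value in detect().items():
            server = table.get(category)
            if server is not None:
                transfer(server, category, value)
        cycles += 1
    return cycles


# Usage with in-memory stand-ins; the loop is stopped after two cycles.
sent = []
remaining = [True, True, False]
cycles = run_host(
    register=lambda: {"cpu": "server-201", "network": "server-203"},
    detect=lambda: {"cpu": 40, "network": 100},
    transfer=lambda server, category, value: sent.append((server, category, value)),
    is_running=lambda: remaining.pop(0),
)
```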
Specifically, each monitoring server 2 internally sets the predetermined threshold value for the categories it is responsible for, and analyzes whether the status data I1 exceeds the predetermined threshold value (step S36). When the predetermined threshold value is exceeded, an abnormal event is determined to have occurred at the host 1. If no abnormal event is detected upon analysis, the method flow moves back to the step S30, and each monitoring server 2 continues to receive the status data I1 transferred from the host 1. If, however, an abnormal event has occurred at the host 1, the monitoring server 2 executes the outbound informing procedure according to the above-mentioned predetermined rules (step S38), for directly controlling the host 1 or informing the related administrators.
In the example of the third monitoring server 203, when the third monitoring server 203 receives an inquiring request, the third monitoring server 203 first determines whether it has the status data I1 of the specific categories (for example the CPU data mentioned above). If yes, the third monitoring server 203 directly replies to the inquiring request with the internally saved status data I1. If not, the third monitoring server 203 references the distributed hash table T2, and advises the API server 3 or the external terminals 4 to query the specific monitoring server 2 having the status data I1.
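As a non-limiting illustration, the handling of an inquiring request may be sketched as follows. The function `handle_inquiry`, the shape of its return value, and the server and category names are hypothetical. The sketch models the distributed hash table T2 as a plain mapping and the database 22 as a dictionary.

```python
def handle_inquiry(self_id, own_categories, table, store, category):
    """Answer an inquiry locally or point the caller at the right server.

    `table` stands in for the distributed hash table T2 (category -> server),
    and `store` for the database 22 (category -> saved status data).
    """
    if category in own_categories:
        return {"status": "ok", "server": self_id, "data": store.get(category, [])}
    owner = table.get(category)
    if owner is None:
        return {"status": "unknown_category", "category": category}
    return {"status": "redirect", "server": owner}


T2 = {"cpu": "server-201", "hard_drive": "server-202", "network": "server-203"}
store_203 = {"network": [120.5, 118.0]}

# The third server answers a network inquiry itself and redirects a CPU inquiry.
local = handle_inquiry("server-203", {"network"}, T2, store_203, "network")
redirect = handle_inquiry("server-203", {"network"}, T2, store_203, "cpu")
```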
Next, as shown in
In the previous embodiment, each monitoring server 2 is implemented by a respective node, and each unit in the node executes its own task. Nonetheless, if there are too many hosts 1 in the monitoring system, for example more than ten thousand or even hundreds of thousands of hosts, the risk of overloading still exists even though each single monitoring server 2 is responsible for monitoring, saving and processing the status data I1 of a single category. Thus, in another embodiment, the load of each monitoring server 2 is divided among and shared by multiple physical or virtual servers which collectively act as a single monitoring server 2, thereby reducing the load of each server.
The proxy server 51 is used for connecting to the host 1, and receiving the status data I1 of the corresponding categories transferred by the host 1. The proxy server 51 is the connecting interface between the monitoring server 5 and the host 1. The saving server 52 is used for saving the status data I1 received by the proxy server 51, and serves as the database of the monitoring server 5.
The analyzing server 53 has an algorithm and the above-mentioned predetermined threshold value, which are used for analyzing the status data I1 saved by the saving server 52 and determining whether an abnormal event has occurred at the host 1. Different analyzing servers 53 have different algorithms and predetermined threshold values. Accordingly, multiple analyzing servers 53 respectively analyze the status data I1 of the different categories of the host 1. The informing server 54 is used for executing the corresponding outbound informing procedure, according to the above-mentioned predetermined rule, when the host 1 is determined to have an abnormal event. For example, the host 1 is instructed to resolve the abnormal event, or administrators are informed to arrive onsite to investigate and resolve the event.
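As a non-limiting illustration, the division of one monitoring server 5 into four cooperating servers may be sketched as follows. The class names, the in-process method calls standing in for network communication, and the choice of having the proxy drive the other three servers synchronously are assumptions of the sketch and are not specified by the embodiment.

```python
class SavingServer:
    """Stands in for the saving server 52: keeps every received item."""

    def __init__(self):
        self.records = []

    def save(self, host_id, value):
        self.records.append((host_id, value))


class AnalyzingServer:
    """Stands in for the analyzing server 53: a simple threshold check."""

    def __init__(self, threshold):
        self.threshold = threshold

    def is_abnormal(self, value):
        return value > self.threshold


class InformingServer:
    """Stands in for the informing server 54: records outbound messages."""

    def __init__(self):
        self.messages = []

    def inform(self, host_id, value):
        self.messages.append(f"abnormal value {value} on {host_id}")


class ProxyServer:
    """Stands in for the proxy server 51: the entry point for the host."""

    def __init__(self, saving, analyzing, informing):
        self.saving = saving
        self.analyzing = analyzing
        self.informing = informing

    def receive(self, host_id, value):
        # Save first, then analyze, then inform only on an abnormal event.
        self.saving.save(host_id, value)
        if self.analyzing.is_abnormal(value):
            self.informing.inform(host_id, value)


saving, informing = SavingServer(), InformingServer()
proxy = ProxyServer(saving, AnalyzingServer(threshold=60.0), informing)
proxy.receive("host-1", 55.0)
proxy.receive("host-1", 72.5)
```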
Via the methods demonstrated in the above-mentioned embodiment, the burden on each server is further distributed. For example, if the status data I1 is divided into five categories, and each monitoring server 5 is collectively implemented by four servers, then there are twenty servers in the monitoring system monitoring, saving and processing the status data I1 of the host 1. Accordingly, no single server or database is damaged by overloading.
As the skilled person will appreciate, various changes and modifications can be made to the described embodiments. It is intended to include all such variations, modifications and equivalents which fall within the scope of the invention, as defined in the accompanying claims.
Number | Date | Country | Kind |
---|---|---|---|
101135838 | Sep 2012 | TW | national |