The present application claims priority to Chinese Patent Application No. 202410050110.2, filed Jan. 12, 2024, and entitled “Method, Device, and Computer Program Product for Monitoring Distributed System,” which is incorporated by reference herein in its entirety.
Embodiments of the present disclosure relate to the field of distributed computing, and more particularly, to a method, a device, and a computer program product for monitoring a distributed system.
Currently, as datasets continue to grow in scale, distributed systems have been widely applied due to their ability to process large amounts of data. Analyzing the performance indexes of a distributed system is an effective way to optimize distributed computing work while the system is in use.
Typically, the performance indexes of the distributed systems include job level performance indexes, system level performance indexes, and micro-architecture level performance indexes. The job level performance indexes can be obtained by monitoring specified computing tasks of the distributed systems, and most distributed computing frameworks can provide monitoring for the job level performance indexes. In addition, the system level performance indexes can also be obtained by monitoring the distributed systems through cluster resource management services.
Embodiments of the present disclosure provide a method, a device, and a computer program product for monitoring a distributed system. In a first aspect of embodiments of the present disclosure, a method for monitoring a distributed system is provided. The method includes starting a distributed computing service on a first node in response to receiving a distributed task. The method further includes sending a request for registering an application manager from the first node to a second node. The method further includes receiving an application identification of the application manager from the second node by the first node. The method further includes determining process identifications of containers in a cluster resource management layer by the first node based on the received application identification. The method further includes monitoring performance indexes of the distributed system based on the process identifications.
In a second aspect of embodiments of the present disclosure, an electronic device is provided. The electronic device includes at least one processor, and a memory coupled to the at least one processor and having instructions stored thereon, wherein the instructions, when executed by the at least one processor, cause the electronic device to perform actions for monitoring a distributed system. The actions comprise starting a distributed computing service on a first node in response to receiving a distributed task, sending a request for registering an application manager from the first node to a second node, receiving an application identification of the application manager from the second node by the first node, determining process identifications of containers in a cluster resource management layer by the first node based on the received application identification, and monitoring performance indexes of the distributed system based on the process identifications.
In a third aspect of embodiments of the present disclosure, a computer program product is provided, wherein the computer program product is tangibly stored on a non-transitory computer-readable medium and comprises machine-executable instructions that, when executed by a machine, cause the machine to perform actions for monitoring a distributed system. The actions comprise starting a distributed computing service on a first node in response to receiving a distributed task, sending a request for registering an application manager from the first node to a second node, receiving an application identification of the application manager from the second node by the first node, determining process identifications of containers in a cluster resource management layer by the first node based on the received application identification, and monitoring performance indexes of the distributed system based on the process identifications.
It should be understood that the content described in this Summary is neither intended to identify key or essential features of embodiments of the present disclosure, nor intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent in conjunction with the accompanying drawings and with reference to the following Detailed Description. In the accompanying drawings, identical or similar reference numerals represent identical or similar elements, in which:
Illustrative embodiments of the present disclosure will be described below in further detail with reference to the accompanying drawings. Although the accompanying drawings show some embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms, and should not be construed as being limited to the embodiments stated herein. Rather, these embodiments are provided for understanding the present disclosure more thoroughly and completely. It should be understood that the accompanying drawings and embodiments of the present disclosure are for exemplary purposes only, and are not intended to limit the scope of protection of the present disclosure.
In the description of the embodiments of the present disclosure, the term “include” and similar terms thereof should be understood as open-ended inclusion, i.e., “including but not limited to.” The term “based on” should be understood as “based at least in part on.” The term “an embodiment” or “the embodiment” should be understood as “at least one embodiment.” The terms “first,” “second,” and the like may refer to different or identical objects, unless explicitly illustrated. Other explicit and implicit definitions may also be included below.
As mentioned above, the job level performance indexes are indexes that directly measure the performance of a specified computing task. In contrast, the system level performance indexes describe general system performance while specific tasks run, such as memory usage, I/O speed, and network throughput. The micro-architecture level performance indexes describe the behavior of the micro-architecture, that is, the hardware implementation of an ISA (Instruction Set Architecture). Compared with the system level performance indexes, the micro-architecture level performance indexes sit at a lower level in a computer architecture. Some common micro-architecture level performance indexes include IPC (instructions per cycle) and OCBW (off-chip bandwidth utilization).
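As a purely illustrative sketch of the two micro-architecture level indexes named above, the following Python code derives IPC and an off-chip bandwidth utilization figure from raw counter readings. The `CounterSample` type, its field names, and the definition of utilization as achieved bandwidth divided by an assumed peak bandwidth are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Hypothetical raw counter readings for one monitored process."""
    instructions: int      # instructions retired in the sampling window
    cycles: int            # CPU cycles elapsed in the same window
    offchip_bytes: int     # bytes moved to or from off-chip memory
    window_seconds: float  # length of the sampling window


def ipc(sample: CounterSample) -> float:
    """Instructions per cycle; 0.0 when no cycles were counted."""
    return sample.instructions / sample.cycles if sample.cycles else 0.0


def offchip_bandwidth_utilization(sample: CounterSample,
                                  peak_bytes_per_second: float) -> float:
    """Achieved off-chip bandwidth as a fraction of an assumed peak."""
    if sample.window_seconds <= 0 or peak_bytes_per_second <= 0:
        return 0.0
    achieved = sample.offchip_bytes / sample.window_seconds
    return achieved / peak_bytes_per_second


sample = CounterSample(instructions=3_000_000, cycles=2_000_000,
                       offchip_bytes=512_000_000, window_seconds=1.0)
print(ipc(sample))                                             # 1.5
print(offchip_bandwidth_utilization(sample, 2_048_000_000.0))  # 0.25
```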
At present, most distributed systems provide job level index monitoring, and there are also some performance analyzers that can simultaneously monitor the system level performance indexes and the micro-architecture level performance indexes. However, ordinary performance analyzers usually do not work for distributed systems. Generally, a system performance analyzer can only be used on a single node and can only monitor a single process. Specifically, when an existing system performance analyzer is used, a task allocator in a distributed system is usually started first so that the task allocator can be monitored. However, because the system performance analyzer is limited to a single process on a single node, it cannot go on to monitor the task executors distributed on different nodes after monitoring the task allocator. Therefore, so far, system performance analyzers have not been well integrated with distributed systems, and there is no comprehensive monitoring method that can simultaneously monitor all three levels of performance indexes of the distributed system mentioned above.
For this purpose, embodiments of the present disclosure provide a solution for monitoring a distributed system. In this solution, a distributed computing service is started on a first node so that the distributed computing service can be monitored. By monitoring the distributed computing service, the first node receives an application identification of an application manager from a second node. The first node may then determine process identifications of containers in a cluster resource management layer according to the application identification. In this way, various nodes in the distributed system can communicate, containers distributed on the nodes are located and monitored in time, the performance indexes of the distributed system are thoroughly monitored at two levels, namely a system level and a micro-architecture level, and the efficiency of optimizing the computing work is improved.
Referring to
In some embodiments, a monitoring process of the distributed system in the present disclosure is as follows. At 123, the distributed computing performance monitoring layer 115 receives a distributed computing task 114. At 118, after receiving the distributed computing task 114, the distributed computing performance monitoring layer 115 starts the distributed computing service layer 101 to monitor the distributed computing service layer 101. At 119, the distributed computing service layer 101 divides the distributed computing task 114 into small blocks and communicates with the cluster resource management layer 106 to obtain resource support required by distributed computing. At 120, by monitoring the distributed computing service layer 101, the distributed computing performance monitoring layer 115 may obtain an application identification of an application manager sent to the distributed computing service layer 101 by the cluster resource management layer 106. Process identifications of the containers in the cluster resource management layer 106 may be determined based on the application identification of the application manager, and thus locating and monitoring of the containers are achieved. At 117, the distributed computing performance monitoring layer 115 stores in the distributed file storage system 116 the performance indexes obtained by monitoring, for retrieval by users.
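The sequence of operations 123, 118, 119, 120, and 117 described above can be traced with a minimal in-memory sketch. Everything in it is hypothetical: the `ResourceManagementLayer` class, the `app-0001` identifier format, the block size of two, and the made-up process identifications are assumptions chosen only to make the flow executable, and a plain dictionary stands in for the distributed file storage system.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManagementLayer:
    """Hypothetical in-memory stand-in for the cluster resource management layer."""
    containers_by_app: dict = field(default_factory=dict)

    def register_application_manager(self, block_count: int) -> str:
        # Issue an application identification and allocate one container per block.
        app_id = f"app-{len(self.containers_by_app) + 1:04d}"
        self.containers_by_app[app_id] = {
            f"container-{i}": 1000 + i for i in range(block_count)  # made-up PIDs
        }
        return app_id

    def process_ids(self, app_id: str) -> dict:
        # Look up the container process identifications for one application.
        return dict(self.containers_by_app.get(app_id, {}))


def run_monitored_task(task: list, rm: ResourceManagementLayer, storage: dict) -> str:
    # 123 / 118: the monitoring layer receives the task and starts the service layer.
    # 119: the service layer divides the task into small blocks (size 2 here).
    blocks = [task[i:i + 2] for i in range(0, len(task), 2)]
    # 120: the application identification comes back from the resource layer ...
    app_id = rm.register_application_manager(len(blocks))
    # ... and is used to determine the process identifications of the containers.
    pids = rm.process_ids(app_id)
    # 117: store what was gathered; real performance indexes would go here.
    storage[app_id] = pids
    return app_id


storage = {}
app_id = run_monitored_task([1, 2, 3, 4, 5], ResourceManagementLayer(), storage)
print(app_id, storage[app_id])
```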
As shown in
As shown in
It is apparent from the above description that, in the solution of embodiments of the present disclosure, the distributed computing performance monitoring layer is configured in the distributed system and is used for locating and monitoring the containers in the cluster resource management layer, so that implicitly started task executors are found and their comprehensive performance indexes are automatically monitored. In addition, in embodiments of the present disclosure, the performance indexes of the distributed system are thoroughly monitored at two levels, namely a system level and a micro-architecture level, and the efficiency of optimizing the computing work is improved.
It should be understood that description of the architecture and function in the example environments 100A and 100B is made for illustrative purposes only and does not imply any limitation to the scope of the present disclosure. Embodiments of the present disclosure may also be applied to other environments with different architectures and/or functions.
A process of an embodiment of the present disclosure will be described in detail below with reference to
At block 204, a request for registering an application manager is sent from the node to another node. For example, in the examples shown in
At block 206, an application identification of the application manager is received from the other node by the node. For example, as shown in
At block 208, process identifications of containers in the cluster resource management layer are determined by the node based on the received application identification. For example, as shown in
At block 210, performance indexes of the distributed system are monitored based on the process identifications. For the sake of system safety, the task executors are implicit components: the identifications of the task executors that are exposed to users are obscure and are not real process identifications, so the process identifications of the task executors cannot currently be determined from those exposed identifications. However, the process identifications of the containers include time stamps and current application identifications. By determining the process identifications of the containers, the nodes in the distributed system may accurately and promptly locate the containers, and may monitor the containers to find the implicitly started task executors and automatically monitor their comprehensive performance indexes.
In this way, various nodes in the distributed system can communicate, the containers distributed on the nodes are located and monitored in time to find the implicitly started task executors, the performance indexes of the distributed system are thoroughly monitored at two levels, namely a system level and a micro-architecture level, and the efficiency of optimizing the computing work is improved.
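Because the container identifications are described as embedding a time stamp and the current application identification, containers belonging to one application can be selected by comparing those embedded fields. The sketch below assumes a hypothetical textual format, `application_<timestamp>_<sequence>` and `container_<timestamp>_<sequence>_<index>`; the disclosure does not specify a format, so the patterns and the `containers_of` helper are illustrative only.

```python
import re

# Hypothetical identifier formats; the disclosure does not define them.
APP_ID = re.compile(r"^application_(?P<ts>\d+)_(?P<seq>\d+)$")
CONTAINER_ID = re.compile(r"^container_(?P<ts>\d+)_(?P<seq>\d+)_(?P<index>\d+)$")


def containers_of(app_id: str, container_ids: list) -> list:
    """Return the container identifications that embed the given application."""
    app = APP_ID.match(app_id)
    if app is None:
        raise ValueError(f"unrecognized application identification: {app_id!r}")
    key = (app["ts"], app["seq"])
    matched = []
    for cid in container_ids:
        m = CONTAINER_ID.match(cid)
        # Keep only well-formed identifiers with the same time stamp and sequence.
        if m and (m["ts"], m["seq"]) == key:
            matched.append(cid)
    return matched


candidates = [
    "container_1705017600_0001_000001",
    "container_1705017600_0002_000001",  # belongs to a different application
    "not-a-container",
]
print(containers_of("application_1705017600_0001", candidates))
```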
The process of monitoring the distributed system will be specifically described below with reference to
In some embodiments, the monitoring layer module 311 on the node 319 receiving a distributed computing task 301 is taken as an example for explanation and illustration. At 313, the monitoring layer module 311 receives the distributed computing task 301. At 314, the monitoring layer module 311 starts a distributed computing service layer 302 on the node 319. At 315, the distributed computing service layer 302 sends a request for registering an application manager 306 to a cluster resource management layer 303. After being registered, the application manager 306 will manage and schedule computing resources for current distributed work, for example, the application manager deploys a container 305 on the node 319, deploys a container 307 on the node 320, and deploys a container 308 and a container 309 on the node 318. It should be understood that for a new distributed task, the application manager 306 needs to allocate a new container to provide resource support, and thus containers located on different nodes are not generated at one time, but updated continuously.
After the application manager 306 is registered successfully, an application identification is generated with it. At 316, the node 319 receives the application identification from the application manager 306, and in other words, a starter in the distributed computing service layer 302 receives the application identification from the application manager 306. At 317, the monitoring layer module 311 located on the node 319 receives the application identification of the application manager 306 from the distributed computing service layer 302, that is, the node 319 obtains the application identification.
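Since containers are not generated at one time but are updated continuously, a monitoring layer module would need to notice containers that appear after monitoring has begun. One possible approach, sketched below under the assumption that the module can repeatedly obtain a snapshot mapping container names to node names, is to remember which containers have already been seen and report only the new ones. The `ContainerTracker` class and the naming of containers and nodes are hypothetical.

```python
class ContainerTracker:
    """Remembers which containers have already been seen (hypothetical helper)."""

    def __init__(self) -> None:
        self._known = set()

    def update(self, snapshot: dict) -> dict:
        """Take a {container: node} snapshot and return only the new entries."""
        new = {c: n for c, n in snapshot.items() if c not in self._known}
        self._known.update(new)
        return new


tracker = ContainerTracker()
# First snapshot: only one container has been deployed so far.
print(tracker.update({"container-305": "node-319"}))
# Later snapshot: more containers were allocated for the same application.
print(tracker.update({
    "container-305": "node-319",
    "container-307": "node-320",
    "container-308": "node-318",
    "container-309": "node-318",
}))
```

Each call returns only the containers that were not present in earlier snapshots, so monitoring can be attached to them without restarting monitoring of the existing ones.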
In some embodiments, after a node 417 determines a process identification, a node 418 and a node 416 may monitor the distributed system by passively receiving the process identification determined by the node 417. At 415-1, the node 416 may receive the process identifications of the container 408 and the container 409 from the node 417. At 415-2, the node 418 may receive the process identification of the container 407 from the node 417. The passive monitoring method may reduce the step that the nodes actively read the application identification, thereby reducing communication resources required between the cluster resource management layer 403 and the distributed computing performance monitoring layer 404.
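The passive approach can be pictured as the resolving node grouping the process identifications by the node that hosts each container and forwarding each group once. The following sketch assumes the resolving node already holds a mapping from container name to a (node, process identification) pair; the `plan_notifications` function, the naming scheme, and the numeric process identifications are hypothetical.

```python
from collections import defaultdict


def plan_notifications(resolved: dict, resolver_node: str) -> dict:
    """Group container PIDs by hosting node, skipping the resolver's own node."""
    outbox = defaultdict(dict)
    for container, (node, pid) in resolved.items():
        if node != resolver_node:
            outbox[node][container] = pid
    return dict(outbox)


resolved = {
    "container-405": ("node-417", 4051),  # local to the resolving node
    "container-407": ("node-418", 4071),
    "container-408": ("node-416", 4081),
    "container-409": ("node-416", 4091),
}
print(plan_notifications(resolved, resolver_node="node-417"))
```

With this grouping, each receiving node gets exactly one message containing only the containers it hosts, which is consistent with the stated aim of reducing communication between layers.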
After process identifications are determined, at 515-3, a node 517 may monitor a container 505 according to a process identification of the container 505, and at 514-2, the node 517 may upload performance indexes obtained by monitoring to the distributed file storage system 513; at 515-4, a node 518 may monitor a container 507 according to a process identification of the container 507, and at 514-3, the node 518 may upload performance indexes obtained by monitoring to the distributed file storage system 513; and at 515-1 and 515-2, a node 516 may monitor a container 508 and a container 509 respectively according to process identifications of the container 508 and the container 509, and at 514-1, the node 516 may upload performance indexes obtained by monitoring to the distributed file storage system 513. The distributed file storage system 513 stores the received system performance indexes by category for retrieval by users.
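Storing uploaded performance indexes by category for later retrieval might look like the following sketch. It uses an in-memory class as a stand-in for the distributed file storage system; the `CategorizedIndexStore` name, the category labels, and the record fields are assumptions, since the disclosure does not describe the storage layout.

```python
from collections import defaultdict


class CategorizedIndexStore:
    """Hypothetical in-memory stand-in for the distributed file storage system."""

    def __init__(self) -> None:
        self._records = defaultdict(list)

    def upload(self, category: str, node: str, container: str,
               name: str, value: float) -> None:
        # File each uploaded index under its category.
        self._records[category].append(
            {"node": node, "container": container, "name": name, "value": value}
        )

    def retrieve(self, category: str) -> list:
        # Return a copy of the list so callers cannot reorder the stored one.
        return list(self._records.get(category, []))


store = CategorizedIndexStore()
store.upload("system", "node-517", "container-505", "memory_mb", 512.0)
store.upload("micro-architecture", "node-518", "container-507", "ipc", 1.5)
store.upload("system", "node-516", "container-508", "memory_mb", 256.0)
print(store.retrieve("system"))
```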
After obtaining the application identification, respective nodes in the distributed system may monitor an application manager 606 according to the application identification to read a log. At 613-1, a monitoring layer module 611 located at the node 616 monitors the application manager 606 through the application identification; at 613-3, a monitoring layer module 612 located at a node 617 monitors the application manager 606 through the application identification; and at 613-2, a monitoring layer module 610 located at a node 615 monitors the application manager 606 through the application identification.
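Reading a log of the application manager to find the containers hosted on a given node could be sketched as below. The log line format (`ALLOCATED container=... node=... pid=...`) is entirely hypothetical, as the disclosure does not describe the log contents; the sketch only illustrates how a monitoring layer module might filter such a log for its own node.

```python
import re

# Hypothetical log line format; not taken from the disclosure.
ALLOCATION = re.compile(
    r"ALLOCATED container=(?P<container>\S+) node=(?P<node>\S+) pid=(?P<pid>\d+)"
)


def pids_for_node(log_text: str, node: str) -> dict:
    """Extract {container: pid} for containers the log places on `node`."""
    found = {}
    for line in log_text.splitlines():
        m = ALLOCATION.search(line)
        if m and m["node"] == node:
            found[m["container"]] = int(m["pid"])
    return found


log_text = """\
INFO application manager registered
INFO ALLOCATED container=container-605 node=node-616 pid=6051
INFO ALLOCATED container=container-607 node=node-617 pid=6071
INFO ALLOCATED container=container-608 node=node-615 pid=6081
"""
print(pids_for_node(log_text, "node-616"))
```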
After the nodes obtain the performance indexes, at 714-2, the node 717 may upload the performance indexes obtained by monitoring to a distributed file storage system 713; at 714-3, the node 718 may upload the performance indexes obtained by monitoring to the distributed file storage system 713; and at 714-1, the node 716 may upload the performance indexes obtained by monitoring to the distributed file storage system 713. The distributed file storage system 713 stores the received system performance indexes by category for retrieval by users.
A plurality of parts in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard and a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk and an optical disc; and a communication unit 809, such as a network card, a modem, and a wireless communication transceiver. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.
The computing unit 801 may be various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various specialized artificial intelligence (AI) computing chips, various computing units for running machine learning model algorithms, digital signal processors (DSPs), and any appropriate processors, controllers, microcontrollers, etc. The computing unit 801 performs various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program that is tangibly included in a machine-readable medium such as the storage unit 808. In some embodiments, some or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded to the RAM 803 and executed by the computing unit 801, one or more steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to implement the method 200 in any other suitable manners (such as by means of firmware).
The functions described hereinabove may be performed at least in part by one or more hardware logic components. For example, without limitation, example types of available hardware logic components include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the method of the present disclosure may be written by using one programming language or any combination of a plurality of programming languages. The program code may be provided to a processor or controller of a general purpose computer, a special purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, implements the functions/operations specified in the flow charts and/or block diagrams. The program code may be executed completely on a machine, executed partially on a machine, executed partially on a machine and partially on a remote machine as a stand-alone software package, or executed completely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may include or store a program for use by an instruction execution system, apparatus, or device or in connection with the instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above content. More specific examples of the machine-readable storage medium may include one or more wire-based electrical connections, a portable computer diskette, a hard disk, RAM, ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combinations thereof.
Additionally, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in a sequential order, or that all illustrated operations be performed to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several specific implementation details, these should not be construed as limitations to the scope of the present disclosure. Certain features that are described in the context of separate embodiments may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in a plurality of implementations separately or in any suitable sub-combination.
Although the present subject matter has been described using a language specific to structural features and/or method logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the particular features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202410050110.2 | Jan 2024 | CN | national |