The present application claims the benefit under 35 U.S.C. § 119 of German Patent Application No. DE 10 2023 201 398.3 filed on Feb. 17, 2023, which is expressly incorporated herein by reference in its entirety.
In distributed setups, where applications are deployed across different physical nodes, it is important to actively monitor the health of an application composed of a set of interacting modules or services which may be spread across the different nodes. The monitored health of application modules or services can be used to trigger recovery mechanisms, e.g., restart the module or failover to a redundant module.
Conventional mechanisms address the problem of whether a node or application is alive (e.g., by responding to heartbeats or pings from a central coordinator). However, conventional methods are not able to consider whether an application is actually progressing. An application may be internally stalled due to livelocks or deadlocks, while another thread in the application continues to respond to the liveness checks.
According to aspects of the present invention, a method, a computer program, and a data processing apparatus are provided. Features and details of the present invention are disclosed herein. Features and details described in the context of the method according to the present invention also apply to the computer program as well as the data processing apparatus, and vice versa in each case.
One aspect of the present invention comprises a method for detecting an application progress and/or handling an application failure in a distributed system. According to an example embodiment of the present invention, the method comprises, according to a first method step, monitoring an interaction between modules of at least one application. The at least one application may be deployed across different physical nodes, particularly hardware platforms for data processing. The interaction may be carried out by exchanging messages between the modules, preferably using a message broker. Furthermore, the monitoring may be carried out at least partially using the message broker. According to another method step, the method may comprise detecting the application progress based on the monitoring. According to a further method step, the method may comprise initiating a failure handling based on the detecting. The method steps may be carried out one after the other and/or repeatedly. The present invention may thereby allow application progress to be detected and application failure to be handled in a distributed setup.
The method according to an example embodiment of the present invention may be implemented using a system setup comprising a message broker, particularly a centralized enhanced message broker, also referred to as EMB, particularly with an application progress detector, also referred to as APD, and/or a centralized orchestrator and/or a local module manager, the latter particularly on each physical node in the distributed setup. The EMB with an additional APD may be cognizant of the application graph and the interaction between the constituent modules. It may be configured to detect when an application module is not progressing, to deal with unprocessed messages accordingly, and to hand them over to the application module when it is restarted. The EMB may interact with a central orchestrator which may have a global view of the deployment of applications across different nodes, particularly hardware platforms. On each of the nodes, a local module manager (also referred to as LMM) may be used to execute commands sent by the orchestrator. It may also send information regarding the status of the modules and the node resource availability (periodically and on specific events) to the orchestrator.
According to an example embodiment of the present invention, each application may specify its static architecture (particularly the constituent modules and their interactions) to the APD and additionally specify, for each module, the messages it will publish and subscribe to in a corresponding application manifest. The application, in addition, optionally specifies in the manifest, how its constituent modules interact with each other (via messages) in a normal mode, which may then be used by the EMB to detect deviant behaviour. The manifest may also be augmented with information regarding how the broker must handle messages, e.g., by buffering or evicting them, when it detects that a module is down. The APD may monitor the interactions between different modules by monitoring the time when messages are received by a module and whether it responds correspondingly within a given time, as specified in the application manifest, or learns the trend of interactions and issues a warning when it detects a deviation from the regular behaviour. If the application does not specify specific timing details regarding the receipt and/or publishing of messages, the APD may infer a pattern and send out a warning when it observes a deviation in behaviour.
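The manifest described above can be pictured as a small data structure. The following is a minimal sketch only, assuming hypothetical names (`ModuleSpec`, `ApplicationManifest`, `response_timeout_s`, `on_module_down`) and hypothetical topic names that do not appear in the description. It shows how the published and subscribed messages per module might be used to derive the interaction edges of the application graph.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ModuleSpec:
    """Per-module entry of a hypothetical application manifest."""
    publishes: List[str] = field(default_factory=list)
    subscribes: List[str] = field(default_factory=list)
    # None means no timing was specified, so the detector would have to infer it.
    response_timeout_s: Optional[float] = None
    # Assumed policy names for messages arriving while the module is down.
    on_module_down: str = "buffer"  # or "evict"


@dataclass
class ApplicationManifest:
    application: str
    modules: Dict[str, ModuleSpec]

    def interactions(self) -> List[Tuple[str, str, str]]:
        """Return (publisher, topic, subscriber) edges of the application graph."""
        edges = []
        for pub_name, pub in self.modules.items():
            for topic in pub.publishes:
                for sub_name, sub in self.modules.items():
                    if topic in sub.subscribes:
                        edges.append((pub_name, topic, sub_name))
        return edges


# Illustrative manifest for an application "A1" with three modules.
manifest = ApplicationManifest(
    application="A1",
    modules={
        "M1": ModuleSpec(publishes=["sensor"]),
        "M2": ModuleSpec(subscribes=["sensor"], publishes=["plan"],
                         response_timeout_s=0.1),
        "M3": ModuleSpec(subscribes=["plan"], on_module_down="evict"),
    },
)
```

Deriving the edges from the publish and subscribe lists, rather than declaring them separately, is a design choice of this sketch; the description leaves the manifest format open.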
According to an example embodiment of the present invention, based on the information gathered, the APD may trigger a recovery mechanism for a failed module or also influence a more optimal deployment of the modules. Since the EMB may record the transactions for each application, it can also infer the sequence of interactions leading to a failure or blocking in a module. The proposed solution may have the advantage of avoiding the need for actively probing the application in this mechanism, since the APD infers the liveness and/or progress across different modules based on observing the messages published by different modules.
According to an example embodiment of the present invention, the application may be deployed across different nodes and therefore referred to as distributed applications. The message broker may be able to detect an application progress and to recover the distributed applications using the failure handling. The failure handling therefore may include a recovery mechanism, e.g., restarting at least a part of the application, particularly a module or service of the application, and/or a failover to a redundant module.
According to an example embodiment of the present invention, the node, also referred to as physical node, may be a hardware platform used to execute the applications, wherein the application may comprise a set of interacting modules and/or services that are spread across the different nodes of a distributed system. A distributed system may be understood as a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
Optionally, according to an example embodiment of the present invention, the monitoring is carried out at least partially by the message broker, particularly by using a publish-subscribe-mechanism. The message exchange may be carried out to provide at least one functionality of the at least one application, particularly a driving functionality for a vehicle. The vehicle may be a motor vehicle and/or a passenger vehicle and/or an autonomous vehicle configured for autonomous driving. The message broker, particularly referred to as EMB, may be an intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver. Furthermore, the message broker may provide different message delivery patterns, particularly a publication-subscriber-mechanism, and/or provide message validation and/or message transformation and/or message routing and/or message delivery guarantees and/or may simplify communication. The message broker may be part of a message-oriented middleware.
According to an example embodiment of the present invention, it is possible that each of the at least one application, particularly each of multiple applications, registers with the message broker and specifies the messages to be exchanged, particularly published or subscribed to. The monitoring may be carried out based on observing the specified messages. The usage of a publisher-subscriber-mechanism has the advantage of an efficient message exchange and monitoring of the messages to determine the application progress.
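As a rough illustration of registration and passive observation, the sketch below assumes a hypothetical in-memory broker (`ObservingBroker`) with caller-supplied timestamps. It is not the broker of the description; it only shows that a broker which already routes every message can record when each module last published without probing the module.

```python
from collections import defaultdict


class ObservingBroker:
    """Toy publish-subscribe broker that records publication times per module."""

    def __init__(self):
        self.publishers = defaultdict(set)    # topic -> modules allowed to publish
        self.subscribers = defaultdict(list)  # topic -> subscribed modules
        self.delivered = defaultdict(list)    # module -> [(topic, payload)]
        self.last_published = {}              # module -> time of last publication

    def register(self, module, publishes=(), subscribes=()):
        """A module declares which topics it will publish and subscribe to."""
        for topic in publishes:
            self.publishers[topic].add(module)
        for topic in subscribes:
            self.subscribers[topic].append(module)

    def publish(self, module, topic, payload, now):
        """Route a message and note the observation for progress detection."""
        if module not in self.publishers[topic]:
            raise ValueError(f"{module} did not register as publisher of {topic}")
        self.last_published[module] = now
        for subscriber in self.subscribers[topic]:
            self.delivered[subscriber].append((topic, payload))


broker = ObservingBroker()
broker.register("M1", publishes=["sensor"])
broker.register("M2", subscribes=["sensor"])
broker.publish("M1", "sensor", {"speed": 12}, now=5.0)
```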
According to an example embodiment of the present invention, it is also possible that an application manifest is provided by each of the at least one (or multiple) applications. The monitoring may be carried out based on the application manifest, particularly by evaluating the application manifest. The application manifest may be specific for and preferably defines at least one of the following specifications of the respective application:
According to an example embodiment of the present invention, it is also possible that the failure handling comprises sending a result of the detecting to a central orchestrator of the distributed system to initiate further actions. The central orchestrator may be used at least for spawning and/or terminating and/or suspending and/or migrating the modules, particularly by providing commands from the orchestrator to a local module manager.
According to an example embodiment of the present invention, a local module manager may be provided for deploying and/or stopping and/or starting at least one of the modules on at least one of the different physical nodes, particularly based on the commands from the orchestrator. Additionally, or alternatively, the local module manager may be provided for sending information about a status of the modules and resources of the nodes to the orchestrator. The orchestrator may provide the commands based on this information. The local module manager may be provided on each of the nodes.
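The division of labour between orchestrator and local module manager might look as follows. This is a sketch under assumptions: the class names, the command strings (`"deploy"`, `"start"`, `"stop"`), and the status values are hypothetical, and real deployment and resource reporting are reduced to dictionary updates.

```python
class LocalModuleManager:
    """Executes orchestrator commands on one node and reports module status."""

    # Assumed mapping from command to resulting module status.
    _TRANSITIONS = {"deploy": "deployed", "start": "running", "stop": "stopped"}

    def __init__(self, node):
        self.node = node
        self.status = {}

    def execute(self, command, module):
        if command not in self._TRANSITIONS:
            raise ValueError(f"unknown command: {command}")
        self.status[module] = self._TRANSITIONS[command]

    def report(self):
        """Status information sent back to the orchestrator."""
        return {"node": self.node, "modules": dict(self.status)}


class Orchestrator:
    """Holds a global view of the nodes and issues commands to their managers."""

    def __init__(self, managers):
        self.managers = {manager.node: manager for manager in managers}

    def restart(self, node, module):
        manager = self.managers[node]
        manager.execute("stop", module)
        manager.execute("start", module)
        return manager.report()


orchestrator = Orchestrator([LocalModuleManager("node-1")])
report = orchestrator.restart("node-1", "M2")
```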
Furthermore, according to an example embodiment of the present invention, the detecting the application progress may comprise at least one of the following steps:
This allows the application progress to be determined efficiently, which in turn allows a failure of the application to be detected.
According to an example embodiment of the present invention, it is possible that the monitoring comprises at least one of the following steps:
This allows the failure of the application to be detected based on the monitoring.
Additionally, according to an example embodiment of the present invention, it is possible that a learning phase is provided. The monitoring may comprise a recording of the interaction during the learning phase and may thereby specify a timeout value. Furthermore, the application failure may be detected after the learning phase based on the recorded interaction, particularly by comparing a duration between receiving an input message and publishing an output message with the specified timeout.
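The learning phase can be sketched as follows. The description does not state how the timeout is derived from the recorded interaction; the rule used here (largest observed latency times a safety margin) and the names `learn_timeout`, `is_stalled`, and `margin` are assumptions made purely for illustration.

```python
def learn_timeout(observed_latencies, margin=1.5):
    """Derive a timeout from input-to-output latencies recorded while learning.

    Assumed rule: worst observed latency scaled by a safety margin.
    """
    if not observed_latencies:
        raise ValueError("learning phase recorded no interactions")
    return max(observed_latencies) * margin


def is_stalled(input_time, output_time, now, timeout):
    """Compare the duration between input and output with the learned timeout.

    If no output has been published yet, the elapsed time up to `now` is used.
    """
    end = output_time if output_time is not None else now
    return end - input_time > timeout


timeout = learn_timeout([0.1, 0.2, 0.4], margin=2.0)
```

A percentile or a mean plus several standard deviations would be equally plausible choices; the maximum is used here only because it is the simplest rule that never flags behaviour already seen during learning.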
In another aspect of the present invention, a computer program may be provided, in particular a computer program product, comprising instructions which, when the computer program is executed by a computer, cause the computer to carry out the method according to the present invention. Thus, the computer program according to the present invention can have the same advantages as have been described in detail with reference to a method according to the present invention.
In another aspect of the present invention, an apparatus for data processing, also referred to as data processing apparatus, may be provided, which is configured to execute the method according to the present invention. As the apparatus, for example, a computer can be provided which executes the computer program according to the present invention. The computer may include at least one processor that can be used to execute the computer program. Also, a non-volatile data memory may be provided in which the computer program may be stored and from which the computer program may be read by the processor for being carried out.
According to another aspect of the present invention a computer-readable storage medium may be provided which comprises the computer program according to the present invention. The storage medium may be formed as a data storage device such as a hard disk and/or a non-volatile memory and/or a memory card and/or a solid-state drive. The storage medium may, for example, be integrated into the computer.
Furthermore, the method according to the present invention may be implemented as a computer-implemented method.
Further advantages, features and details of the present invention will be apparent from the following description, in which embodiments of the present invention are described in detail with reference to the figures. In this connection, the features mentioned herein may each be essential to the present invention individually or in any combination.
In the following figures, the identical reference signs are used for the same technical features even of different embodiment examples.
Many modern distributed systems rely on a message broker to support different message delivery patterns, provide message delivery guarantees and in general simplify communication. A common pattern supported is the Publication-Subscriber-Mechanism, or short Pub-Sub, as shown in
According to a first method step 101, a monitoring may be carried out. This may include monitoring an interaction 40 between modules M1, M2, M3, M4 of at least one application A1, A2, as shown in
Furthermore, an application manifest 60 may be provided by each of the at least one applications A1, A2 and the monitoring 101 may be carried out based on the application manifest 60.
The failure handling may comprise sending a result of the detecting 102 to a central orchestrator 70 of the distributed system 1 to initiate further actions. The central orchestrator 70 may provide commands to a local module manager 90.
Embodiments of the present invention may allow the detection of an application progress, particularly an inactivity, in distributed applications and preferably handling messages by the message broker according to the application semantics to deal with application recovery. Specifically, the node 50 may be up, but the hosted application may not be actively progressing for various reasons, e.g., livelocks, deadlocks, or simply because it was not designed to handle certain inputs, causing it to block. In a distributed setup, such a module may be receiving inputs, but not reacting to them and not processing them to publish outputs.
As exemplarily shown in
The central orchestrator 70 may be configured to control the application lifecycle, e.g., spawning, terminating, suspending, and migrating application modules, and mapping them to the right nodes 50 to meet their QoS (Quality of Service) requirements and to balance the load of the system 1, and the like.
The local module manager 90 (LMM 90), as exemplarily shown in
As shown in
Exemplary algorithms according to the four cases are described below. Depending on the specifications of the expected behaviour (see above) by the application, the APD 80 may take different actions. According to a first case of “absolute time”:
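One possible reading of the “absolute time” case is that every input delivered to a module must be answered within a fixed deadline. The sketch below assumes this reading, assumes that outputs can be correlated with inputs through a message identifier, and uses the hypothetical name `AbsoluteTimeMonitor`; none of these details are confirmed by the description.

```python
class AbsoluteTimeMonitor:
    """Tracks inputs that have not been answered within a fixed deadline."""

    def __init__(self, deadline_s):
        self.deadline_s = deadline_s
        self.pending = {}  # message id -> time the input was delivered

    def on_input(self, msg_id, now):
        self.pending[msg_id] = now

    def on_output(self, msg_id):
        # The module responded, so the input is no longer pending.
        self.pending.pop(msg_id, None)

    def overdue(self, now):
        """Return ids of inputs whose deadline has passed without a response."""
        return sorted(
            msg_id
            for msg_id, received in self.pending.items()
            if now - received > self.deadline_s
        )


monitor = AbsoluteTimeMonitor(deadline_s=5)
monitor.on_input("a", now=0)
monitor.on_input("b", now=3)
monitor.on_output("a")
```

A non-empty result from `overdue` would be the point at which a detector could raise a warning or report to the orchestrator.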
According to a second case of “Backlog”, the module may specify a maximum backlog of unprocessed messages:
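The backlog case can be illustrated with a simple counter. This sketch assumes that each published output accounts for exactly one consumed input, which the description does not state; `BacklogMonitor` and its method names are hypothetical.

```python
class BacklogMonitor:
    """Counts unprocessed inputs and flags when the allowed backlog is exceeded."""

    def __init__(self, max_backlog):
        self.max_backlog = max_backlog
        self.backlog = 0

    def on_input(self):
        self.backlog += 1

    def on_output(self):
        # Assumption: one output corresponds to one processed input.
        self.backlog = max(0, self.backlog - 1)

    def violated(self):
        return self.backlog > self.max_backlog


backlog_monitor = BacklogMonitor(max_backlog=2)
for _ in range(3):
    backlog_monitor.on_input()
```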
According to a third case of “m-of-k”, the module may specify a constraint in which m of every k inputs must be processed:
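The m-of-k constraint might be checked over a sliding window of the most recent k inputs. The sliding-window interpretation, the decision to judge only once k inputs have been seen, and the name `MOfKMonitor` are assumptions of this sketch.

```python
from collections import deque


class MOfKMonitor:
    """Checks that at least m of the last k inputs were processed."""

    def __init__(self, m, k):
        if not 0 < m <= k:
            raise ValueError("require 0 < m <= k")
        self.m = m
        self.k = k
        self.window = deque(maxlen=k)  # True if the input was processed

    def record(self, processed):
        self.window.append(bool(processed))

    def violated(self):
        # Only judge once a full window of k inputs has been observed.
        return len(self.window) == self.k and sum(self.window) < self.m


mk_monitor = MOfKMonitor(m=2, k=3)
for processed in (True, False, False):
    mk_monitor.record(processed)
```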
According to a fourth case of having no specific information:
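Without any specification, a detector could only rely on the regularity of the messages a module publishes. The sketch below assumes one simple way of inferring a pattern, namely the mean gap between publications plus a multiple of its standard deviation; the function names and the `n_sigma` parameter are hypothetical, and the description does not prescribe this statistic.

```python
import statistics


def infer_gap_threshold(publish_times, n_sigma=3.0):
    """Infer the largest expected gap between publications from past timestamps."""
    gaps = [later - earlier
            for earlier, later in zip(publish_times, publish_times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three publications to infer a pattern")
    return statistics.mean(gaps) + n_sigma * statistics.pstdev(gaps)


def deviates(last_publish_time, now, threshold):
    """True if the module has been silent for longer than the inferred gap."""
    return now - last_publish_time > threshold


threshold = infer_gap_threshold([0, 1, 2, 3])
```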
When the application module is relaunched, depending on the specifications in the application manifest 60, either all unprocessed messages may be forwarded to the relaunched module, or the last “k” unprocessed messages may be forwarded to the relaunched module.
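The two forwarding options on relaunch can be sketched as a selection over the buffered messages. The policy names `"all"` and `"last_k"` and the function name `messages_to_replay` are assumptions; the description only states that either all or the last “k” unprocessed messages may be forwarded.

```python
def messages_to_replay(unprocessed, policy="all", k=None):
    """Select which buffered messages are forwarded to a relaunched module."""
    if policy == "all":
        return list(unprocessed)
    if policy == "last_k":
        if k is None or k < 0:
            raise ValueError("policy 'last_k' needs a non-negative k")
        # A slice with -0 would return everything, so k == 0 is handled explicitly.
        return list(unprocessed[-k:]) if k else []
    raise ValueError(f"unknown policy: {policy}")


buffered = ["m1", "m2", "m3", "m4"]
```

For a module whose inputs become stale quickly, such as periodic sensor data, forwarding only the most recent messages is likely the more sensible choice, whereas a module that must not lose commands would use the first option.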
The message broker 30 and the orchestrator 70 may be configured as different components or may be also integrated in one component, so that orchestration 70 and message brokering 30 are carried out as two different sub-components in a single application. The enhanced message broker 30 may also delegate the application progress detection 80 responsibilities to the local module manager 90, so that it functions not as one central component, but rather as a distributed component.
The above explanation of the embodiments describes the present invention in the context of examples. Of course, individual features of the embodiments can be freely combined with each other, provided that this is technically reasonable, without leaving the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10 2023 201 398.3 | Feb 2023 | DE | national |