METHOD FOR DETECTING AN APPLICATION PROGRESS AND HANDLING AN APPLICATION FAILURE IN A DISTRIBUTED SYSTEM

Information

  • Patent Application
  • 20240282151
  • Publication Number
    20240282151
  • Date Filed
    November 15, 2023
  • Date Published
    August 22, 2024
Abstract
A method for detecting an application progress and handling an application failure in a distributed system. The method includes: monitoring an interaction between modules of at least one application, the at least one application being deployed across different physical nodes, the interaction being carried out by exchanging messages between the modules using a message broker, the monitoring being carried out at least partially using the message broker; detecting the application progress based on the monitoring; initiating a failure handling based on the detecting.
Description
CROSS REFERENCE

The present application claims the benefit under 35 U.S.C. § 119 of German Patent Application No. DE 10 2023 201 398.3 filed on Feb. 17, 2023, which is expressly incorporated herein by reference in its entirety.


BACKGROUND INFORMATION

In distributed setups, where applications are deployed across different physical nodes, it is important to actively monitor the health of an application composed of a set of interacting modules or services which may be spread across the different nodes. The monitored health of application modules or services can be used to trigger recovery mechanisms, e.g., restart the module or failover to a redundant module.


Conventional mechanisms address the problem of whether a node or application is alive (e.g., by responding to heartbeats or pings from a central coordinator). However, conventional methods are not able to determine whether an application is actually progressing. An application may be internally stalled due to livelocks or deadlocks, while another thread in the application simply keeps responding to the liveness checks.


SUMMARY

According to aspects of the present invention, a method, a computer program, and a data processing apparatus are provided. Features and details of the present invention are disclosed herein. Features and details described in the context of the method according to the present invention also correspond to the computer program as well as the data processing apparatus, and vice versa in each case.


One aspect of the present invention comprises a method for detecting an application progress and/or handling an application failure in a distributed system. According to an example embodiment of the present invention, the method comprises, according to a first method step, monitoring an interaction between modules of at least one application. The at least one application may be deployed across different physical nodes, particularly hardware platforms for data processing. The interaction may be carried out by exchanging messages between the modules, preferably using a message broker. Furthermore, the monitoring may be carried out at least partially using the message broker. According to another method step, the method may comprise detecting the application progress based on the monitoring. According to a further method step, the method may comprise initiating a failure handling based on the detecting. The method steps may be carried out one after the other and/or repeatedly. The present invention may thereby allow application progress to be detected and application failure to be handled in a distributed setup.


The method according to an example embodiment of the present invention may be implemented using a system setup comprising a message broker, particularly a centralized enhanced message broker, also referred to as EMB, particularly with an application progress detector, also referred to as APD, and/or a centralized orchestrator and/or a local module manager, the latter particularly on each physical node in the distributed setup. The EMB with an additional APD may be cognizant of the application graph and the interaction between the constituent modules. It may be configured to detect when an application module is not progressing, to deal with unprocessed messages accordingly, and to hand them over to the application module when it is restarted. The EMB may interact with a central orchestrator which may have a global view of the deployment of applications across different nodes, particularly hardware platforms. On each of the nodes, a local module manager (also referred to as LMM) may be used to execute commands sent by the orchestrator. It may also send information regarding the status of the modules and the node resource availability information (periodically and on specific events) to the orchestrator.


According to an example embodiment of the present invention, each application may specify its static architecture (particularly the constituent modules and their interactions) to the APD and additionally specify, for each module, the messages it will publish and subscribe to in a corresponding application manifest. The application, in addition, optionally specifies in the manifest, how its constituent modules interact with each other (via messages) in a normal mode, which may then be used by the EMB to detect deviant behaviour. The manifest may also be augmented with information regarding how the broker must handle messages, e.g., by buffering or evicting them, when it detects that a module is down. The APD may monitor the interactions between different modules by monitoring the time when messages are received by a module and whether it responds correspondingly within a given time, as specified in the application manifest, or learns the trend of interactions and issues a warning when it detects a deviation from the regular behaviour. If the application does not specify specific timing details regarding the receipt and/or publishing of messages, the APD may infer a pattern and send out a warning when it observes a deviation in behaviour.


According to an example embodiment of the present invention, based on the information gathered, the APD may trigger a recovery mechanism for a failed module or also influence a more optimal deployment of the modules. Since the EMB may record the transactions for each application, it can also infer the sequence of interactions leading to a failure or blocking in a module. The proposed solution may have the advantage of avoiding the need for actively probing the application in this mechanism, since the APD infers the liveness and/or progress across different modules based on observing the messages published by different modules.


According to an example embodiment of the present invention, the applications may be deployed across different nodes and are therefore referred to as distributed applications. The message broker may be able to detect an application progress and to recover the distributed applications using the failure handling. The failure handling therefore may include a recovery mechanism, e.g., restarting at least a part of the application, particularly a module or service of the application, and/or a failover to a redundant module.


According to an example embodiment of the present invention, the node, also referred to as physical node, may be a hardware platform used to execute the applications, wherein the application may comprise a set of interacting modules and/or services that are spread across the different nodes of a distributed system. A distributed system may be understood as a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system.


Optionally, according to an example embodiment of the present invention, the monitoring is carried out at least partially by the message broker, particularly by using a publish-subscribe-mechanism. The message exchange may be carried out to provide at least one functionality of the at least one application, particularly a driving functionality for a vehicle. The vehicle may be a motor vehicle and/or a passenger vehicle and/or an autonomous vehicle configured for autonomous driving. The message broker, particularly referred to as EMB, may be an intermediary computer program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver. Furthermore, the message broker may provide different message delivery patterns, particularly a publish-subscribe-mechanism, and/or provide message validation and/or message transformation and/or message routing and/or message delivery guarantees and/or may simplify communication. The message broker may be part of a message-oriented middleware.


According to an example embodiment of the present invention, it is possible that each of the at least one application, particularly each of multiple applications, registers with the message broker and specifies the messages to be exchanged, particularly published or subscribed to. The monitoring may be carried out based on observing the specified messages. The usage of a publish-subscribe-mechanism has the advantage of an efficient message exchange and monitoring of the messages to determine the application progress.
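
By way of a non-limiting illustration, the registration and observation described above may be sketched as follows. This is a minimal in-process sketch under stated assumptions, not an implementation prescribed by the present application: the class name Broker, its methods register and publish, and the attribute observed are all hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy in-process broker; every name here is a hypothetical illustration."""

    def __init__(self):
        self.publishers = defaultdict(set)    # message name -> registered publishers
        self.subscribers = defaultdict(list)  # message name -> registered subscribers
        self.observed = []                    # (publisher, message name) seen so far

    def register(self, module, publishes=(), subscribes=()):
        # A module declares up front which messages it publishes and subscribes to.
        for name in publishes:
            self.publishers[name].add(module)
        for name in subscribes:
            self.subscribers[name].append(module)

    def publish(self, module, name, payload):
        # Only declared messages are accepted, so monitoring can rely on the registry.
        if module not in self.publishers[name]:
            raise ValueError(f"{module} did not register as publisher of {name}")
        # Record the publication; this log is what a progress detector would observe.
        self.observed.append((module, name))
        # Return the deliveries the broker would make to each subscriber.
        return [(subscriber, payload) for subscriber in self.subscribers[name]]
```

Because every publication passes through publish, the log in observed accumulates the information needed for monitoring without any active probing of the modules.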


According to an example embodiment of the present invention, it is also possible that an application manifest is provided by each of the at least one (or multiple) applications. The monitoring may be carried out based on the application manifest, particularly by evaluating the application manifest. The application manifest may be specific for and preferably defines at least one of the following specifications of the respective application:

    • a topology,
    • the interactions among the modules,
    • the requirements,
    • a list of the modules,
    • a list of messages published and/or subscribed by each of the modules,
    • information about the timing behaviour for the interactions, particularly the time between receiving input and publishing output messages and/or a maximum backlog of input messages before which an output is expected and/or a minimum number of messages that must be processed successfully,
    • a recovery mechanism to be carried out as part of the failure handling.


According to an example embodiment of the present invention, it is also possible that the failure handling comprises sending a result of the detecting to a central orchestrator of the distributed system to initiate further actions. The central orchestrator may be used at least for spawning and/or terminating and/or suspending and/or migrating the modules, particularly by providing commands from the orchestrator to a local module manager.


According to an example embodiment of the present invention, a local module manager may be provided for deploying and/or stopping and/or starting at least one of the modules on at least one of the different physical nodes, particularly based on the commands from the orchestrator. Additionally, or alternatively, the local module manager may be provided for sending information about a status of the modules and resources of the nodes to the orchestrator. The orchestrator may provide the commands based on this information. The local module manager may be provided on each of the nodes.


Furthermore, according to an example embodiment of the present invention, the detecting the application progress may comprise at least one of the following steps:

    • determining a sequence of messages that are erroneously not processed by at least one of the modules,
    • detecting a failure of at least one of the modules based on the monitoring, particularly based on the determining of the sequence of unprocessed messages,
    • backtracking through the sequence of unprocessed messages for a diagnosis of the source of the failure.


This allows the application progress to be determined efficiently, which in turn allows a failure of the application to be detected.
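
By way of a non-limiting illustration, the backtracking step listed above may be sketched as follows. The sketch assumes that the broker's log records, for each message instance, which earlier message instances it was derived from; this record (the caused_by field), the LogEntry type, and the function backtrack are hypothetical and are not defined by the present application.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    msg_id: str                 # e.g. message name plus instance number
    publisher: str              # module that published the message
    caused_by: list = field(default_factory=list)  # ids of the inputs it was derived from


def backtrack(log, msg_id, seen=None):
    """Return the ids leading up to msg_id, causes first, msg_id last."""
    seen = set() if seen is None else seen
    if msg_id in seen or msg_id not in log:
        return []
    seen.add(msg_id)
    chain = []
    # Visit all recorded causes before the message itself (post-order walk).
    for cause in log[msg_id].caused_by:
        chain += backtrack(log, cause, seen)
    chain.append(msg_id)
    return chain
```

Calling backtrack with the oldest message that the failed module left unprocessed yields the sequence of message instances that preceded the stall, which may then be attached to the failure logs.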


According to an example embodiment of the present invention, it is possible that the monitoring comprises at least one of the following steps:

    • determining a duration between receiving an input message and publishing an output message, and particularly detecting the application failure in case the determined duration exceeds a predefined maximum, preferably according to a definition by an application manifest,
    • determining a number (i.e., the amount) of unprocessed messages and detecting the application failure in case the determined number exceeds a predefined maximum, particularly according to a definition by an application manifest,
    • determining a number (i.e., the amount) of processed messages and detecting the application failure in case the determined number falls below a predefined minimum, particularly according to a definition by an application manifest.


This allows the failure of the application to be detected based on the monitoring.


Additionally, according to an example embodiment of the present invention, it is possible that a learning phase is provided. The monitoring may comprise a recording of the interaction during the learning phase and may thereby specify a timeout value. Furthermore, the application failure may be detected after the learning phase based on the recorded interaction, particularly by comparing a duration between receiving an input message and publishing an output message with the specified timeout.


In another aspect of the present invention, a computer program may be provided, in particular a computer program product, comprising instructions which, when the computer program is executed by a computer, cause the computer to carry out the method according to the present invention. Thus, the computer program according to the present invention can have the same advantages as have been described in detail with reference to a method according to the present invention.


In another aspect of the present invention, an apparatus for data processing, also referred to as data processing apparatus, may be provided, which is configured to execute the method according to the present invention. As the apparatus, for example, a computer can be provided which executes the computer program according to the present invention. The computer may include at least one processor that can be used to execute the computer program. Also, a non-volatile data memory may be provided in which the computer program may be stored and from which the computer program may be read by the processor for being carried out.


According to another aspect of the present invention a computer-readable storage medium may be provided which comprises the computer program according to the present invention. The storage medium may be formed as a data storage device such as a hard disk and/or a non-volatile memory and/or a memory card and/or a solid-state drive. The storage medium may, for example, be integrated into the computer.


Furthermore, the method according to the present invention may be implemented as a computer-implemented method.


Further advantages, features and details of the present invention will be apparent from the following description, in which embodiments of the present invention are described in detail with reference to the figures. In this connection, the features mentioned herein may each be essential to the present invention individually or in any combination.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows a method, computer program and apparatus according to embodiments of the present invention.



FIG. 2 shows a schematic visualization of embodiments of the present invention.



FIG. 3 shows another schematic visualization of embodiments of the present invention.





DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

In the following figures, identical reference signs are used for the same technical features, even in different exemplary embodiments.


Many modern distributed systems rely on a message broker to support different message delivery patterns, provide message delivery guarantees and in general simplify communication. A commonly supported pattern is the publish-subscribe-mechanism, or Pub-Sub for short, as shown in FIG. 2, which allows information to be disseminated between data producers (publishers) and data consumers (subscribers), where publishers forward their data through the message broker 30. Pub-Sub is a central piece of many IoT and cloud infrastructures, and it can be found in many popular distributed systems 1 today. A specific problem may be to identify modules or services communicating over Pub-Sub that are not progressing. In such a distributed setup, an application module may be receiving certain messages, but not reacting to them. The problem is then how to identify a blocked module, in such a case, without deeper insights into the application logic (the application may be regarded as a black box). The task gets more challenging in a distributed setup composed of applications with complex interactions 40 among different modules. In short, conventional mechanisms usually only deal with the problem of detecting whether a node 50 or application is alive without monitoring application progress. Liveness detection is carried out by some protocols at the message level to meet certain QoS requirements, but it does not address the bigger problem of analysing whether a module is down. Furthermore, conventional liveness detectors apply application-agnostic liveness checks, and do not detect deviation in application behaviour with respect to its interactions 40 with others. In addition, in distributed setups, large applications have no conventional mechanisms to trace back application inactivity to specific modules, and then trigger application-specific mechanisms from the perspective of the message broker 30.



FIG. 1 shows a method 100 according to embodiments of the present invention for detecting an application progress and handling an application failure in a distributed system 1. Also, a computer program 20 and a data processing apparatus 10 according to embodiments of the present invention are shown.


According to a first method step 101, a monitoring may be carried out. This may include monitoring an interaction 40 between modules M1, M2, M3, M4 of at least one application A1, A2, as shown in FIGS. 2 and 3. The at least one application A1, A2 may be deployed across different physical nodes 50. Furthermore, the interaction 40 may be carried out by exchanging messages between the modules M1, M2, M3, M4 using a message broker 30. The message exchange may be carried out to provide at least one functionality of the at least one application A1, A2, particularly a driving functionality for a vehicle 5. The monitoring may be carried out at least partially using the message broker 30. According to a second method step 102, the application progress may be detected based on the monitoring. According to a third method step 103, a failure handling may be initiated based on the detecting.


Furthermore, an application manifest 60 may be provided by each of the at least one applications A1, A2 and the monitoring 101 may be carried out based on the application manifest 60.


The failure handling may comprise sending a result of the detecting 102 to a central orchestrator 70 of the distributed system 1 to initiate further actions. The central orchestrator 70 may provide commands to a local module manager 90.


Embodiments of the present invention may allow the detection of an application progress, particularly an inactivity, in distributed applications and preferably handling messages by the message broker according to the application semantics to deal with application recovery. Specifically, the node 50 may be up, but the hosted application may not be actively progressing for various reasons, such as livelocks, deadlocks, or simply because it was not designed to handle certain inputs, causing it to block. In a distributed setup, such a module may be receiving inputs, but not reacting to them and not processing them to publish outputs.


As exemplarily shown in FIG. 3, the message broker 30, particularly referred to as enhanced message broker 30 or EMB 30, according to embodiments of the present invention, may be configured to establish and implement the publish and subscribe mechanism between different applications. Furthermore, application modules may register with the EMB 30 and specify which messages they publish or subscribe to. The EMB 30 may therefore be cognizant of which modules are actively publishing or subscribing to a specific message. Furthermore, the EMB 30 may additionally have an application progress detector 80 (also referred to as APD 80) which also reads the application manifest 60 and monitors the interactions 40 of each module. It may detect when a module is inactive, i.e., particularly not progressing due to say deadlocks, and may send this information to the central orchestrator 70 to trigger further actions (see 302). It may also log information regarding the messages published and subscribed by each of the modules (see 303, wherein 301 refers to a database 301 used for storing the logged information). The APD 80 may be realized as a submodule of the EMB 30 and may work in tandem with a central orchestrator 70 in a distributed system 1. Furthermore, the EMB 30 may read the application manifest 60 to also decide on the buffering policy (retention policy) of messages for modules that fail. Another use for the EMB 30 may be failure analysis, since the EMB 30 has a knowledge of the interactions 40 among the different modules and a trace of which messages were not processed by a module. When it detects that a module is down, it may backtrack through the sequence of messages leading to the failure. This is especially useful in large applications with complex interactions 40 across different modules. In many cases, an unhandled input sequence (or out of range input) may lead to unexpected behaviour and application stalls. 
The EMB 30 can then reconstruct the sequence of messages across modules that led to the failure and send it to the local module manager 90, which forwards it to the failure logs of the failed application to support deeper diagnosis.


The central orchestrator 70 may be configured to control the application lifecycle, e.g., spawning, terminating, suspending, migrating application modules, mapping them to the right nodes 50 to meet their QoS (Quality of Service) requirements and to balance system 1 loads and the like.


The local module manager 90 (LMM 90), as exemplarily shown in FIG. 3, may reside on each node 50 and execute commands from the orchestrator 70 to deploy and/or stop and/or start a module on a given node 50. It may also send information regarding the status of the modules and the node resource availability information (periodically and on specific events) to the orchestrator 70, thereby enabling the orchestrator 70 to make informed decisions.


As shown in FIG. 3, the application manifest 60 may describe the topology of the application, the interactions 40 among the modules and its requirements. The application manifest 60 may comprise a list of the modules and/or an indication of QoS requirements of the application like end-to-end requirements of the application and/or, for each module, at least one of the following:

    • a list of messages published by the module, wherein each message may be identified by a message name and every instance of the message by a message instance number,
    • a list of the messages subscribed by the module,
    • a correlation between the input and output messages (e.g., the module reads input messages T1 and T2 to publish to message T3),
    • a normal module interaction information, wherein the module may in addition give hints regarding the expected or normal timing behaviour (for each set of inputs and outputs), wherein this may be used by the APD 80 to detect deviations or outliers and raise alarms, and this could include either of the following information:
      • an absolute time, particularly the maximum time between receiving inputs and publishing output messages. If the inputs arrive at different times, the processing semantics may be specified. For example, if module M1 reads T1 and T2, the later arrival time of the two inputs may be used and the time to produce the output may be computed from there, or, for example, module M1 may read the latest values of T1 and T2 periodically, and then produce T3,
      • a backlog or maximum backlog of inputs before which an output is expected,
      • an m-out-of-k constraint: some modules may be more resilient to intermittent failures (in the network, etc.). For such modules, an “m out of k” approach is useful. This means the module specifies that out of every “k” sets of inputs, at least “m” must be processed successfully,
      • no specific information,
    • expected timing activation patterns of the inputs, for example:
      • periodic, with a specified period,
      • sporadic, with a minimum inter-arrival time,
      • or with arrival curves,
    • a recovery mechanism, particularly the recovery action that must be taken when the module is down, for example:
      • restart the module on the same node 50,
      • kill the module,
      • kill the module and relaunch it on another node 50,
    • a message retention policy, particularly the attributes how the input topics must be handled by the message broker if the receiving module is down, for example:
      • Buffer the last “k” messages and forward them to the application module when it is respawned, or
      • Do not buffer any messages
    • attributes of the hardware required by the module.
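
By way of a non-limiting illustration, the manifest entries listed above may be captured in a data structure along the following lines. This is a sketch under stated assumptions: the present application does not define a concrete manifest format, and every class and field name below (ModuleSpec, max_latency_s, retain_last_k, and so on) is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Recovery(Enum):
    RESTART_SAME_NODE = "restart"
    KILL = "kill"
    RELAUNCH_OTHER_NODE = "relaunch"


@dataclass
class ModuleSpec:
    name: str
    publishes: list[str] = field(default_factory=list)
    subscribes: list[str] = field(default_factory=list)
    max_latency_s: Optional[float] = None      # "absolute time" hint
    max_backlog: Optional[int] = None          # "backlog" hint
    m_of_k: Optional[tuple[int, int]] = None   # (m, k) hint
    recovery: Recovery = Recovery.RESTART_SAME_NODE
    retain_last_k: int = 0                     # 0 means "do not buffer"

    def timing_case(self) -> str:
        # Pick which of the four cases the progress detector would apply.
        if self.max_latency_s is not None:
            return "absolute_time"
        if self.max_backlog is not None:
            return "backlog"
        if self.m_of_k is not None:
            return "m_of_k"
        return "none"


@dataclass
class ApplicationManifest:
    application: str
    modules: list[ModuleSpec]

    def subscribers_of(self, message: str) -> list[str]:
        # Modules to notify when the publisher of this message is down.
        return [m.name for m in self.modules if message in m.subscribes]
```

For example, a module M1 that reads T1 and T2 and publishes T3 within half a second could be declared as ModuleSpec("M1", publishes=["T3"], subscribes=["T1", "T2"], max_latency_s=0.5).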


Exemplary algorithms for the four cases are described below. Depending on the specifications of the expected behaviour (see above) by the application, the APD 80 may take different actions. According to a first case of “absolute time”:

    • 1. The module receives a message to which it has subscribed (input),
    • 2. The module sends an acknowledgement to the APD 80 on receipt of the input message(s), together with a timestamp,
    • 3. The APD 80 on the message broker 30 records the time at which the application received the input message(s),
    • 4. The APD 80 sets a timeout counter corresponding to the maximum time within which the application must publish an output message,
    • 5. If the application behaves normally and publishes the output message before the timeout, the APD 80 records the time of receiving the output and the broker forwards the output to the interested subscribers of the output message,
    • 6. If the application does not publish, e.g., output message X before the timeout:
      • a. The timeout interrupt goes off,
      • b. The APD 80 informs the subscribers of the message X that the publisher is down,
      • c. If connected to a central orchestrator 70, it informs the orchestrator 70 that the application is down,
      • d. The orchestrator 70 accordingly triggers the recovery mechanism at the local module manager 90 as per the module specification,
      • e. The orchestrator 70 triggers the deadlock detector/diagnosis module to understand the cause for the deadlock.
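
By way of a non-limiting illustration, steps 2 to 6 of the “absolute time” case may be sketched as follows. The sketch uses explicit timestamps instead of a timer interrupt and assumes that the deadline runs from the most recently acknowledged input; the class AbsoluteTimeDetector and its methods are hypothetical names.

```python
class AbsoluteTimeDetector:
    """Tracks one module's input-to-output deadline using explicit timestamps."""

    def __init__(self, max_latency_s: float):
        self.max_latency_s = max_latency_s
        self.deadline = None  # None means no input is currently pending

    def on_input_ack(self, timestamp: float) -> None:
        # Steps 2 to 4: record the receipt time and arm the timeout.
        # Assumption: the deadline is measured from the latest acknowledged input.
        self.deadline = timestamp + self.max_latency_s

    def on_output(self, timestamp: float) -> None:
        # Step 5: an output published before the deadline clears the timeout.
        if self.deadline is not None and timestamp <= self.deadline:
            self.deadline = None

    def is_stalled(self, now: float) -> bool:
        # Step 6: the timeout expired without an output being published.
        return self.deadline is not None and now > self.deadline
```

When is_stalled returns True, the notifications and recovery actions of steps 6.b to 6.e would follow; they are omitted here because they depend on the orchestrator interface.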


According to a second case of “Backlog”, the module may specify a maximum backlog of unprocessed messages:

    • 1. The APD 80 keeps count of the input messages not yet processed by the module (backlog),
    • 2. When the backlog exceeds the threshold, it raises an alarm and does the following:
      • a. The APD 80 informs the subscribers of the message X that the publisher is down,
      • b. If connected to a central orchestrator 70, it informs the orchestrator 70 that the application is down,
      • c. The orchestrator 70 accordingly triggers the recovery mechanism at the local module manager 90 as per the module specification,
      • d. The orchestrator 70 triggers the deadlock detector/diagnosis module to understand the cause for the deadlock.
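
By way of a non-limiting illustration, the backlog accounting of the second case may be sketched as follows. The class BacklogDetector and its method names are hypothetical, and the sketch assumes that each published output consumes a known number of inputs (one by default).

```python
class BacklogDetector:
    """Counts inputs delivered to a module that have not yet led to an output."""

    def __init__(self, max_backlog: int):
        self.max_backlog = max_backlog
        self.backlog = 0

    def on_input_delivered(self) -> None:
        # Step 1: every delivered input increases the backlog.
        self.backlog += 1

    def on_output_published(self, inputs_consumed: int = 1) -> None:
        # Assumption: an output accounts for a known number of inputs.
        self.backlog = max(0, self.backlog - inputs_consumed)

    def alarm(self) -> bool:
        # Step 2: the alarm is raised once the backlog exceeds the threshold.
        return self.backlog > self.max_backlog
```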


According to a third case of “m-of-k”, the module may specify a constraint in which m of every k inputs must be processed:

    • 1. The APD 80 keeps a history of the last “k” inputs received by the module and how many of these were processed, for example, using a sliding window or a circular log history of size k,
    • 2. When the number of processed messages is less than “m” in the window of the last k messages, then
      • a. The APD 80 informs the subscribers of the message X that the publisher is down,
      • b. If connected to a central orchestrator 70, it informs the orchestrator 70 that the application is down,
      • c. The orchestrator 70 accordingly triggers the recovery mechanism at the local module manager 90 as per the module specification,
      • d. The orchestrator 70 triggers the deadlock detector/diagnosis module to understand the cause for the deadlock.
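
By way of a non-limiting illustration, the sliding window of the third case may be sketched with a bounded double-ended queue. The class MOfKDetector is a hypothetical name, and the sketch assumes that the constraint is only evaluated once k inputs have been observed.

```python
from collections import deque


class MOfKDetector:
    """Checks that at least m of the last k inputs were processed."""

    def __init__(self, m: int, k: int):
        if not 0 < m <= k:
            raise ValueError("require 0 < m <= k")
        self.m = m
        self.k = k
        # A deque with maxlen=k drops the oldest entry automatically.
        self.window = deque(maxlen=k)

    def record(self, processed: bool) -> None:
        # Step 1: remember whether the latest input was processed.
        self.window.append(processed)

    def violated(self) -> bool:
        # Step 2: fewer than m processed inputs in a full window of k.
        # Assumption: no verdict is given before the window is full.
        return len(self.window) == self.k and sum(self.window) < self.m
```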


According to a fourth case of having no specific information:

    • 1. The APD 80 records the time of an application receiving the input messages and over time, derives (learns) a trend of subscribe/publish behaviour using heuristics or ML techniques,
    • 2. The APD 80 auto-learns this “safe range of time” beyond which it characterizes it as an erroneous situation, and sets a timeout value, relative to the last input received,
    • 3. If the application behaves normally and publishes the output message before the timeout, the APD 80 records the time of receiving the output and the broker forwards the output to the interested subscribers of the output message,
    • 4. If the application does not publish output message X before the timeout:
      • a. The timeout interrupt goes off,
      • b. The APD 80 informs the subscribers of the message X that the publisher is down,
      • c. If connected to a central orchestrator 70, it informs the orchestrator 70 that the application is down,
      • d. The orchestrator 70 accordingly triggers the recovery mechanism at the local module manager 90 as per the module specification,
      • e. The orchestrator 70 triggers the deadlock detector/diagnosis module to understand the cause for the deadlock.
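
By way of a non-limiting illustration, the learning of a “safe range of time” in the fourth case may be sketched with a simple statistical heuristic. The present application leaves the heuristic or ML technique open; the rule used here (mean latency plus a multiple of the standard deviation), the class LearnedTimeout, and the parameters min_samples and sigmas are assumptions made for the sketch only.

```python
import statistics


class LearnedTimeout:
    """Derives a timeout from observed input-to-output latencies."""

    def __init__(self, min_samples: int = 5, sigmas: float = 3.0):
        self.min_samples = min_samples  # length of the learning phase, in samples
        self.sigmas = sigmas            # tolerated spread above the mean
        self.latencies = []

    def observe(self, input_ts: float, output_ts: float) -> None:
        # Step 1: record how long the module took to respond.
        self.latencies.append(output_ts - input_ts)

    def timeout(self):
        # Step 2: no timeout is set until enough samples have been collected.
        if len(self.latencies) < self.min_samples:
            return None
        mean = statistics.mean(self.latencies)
        spread = statistics.pstdev(self.latencies)
        return mean + self.sigmas * spread

    def is_stalled(self, last_input_ts: float, now: float) -> bool:
        # Step 4: the learned timeout, relative to the last input, has expired.
        limit = self.timeout()
        return limit is not None and (now - last_input_ts) > limit
```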


When the application module is relaunched, depending on the specifications in the application manifest 60, either all unprocessed messages may be forwarded to the relaunched module, or the last “k” unprocessed messages may be forwarded to the relaunched module.
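
By way of a non-limiting illustration, the two forwarding options described above may be sketched with a bounded buffer. The class RetentionBuffer and the parameter last_k are hypothetical; a value of None is assumed to mean that all unprocessed messages are retained, and a value of 0 that nothing is buffered.

```python
from collections import deque


class RetentionBuffer:
    """Holds messages for a module that is down until it is relaunched."""

    def __init__(self, last_k=None):
        # last_k=None keeps every message; last_k=0 keeps none;
        # otherwise only the newest last_k messages are kept.
        self.buffer = deque(maxlen=last_k)

    def hold(self, message) -> None:
        # Called for each message addressed to the module while it is down.
        self.buffer.append(message)

    def drain(self) -> list:
        # Called on relaunch: hand over the retained messages, oldest first.
        messages = list(self.buffer)
        self.buffer.clear()
        return messages
```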


The message broker 30 and the orchestrator 70 may be configured as different components or may be also integrated in one component, so that orchestration 70 and message brokering 30 are carried out as two different sub-components in a single application. The enhanced message broker 30 may also delegate the application progress detection 80 responsibilities to the local module manager 90, so that it functions not as one central component, but rather as a distributed component.


The above explanation of the embodiments describes the present invention in the context of examples. Of course, individual features of the embodiments can be freely combined with each other, provided that this is technically reasonable, without leaving the scope of the present invention.

Claims
  • 1. A method for detecting an application progress and handling an application failure in a distributed system, comprising the following steps: monitoring an interaction between modules of at least one application, the at least one application being deployed across different physical nodes, the interaction being carried out by exchanging messages between the modules using a message broker, the monitoring being carried out at least partially using the message broker; detecting the application progress based on the monitoring; and initiating a failure handling based on the detecting.
  • 2. The method of claim 1, wherein the monitoring is carried out at least partially by the message broker by using a publish-subscribe-mechanism, the message exchange being carried out to provide at least one functionality of the at least one application including a driving functionality for a vehicle.
  • 3. The method of claim 1, wherein the at least one application includes multiple applications, and wherein each of the multiple applications registers with the message broker and specifies messages to be exchanged, the monitoring being carried out based on observing the specified messages.
  • 4. The method of claim 3, wherein the messages to be exchanged include messages to be published or subscribed to.
  • 5. The method of claim 1, wherein an application manifest is provided by each respective application of the at least one application, the monitoring being carried out based on the application manifest, the application manifest being specific for at least one of the following specifications of the respective application: a topology, interactions among the modules, requirements, a list of the modules, a list of messages published and/or subscribed by each of the modules, information about a timing behaviour for the interactions including a time between receiving input and publishing output messages and/or a maximum backlog of input messages before which an output is expected and/or a minimum amount of messages that must be processed successfully, a recovery mechanism to be carried out as part of the failure handling.
  • 6. The method of claim 1, wherein the failure handling includes sending a result of the detecting to a central orchestrator of the distributed system to initiate further actions, the central orchestrator being used at least for spawning and/or terminating and/or suspending and/or migrating the modules by providing commands from the orchestrator to a local module manager, the local module manager being provided for deploying and/or stopping and/or starting at least one of the modules on at least one of the different physical nodes based on the commands and/or for sending information about a status of the modules and resources of the nodes to the orchestrator, the orchestrator providing the commands based on the information.
  • 7. The method of claim 1, wherein the detecting of the application progress includes at least one of the following steps: determining a sequence of messages that are erroneously not processed by at least one of the modules, detecting a failure of at least one of the modules based on the monitoring based on the determining of the sequence of unprocessed messages, backtracking through a sequence of unprocessed messages for a diagnosis of a source of the failure.
  • 8. The method of claim 1, wherein the monitoring includes at least one of the following steps: determining a duration between receiving an input message and publishing an output message, and detecting the application failure when the determined duration exceeds a predefined maximum according to a definition by an application manifest, determining a number of unprocessed messages and detecting the application failure where the determined number exceeds a predefined maximum according to a definition by the application manifest, determining a number of processed messages and detecting the application failure when the determined number falls below a predefined minimum according to a definition by the application manifest.
  • 9. The method of claim 1, wherein a learning phase is provided, and wherein the monitoring includes a recording of the interaction during the learning phase, thereby specifying a timeout value, and detecting the application failure after the learning phase based on the recorded interaction by comparing a duration between receiving an input message and publishing an output message with the specified timeout.
  • 10. A non-transitory computer-readable medium on which is stored a computer program including instructions for detecting an application progress and handling an application failure in a distributed system, the instructions, when executed by a computer, causing the computer to perform the following steps: monitoring an interaction between modules of at least one application, the at least one application being deployed across different physical nodes, the interaction being carried out by exchanging messages between the modules using a message broker, the monitoring being carried out at least partially using the message broker; detecting the application progress based on the monitoring; and initiating a failure handling based on the detecting.
  • 11. A data processing apparatus configured to detect an application progress and handle an application failure in a distributed system, the data processing apparatus configured to: monitor an interaction between modules of at least one application, the at least one application being deployed across different physical nodes, the interaction being carried out by exchanging messages between the modules using a message broker, the monitoring being carried out at least partially using the message broker; detect the application progress based on the monitoring; and initiate a failure handling based on the detecting.
Priority Claims (1)
Number Date Country Kind
10 2023 201 398.3 Feb 2023 DE national