FAULT LOCALIZATION IN A DISTRIBUTED COMPUTING SYSTEM

Information

Patent Application
20230297490

Publication Number
20230297490
Date Filed
March 21, 2022
Date Published
September 21, 2023

Inventors
Original Assignees
- International Business Machines Corporation

CPC
- G06F11/3612 - by runtime analysis
- G06F11/366 - using diagnostics
International Classifications
- G06F11/36

Abstract

Localizing a faulty microservice in a microservice architecture is achieved by developing healthy execution sequence data for comparison to execution sequences during system failures. Oftentimes the faulty microservice does not emit a failure signal. Frequent sub-sequences arising from log template time series data during healthy execution facilitate localization of faulty services when there is no failure signal from the faulty service.
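
The idea of deriving frequent sub-sequences from healthy executions can be illustrated with a minimal sketch. It assumes each healthy request has already been reduced to an ordered list of service names, and it treats a sub-sequence as frequent when it appears contiguously in at least `min_support` healthy sequences. The function name, the `min_support` threshold, and the sample service names are hypothetical and are not taken from the application, which does not specify a mining algorithm here.

```python
from collections import Counter


def frequent_subsequences(sequences, min_support, min_len=2):
    """Return contiguous sub-sequences found in at least `min_support` sequences.

    Illustrative assumption: contiguous sub-sequences counted once per
    healthy sequence; the source does not prescribe this definition.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()
        for start in range(len(seq)):
            for end in range(start + min_len, len(seq) + 1):
                seen.add(tuple(seq[start:end]))
        counts.update(seen)  # each sub-sequence counts once per sequence
    return {sub for sub, count in counts.items() if count >= min_support}


# Hypothetical healthy execution sequences (service names are made up).
healthy = [
    ["gateway", "auth", "cart", "payment"],
    ["gateway", "auth", "cart", "payment"],
    ["gateway", "auth", "catalog"],
]

frequent = frequent_subsequences(healthy, min_support=2)
```

With this sample data, the flow through `cart` and `payment` is retained as frequent, while the single `catalog` request falls below the threshold.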

Claims

1. A computer-implemented method for localizing faults, the method comprising: monitoring runtime execution of an application for an occurrence of a request failure, the application communicating with a plurality of resources within a distributed computing system; building a causal graph using erroneous logs generated during a timeframe including the request failure; identifying real-time execution sequences during the timeframe based on paths from a gateway node to a set of leaf nodes according to the causal graph; establishing a set of frequent execution sub-sequences arising during normal operation of the application based on a log template time series dataset from normal execution logs; and identifying a missing resource of the plurality of resources by analyzing a real-time execution sequence with respect to a matching frequent sub-sequence.
2. The method of claim 1, further comprising: collecting the normal execution logs from the application associated with a plurality of resources; and generating the log template time series dataset from the normal execution logs.
3. The method of claim 1, wherein identifying the missing microservice comprises: comparing the real-time execution sequences to the set of frequent sub-sequences arising during normal execution.
4. The method of claim 1, wherein the resources include microservices.
5. The method of claim 1, wherein the set of frequent sub-sequences are labeled individually with a corresponding type of execution flow.
6. The method of claim 1, wherein the missing resource is a faulty microservice.
7. A computer program product comprising a computer-readable storage medium having a set of instructions stored therein which, when executed by a processor, causes the processor to localize faults by: monitoring runtime execution of an application for an occurrence of a request failure, the application communicating with a plurality of resources within a distributed computing system; building a causal graph using erroneous logs generated during a timeframe including the request failure; identifying real-time execution sequences during the timeframe based on paths from a gateway node to a set of leaf nodes according to the causal graph; establishing a set of frequent execution sub-sequences arising during normal operation of the application based on a log template time series dataset from normal execution logs; and identifying a missing resource of the plurality of resources by analyzing a real-time execution sequence with respect to a matching frequent sub-sequence.
8. The computer program product of claim 7, further causing the processor set to localize faults by: collecting the normal execution logs from the application associated with a plurality of resources; and generating the log template time series dataset from the normal execution logs.
9. The computer program product of claim 7, wherein identifying the missing microservice comprises: comparing the real-time execution sequences to the set of frequent sub-sequences arising during normal execution.
10. The computer program product of claim 7, wherein the resources include microservices.
11. The computer program product of claim 7, wherein the set of frequent sub-sequences are labeled individually with a corresponding type of execution flow.
12. The computer program product of claim 7, wherein the missing resource is a faulty microservice.
13. A computer system for localizing faults, the computer system comprising: a processor set; and a computer readable storage medium having program instructions stored therein; wherein: the processor set executes the program instructions that cause the processor set to localize faults by: monitoring runtime execution of an application for an occurrence of a request failure, the application communicating with a plurality of resources within a distributed computing system; building a causal graph using erroneous logs generated during a timeframe including the request failure; identifying real-time execution sequences during the timeframe based on paths from a gateway node to a set of leaf nodes according to the causal graph; establishing a set of frequent execution sub-sequences arising during normal operation of the application based on a log template time series dataset from normal execution logs; and identifying a missing resource of the plurality of resources by analyzing a real-time execution sequence with respect to a matching frequent sub-sequence.
14. The computer system of claim 13, further causing the processor set to localize faults by: collecting the normal execution logs from the application associated with a plurality of resources; and generating the log template time series dataset from the normal execution logs.
15. The computer system of claim 13, wherein identifying the missing microservice comprises: comparing the real-time execution sequences to the set of frequent sub-sequences arising during normal execution.
16. The computer system of claim 13, wherein the resources include microservices.
17. The computer system of claim 13, wherein the set of frequent sub-sequences are labeled individually with a corresponding type of execution flow.
18. The computer system of claim 13, wherein the missing resource is a faulty microservice.
19. A computer-implemented method comprising: determining a system fault has occurred in a computing system; mining normal execution sequences collected during normal operation of the computing system; building a causal graph using erroneous logs generated during the occurrence of the system fault; selecting real-time sequences from the causal graph; and identifying a missing resource in the real-time sequences by comparing the normal execution sequences to the real-time sequences.
20. The method of claim 19, further comprising: generating a log-template timeseries dataset from the normal execution sequences; and identifying a set of frequently-arising sub-sequences based on the log-template timeseries dataset; wherein the step of comparing the normal execution sequences is performed by comparing the set of frequently-arising sub-sequences with the real-time sequences.
21. The method of claim 19, further comprising: labeling the frequently-arising sub-sequences individually based on a type of execution flow represented by the sub-sequence.
22. The method of claim 19, wherein the step of mining normal execution sequences is performed automatically in response to detecting the system fault.
23. The method of claim 19, further comprising: adding the missing resource to a localization set for system fault resolution.
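
The steps recited in the claims (reading real-time execution sequences off a causal graph as gateway-to-leaf paths, then comparing them with frequent healthy sub-sequences) can be sketched as follows. This is an illustrative sketch only. It assumes the causal graph is an adjacency mapping of service names, and it interprets "missing resource" as the service that a frequent healthy sub-sequence expects immediately after an observed path that matches its prefix. That prefix-matching rule, the function names, and the sample services are assumptions, not the matching procedure defined by the application.

```python
def gateway_to_leaf_paths(graph, gateway):
    """Enumerate paths from the gateway node to leaf nodes (iterative DFS)."""
    paths = []
    stack = [[gateway]]
    while stack:
        path = stack.pop()
        # Skip nodes already on the path so a cyclic graph cannot loop forever.
        next_nodes = [n for n in graph.get(path[-1], []) if n not in path]
        if not next_nodes:
            paths.append(path)  # no onward edge: treat as a leaf
        for node in next_nodes:
            stack.append(path + [node])
    return paths


def missing_resources(observed, frequent):
    """Services expected right after `observed` by a matching healthy sub-sequence.

    Assumption: a frequent sub-sequence "matches" when the observed path is
    a proper prefix of it.
    """
    n = len(observed)
    expected = {
        sub[n]
        for sub in frequent
        if len(sub) > n and list(sub[:n]) == list(observed)
    }
    return sorted(expected)


# Hypothetical causal graph built from erroneous logs: the trail stops at "cart".
causal_graph = {"gateway": ["auth"], "auth": ["cart"], "cart": []}

# Hypothetical frequent sub-sequences mined from normal execution logs.
frequent = [
    ("gateway", "auth", "cart", "payment"),
    ("gateway", "auth", "catalog"),
]

paths = gateway_to_leaf_paths(causal_graph, "gateway")
localization_set = {tuple(p): missing_resources(p, frequent) for p in paths}
```

In this sample, the observed path ends at `cart` while the healthy flow continues to `payment`, so `payment` is placed in the localization set even though it emitted no failure signal.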
