FAULT LOCALIZATION IN A DISTRIBUTED COMPUTING SYSTEM

Information

  • Patent Application
    20230297490
  • Publication Number
    20230297490
  • Date Filed
    March 21, 2022
  • Date Published
    September 21, 2023
Abstract
Localizing a faulty microservice in a microservice architecture is achieved by developing healthy execution sequence data for comparison to execution sequences during system failures. Oftentimes the faulty microservice does not emit a failure signal. Frequent sub-sequences arising from log template time series data during healthy execution facilitate localization of faulty services when there is no failure signal from the faulty service.
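The idea of deriving frequent sub-sequences from healthy executions can be illustrated with a minimal sketch. The application does not name a specific mining algorithm, so the contiguous n-gram counting used here is an assumption, as are the function name `mine_frequent_subsequences`, its parameters, and the service names in the sample data.

```python
from collections import Counter


def mine_frequent_subsequences(sequences, min_support=2, min_len=2, max_len=4):
    """Return contiguous sub-sequences that occur in at least `min_support`
    of the given healthy execution sequences.

    Hypothetical helper: the mining strategy is an illustrative assumption,
    not the method described in the application.
    """
    counts = Counter()
    for seq in sequences:
        seen = set()  # count each sub-sequence at most once per sequence
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        counts.update(seen)
    return {sub for sub, count in counts.items() if count >= min_support}


# Hypothetical healthy execution sequences (one list of services per request).
healthy_sequences = [
    ["gateway", "auth", "cart", "payment"],
    ["gateway", "auth", "cart", "payment"],
    ["gateway", "catalog"],
]

frequent = mine_frequent_subsequences(healthy_sequences)
```

With `min_support=2`, the full checkout flow is retained because it occurs in two sequences, while the single catalog request is discarded. A production system would more likely derive these sequences from log template time series rather than from hand-written lists.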
Claims
  • 1. A computer-implemented method for localizing faults, the method comprising: monitoring runtime execution of an application for an occurrence of a request failure, the application communicating with a plurality of resources within a distributed computing system; building a causal graph using erroneous logs generated during a timeframe including the request failure; identifying real-time execution sequences during the timeframe based on paths from a gateway node to a set of leaf nodes according to the causal graph; establishing a set of frequent execution sub-sequences arising during normal operation of the application based on a log template time series dataset from normal execution logs; and identifying a missing resource of the plurality of resources by analyzing a real-time execution sequence with respect to a matching frequent sub-sequence.
  • 2. The method of claim 1, further comprising: collecting the normal execution logs from the application associated with a plurality of resources; and generating the log template time series dataset from the normal execution logs.
  • 3. The method of claim 1, wherein identifying the missing microservice includes: comparing the real-time execution sequences to the set of frequent sub-sequences arising during normal execution.
  • 4. The method of claim 1, wherein the resources include microservices.
  • 5. The method of claim 1, wherein the set of frequent sub-sequences are labeled individually with a corresponding type of execution flow.
  • 6. The method of claim 1, wherein the missing resource is a faulty microservice.
  • 7. A computer program product comprising a computer-readable storage medium having a set of instructions stored therein which, when executed by a processor, causes the processor to localize faults by: monitoring runtime execution of an application for an occurrence of a request failure, the application communicating with a plurality of resources within a distributed computing system; building a causal graph using erroneous logs generated during a timeframe including the request failure; identifying real-time execution sequences during the timeframe based on paths from a gateway node to a set of leaf nodes according to the causal graph; establishing a set of frequent execution sub-sequences arising during normal operation of the application based on a log template time series dataset from normal execution logs; and identifying a missing resource of the plurality of resources by analyzing a real-time execution sequence with respect to a matching frequent sub-sequence.
  • 8. The computer program product of claim 7, further causing the processor set to localize faults by: collecting the normal execution logs from the application associated with a plurality of resources; and generating the log template time series dataset from the normal execution logs.
  • 9. The computer program product of claim 7, wherein identifying the missing microservice includes: comparing the real-time execution sequences to the set of frequent sub-sequences arising during normal execution.
  • 10. The computer program product of claim 7, wherein the resources include microservices.
  • 11. The computer program product of claim 7, wherein the set of frequent sub-sequences are labeled individually with a corresponding type of execution flow.
  • 12. The computer program product of claim 7, wherein the missing resource is a faulty microservice.
  • 13. A computer system for localizing faults, the computer system comprising: a processor set; and a computer readable storage medium having program instructions stored therein; wherein: the processor set executes the program instructions that cause the processor set to localize faults by: monitoring runtime execution of an application for an occurrence of a request failure, the application communicating with a plurality of resources within a distributed computing system; building a causal graph using erroneous logs generated during a timeframe including the request failure; identifying real-time execution sequences during the timeframe based on paths from a gateway node to a set of leaf nodes according to the causal graph; establishing a set of frequent execution sub-sequences arising during normal operation of the application based on a log template time series dataset from normal execution logs; and identifying a missing resource of the plurality of resources by analyzing a real-time execution sequence with respect to a matching frequent sub-sequence.
  • 14. The computer system of claim 13, further causing the processor set to localize faults by: collecting the normal execution logs from the application associated with a plurality of resources; and generating the log template time series dataset from the normal execution logs.
  • 15. The computer system of claim 13, wherein identifying the missing microservice includes: comparing the real-time execution sequences to the set of frequent sub-sequences arising during normal execution.
  • 16. The computer system of claim 13, wherein the resources include microservices.
  • 17. The computer system of claim 13, wherein the set of frequent sub-sequences are labeled individually with a corresponding type of execution flow.
  • 18. The computer system of claim 13, wherein the missing resource is a faulty microservice.
  • 19. A computer-implemented method comprising: determining a system fault has occurred in a computing system; mining normal execution sequences collected during normal operation of the computing system; building a causal graph using erroneous logs generated during the occurrence of the system fault; selecting real-time sequences from the causal graph; and identifying a missing resource in the real-time sequences by comparing the normal execution sequences to the real-time sequences.
  • 20. The method of claim 19, further comprising: generating a log-template timeseries dataset from the normal execution sequences; and identifying a set of frequently-arising sub-sequences based on the log-template timeseries dataset; wherein the step of comparing the normal execution sequences is performed by comparing the set of frequently-arising sub-sequences with the real-time sequences.
  • 21. The method of claim 19, further comprising: labeling the frequently-arising sub-sequences individually based on a type of execution flow represented by the sub-sequence.
  • 22. The method of claim 19, wherein the step of mining normal execution sequences is performed automatically in response to detecting the system fault.
  • 23. The method of claim 19, further comprising: adding the missing resource to a localization set for system fault resolution.
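The steps recited in claims 1, 19, and 23 (enumerating gateway-to-leaf paths in a causal graph, comparing each path to a matching frequent sub-sequence, and adding the missing resource to a localization set) can be sketched as follows. This is an illustrative sketch only: the claims do not specify how the causal graph is constructed or how a "matching" sub-sequence is chosen, so the adjacency-dictionary graph, the ordered-subset matching rule, the shortest-match tie-break, and all function and service names are assumptions.

```python
def gateway_to_leaf_paths(graph, gateway):
    """Enumerate every path from `gateway` to a leaf of the causal graph.

    `graph` maps a node to a list of child nodes (assumed representation).
    A node whose children are all already on the path is treated as a leaf,
    which also guards against cycles.
    """
    paths = []
    stack = [(gateway, [gateway])]
    while stack:
        node, path = stack.pop()
        children = [c for c in graph.get(node, []) if c not in path]
        if not children:
            paths.append(path)
        for child in children:
            stack.append((child, path + [child]))
    return paths


def is_ordered_subset(observed, reference):
    """True if `observed` appears within `reference` in the same order."""
    remaining = iter(reference)
    return all(item in remaining for item in observed)


def find_missing_resources(observed, frequent_subsequences):
    """Return resources present in the matching healthy sub-sequence but
    absent from the observed real-time sequence.

    Assumed matching rule: the shortest frequent sub-sequence that strictly
    contains the observed sequence as an ordered subset.
    """
    candidates = [
        ref for ref in frequent_subsequences
        if len(ref) > len(observed) and is_ordered_subset(observed, ref)
    ]
    if not candidates:
        return []
    best = min(candidates, key=len)
    return [resource for resource in best if resource not in observed]


# Hypothetical causal graph built from erroneous logs: "cart" emitted no
# error logs, so it is absent from the graph.
causal_graph = {"gateway": ["auth"], "auth": ["payment"]}

# Hypothetical frequent sub-sequences mined from healthy execution.
frequent_subsequences = [
    ("gateway", "auth", "cart", "payment"),
    ("gateway", "catalog"),
]

paths = gateway_to_leaf_paths(causal_graph, "gateway")
localization_set = set()
for path in paths:
    localization_set.update(find_missing_resources(path, frequent_subsequences))
```

In this example the only real-time path is gateway, auth, payment. It matches the healthy checkout flow with `cart` skipped, so `cart` is added to the localization set even though it produced no failure signal of its own.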