In some instances, applications may rely on a technology infrastructure to ensure their operation. Accordingly, it may be important to create a reliable technology infrastructure and minimize the occurrence of any corresponding failures/outages. In some instances, a current system performance may be analyzed to identify a likelihood of failure. In some instances, however, a series of previous events occurring in a time series leading up to a current time may be relevant to the analysis. In failing to consider such information, an accuracy of failure detection may be reduced. Accordingly, it may be important to improve the process of preemptive failure detection to prevent system failures and/or outages.
Aspects of the disclosure provide effective, efficient, scalable, and convenient technical solutions that address and overcome the technical problems associated with system failure prediction and prevention. In accordance with one or more embodiments of the disclosure, a computing platform comprising at least one processor, a communication interface, and memory storing computer-readable instructions may configure a rules-based state machine to predict system failure for a system based on telemetry state images and transitions between the telemetry state images. The computing platform may receive initial telemetry data. The computing platform may generate, based on the initial telemetry data, an initial telemetry state image. The computing platform may receive additional telemetry data. The computing platform may generate, based on the additional telemetry data, an additional telemetry state image. The computing platform may compare a pattern, corresponding to the initial telemetry state image, the additional telemetry state image, and a transition between the initial telemetry state image and the additional telemetry state image, to the telemetry state images and the transitions between the telemetry state images of the rules-based state machine to identify a matching pattern. The computing platform may identify, using the identified matching pattern, a likelihood of failure for the system. The computing platform may send, based on the likelihood of failure for the system, one or more preemptive resolution commands causing modification of operations at the system to prevent a predicted failure.
In one or more instances, configuring the rules-based state machine may include: 1) receiving historical telemetry data; 2) normalizing the historical telemetry data; 3) generating, based on the historical telemetry data, the telemetry state images; 4) identifying the transitions between the telemetry state images; and 5) labeling historical patterns corresponding to the telemetry state images and the transitions between the telemetry state images based on detected failures. In one or more instances, generating, based on the initial telemetry data, the initial telemetry state image may include: 1) normalizing the initial telemetry data, and 2) generating the initial telemetry state image based on the normalized initial telemetry data. Generating, based on the additional telemetry data, the additional telemetry state image may include: 1) normalizing the additional telemetry data, and 2) generating the additional telemetry state image based on the normalized additional telemetry data.
In one or more examples, comparing the pattern to the telemetry state images and the transitions between the telemetry state images of the rules-based state machine to identify the matching pattern may include using an image matching model to: 1) identify a match between the initial telemetry state image and a first image of the telemetry state images, and 2) identify a match between the additional telemetry state image and a second image of the telemetry state images, where the second image of the telemetry state images may be linked to the first image of the telemetry state images within the rules-based state machine, and where a transition between the initial telemetry state image and the additional telemetry state image may match a transition between the first image and the second image.
In one or more instances, identifying, using the identified matching pattern, the likelihood of failure for the system may include identifying a likelihood of failure of the matching pattern, where the matching pattern may be labelled based on the likelihood of failure of the matching pattern. In one or more instances, the computing platform may compare the likelihood of failure of the matching pattern to a failure threshold, and sending the one or more preemptive resolution commands causing modification of the operations at the system to prevent the predicted failure may be in response to identifying that the likelihood of failure of the matching pattern meets or exceeds the failure threshold.
In one or more examples, sending the one or more preemptive resolution commands may include directing a load management server associated with the system to redirect incoming requests away from the system. In one or more examples, sending the one or more preemptive resolution commands may include directing a user device to display a recommended solution to avoid the predicted failure along with a prompt for whether or not the recommended solution should be executed.
In one or more instances, the computing platform may receive user input accepting the recommended solution. The computing platform may execute, in response to receiving the user input, the recommended solution.
In one or more examples, the computing platform may receive third telemetry data. The computing platform may generate, based on the third telemetry data, a third telemetry state image. The computing platform may compare an updated pattern, corresponding to the initial telemetry state image, the additional telemetry state image, the transition between the initial telemetry state image and the additional telemetry state image, the third telemetry state image, and a transition between the additional telemetry state image and the third telemetry state image, to the telemetry state images and the transitions of the rules-based state machine to identify an updated matching pattern. The computing platform may identify, using the identified updated matching pattern, a new likelihood of failure for the system, which may be different than the likelihood of failure.
The present disclosure is illustrated by way of example and is not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
In the following description of various illustrative embodiments, reference is made to the accompanying drawings, which form a part hereof, and in which is shown, by way of illustration, various embodiments in which aspects of the disclosure may be practiced. In some instances other embodiments may be utilized, and structural and functional modifications may be made, without departing from the scope of the present disclosure.
It is noted that various connections between elements are discussed in the following description. It is noted that these connections are general and, unless specified otherwise, may be direct or indirect, wired or wireless, and that the specification is not intended to be limiting in this respect.
The following description relates to a system and method for multi image matching for outage prediction, prevention, and mitigation for technology infrastructure using a rules-based state machine, as is described further below. Predicting and preventing an outage for a technology infrastructure may be key to making sure that the backbone of customer-facing and employee-facing applications runs smoothly and avoids downtime. Outage prediction may involve not only looking at the current status of an overall system, but also evaluating a series of other events that might have led to the current status. To predict whether the current status is safe or may lead to an unsafe condition that results in an outage, the series of system statuses at various time intervals should be taken into consideration. Accordingly, described herein is the use of a state machine configured to analyze images representing heatmaps corresponding to current system status.
Thermal images may capture the overall wellness and capacity of an infrastructure system. The thermal image may be created by starting with a table of raw telemetry data. This data may be normalized so that each cell value becomes a floating point number between zero and one. The resulting matrix may be a normalized image. Examples of this normalized image may be displayed by appropriate thresholding, where a color is associated with each of the threshold ranges. Some examples of these normalized images are shown in normalized image 700, which is shown in
These normalized images represent the overall health of the system and may be directly attributed and linked to any events, incidents, and/or alerts generated. For example, normalized image 700 and normalized image 800 show two separate images of the overall system status at two different times. The heatmap or thermal images of different times may be considered to predict any potential outages, so that steps may be taken to mitigate or prevent potential outages. For example, diagram 900 of
In order to distinguish different series of patterns from one another, a rules-based state machine (as depicted, for example, in diagram 1000 of
A simple state machine is depicted in diagram 1200 of
These and other features are described in greater detail below.
Outage prediction and remediation platform 102 may include one or more computing devices and/or other computer components (e.g., processors, memories, communication interfaces, or the like). For example, the outage prediction and remediation platform 102 may be configured to generate, update, and/or otherwise maintain a state machine that includes a plurality of state machine images and the corresponding transitions between each of the plurality of state machine images. In some instances, the state machine may further include labels corresponding to a likelihood of failure for a given state machine image based on any linked images and the corresponding transitions. In some instances, the outage prediction and remediation platform 102 may be configured to perform image matching using the state machine to identify matching patterns of state images and their corresponding transitions over time. Based on the identified matching patterns, the outage prediction and remediation platform 102 may be configured to trigger preemptive resolution actions to avoid any predicted failures.
Telemetry information source 103 may be or include one or more computing devices (e.g., servers, server blades, or the like) and/or other computer components (e.g., processors, memories, communication interfaces, and/or other components). In some instances, the telemetry information source 103 may be configured to monitor a plurality of individual systems to collect the corresponding telemetry data. In other instances, the telemetry information source 103 may be the source of the telemetry data itself (e.g., producing the telemetry data). Although a single telemetry information source 103 is shown, any number of telemetry information sources 103 may be included in the system architecture without departing from the scope of the disclosure.
User device 104 may be or include one or more devices (e.g., laptop computers, desktop computers, smartphones, tablets, and/or other devices) configured for use in receiving preemptive resolution information from the outage prediction and remediation platform 102. In some instances, the user device 104 may be configured to display graphical user interfaces (e.g., preemptive resolution information, or the like). Any number of such user devices may be used to implement the techniques described herein without departing from the scope of the disclosure.
Computing environment 100 also may include one or more networks, which may interconnect outage prediction and remediation platform 102, telemetry information source 103, and user device 104. For example, computing environment 100 may include a network 101 (which may interconnect, e.g., outage prediction and remediation platform 102, telemetry information source 103, and user device 104).
In one or more arrangements, outage prediction and remediation platform 102, telemetry information source 103, and user device 104 may be any type of computing device capable of receiving a user interface, receiving input via the user interface, and communicating the received input to one or more other computing devices. For example, outage prediction and remediation platform 102, telemetry information source 103, user device 104, and/or the other systems included in computing environment 100 may, in some instances, be and/or include server computers, desktop computers, laptop computers, tablet computers, smart phones, or the like that may include one or more processors, memories, communication interfaces, storage devices, and/or other components. As noted above, and as illustrated in greater detail below, any and/or all of outage prediction and remediation platform 102, telemetry information source 103, and user device 104 may, in some instances, be special-purpose computing devices configured to perform specific functions.
Referring to
As a particular example, the state machine may be represented by the diagram 1200 in
Diagram 1300 depicts another example of such a state machine. For example, the state machine may be configured to identify a likelihood of failure and/or warning level (e.g., normal, watch, warning, medium alert, red alert, or the like) based on a progression of patterns between the telemetry state images and the corresponding transitions. For example, as shown in
Accordingly, the state machine may be configured to perform image comparison to the stored patterns, as well as the transitions between such patterns, to predict a likelihood of failure. In some instances, the state machine may have labels associated with a warning level (e.g., normal, watch, warning, medium alert, red alert, or the like, which may, e.g., be progressive in their corresponding likelihoods of failure), a likelihood of failure score (e.g., a score between zero and one hundred, with zero being the lowest likelihood of failure and one hundred being the highest likelihood of failure), a color (e.g., green, yellow, red, or the like), and/or other indicator of a likelihood of failure. In some instances, these labels may be configured, input, and/or otherwise determined manually, semi-automatically, and/or automatically by the outage prediction and remediation platform 102.
In doing so, the outage prediction and remediation platform 102 may configure the state machine to consider both the current state of a system, based on telemetry data, and the transition of that state over time. For example, a given state may be more concerning when it occurs after a first state than after a second state, or the like.
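As an illustration only, the idea that the same state may carry a different label depending on the state that preceded it can be sketched as follows. The class name, the state identifiers ("A", "B", "C"), and the choice to key labels on a (previous state, current state) pair are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class StateMachine:
    """Hypothetical container: each state ID stands in for a stored telemetry state image."""

    # Maps (previous_state, current_state) to a warning label.
    transitions: dict = field(default_factory=dict)

    def add_transition(self, src, dst, label):
        # Record that moving from src to dst is a known, labelled pattern.
        self.transitions[(src, dst)] = label

    def label_for(self, src, dst):
        # Returns None when the progression is not a known pattern.
        return self.transitions.get((src, dst))


machine = StateMachine()
# The same destination state "C" is labelled differently depending on its predecessor.
machine.add_transition("A", "C", "watch")
machine.add_transition("B", "C", "red alert")
```

In this sketch, reaching state "C" from "A" is labelled "watch", while reaching it from "B" is labelled "red alert", which mirrors the example in the preceding paragraph.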
With further reference to
At step 203, the telemetry information source 103 may send initial telemetry data to the outage prediction and remediation platform 102. For example, the telemetry information source 103 may send time stamps, dates, system names, central processing unit (CPU) information, memory information, and/or other telemetry information corresponding to performance of a plurality of systems (and/or the telemetry information source 103 itself). In some instances, the telemetry information source 103 may send the initial telemetry data while the first wireless data connection is established.
At step 204, the outage prediction and remediation platform 102 may receive the initial telemetry data sent at step 203. For example, the outage prediction and remediation platform 102 may receive the initial telemetry data via the communication interface 113 and while the first wireless data connection is established.
At step 205, the outage prediction and remediation platform 102 may normalize the initial telemetry data received at step 204. For example, the outage prediction and remediation platform 102 may convert the initial telemetry data (which may, e.g., include values of different sizes, ranges, or the like) to values between zero and one. In doing so, the outage prediction and remediation platform 102 may configure the initial telemetry data for representation as an initial telemetry state image.
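The disclosure does not specify a particular normalization method. A minimal sketch, assuming per-column min-max scaling of a rectangular table of numeric telemetry values (the function name and sample values are hypothetical), might look like this:

```python
def normalize(table):
    """Min-max scale each column of a rectangular numeric table to [0.0, 1.0]."""
    columns = list(zip(*table))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    # Transpose back so rows again correspond to the original records.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical raw telemetry: one row per system, columns are CPU % and memory in MB.
raw = [
    [20.0, 4096.0],
    [60.0, 8192.0],
    [100.0, 6144.0],
]
image = normalize(raw)
```

Scaling each column independently keeps metrics with very different ranges (percentages versus megabytes) comparable within the resulting matrix. Other schemes, such as scaling against fixed capacity limits, would serve the same purpose.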
Referring to
In some instances, in generating the initial telemetry state image, the outage prediction and remediation platform 102 may apply one or more thresholding techniques. As a simple example, the outage prediction and remediation platform 102 may use green to represent any values from 0-3 (inclusive), yellow to represent any values from 3.1-6 (inclusive), and red to represent any values from 6.1-10 (inclusive). Any number of colors and/or threshold ranges may be implemented without departing from the scope of the disclosure.
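The simple thresholding example above can be sketched as follows. The function names are hypothetical, and the sketch assumes that values falling between the stated ranges (for example, 3.05) belong to the next higher band.

```python
def color_for(value):
    """Map a value on the 0-10 example scale to a color band."""
    if not 0 <= value <= 10:
        raise ValueError("value must be between 0 and 10 inclusive")
    if value <= 3:
        return "green"
    if value <= 6:
        # Assumption: anything above 3 and up to 6 is yellow, including 3.0 < v < 3.1.
        return "yellow"
    return "red"


def colorize(image):
    """Apply the color thresholds to every cell of a matrix."""
    return [[color_for(v) for v in row] for row in image]
```

For instance, `colorize([[0, 5], [10, 3]])` yields a green and a yellow cell in the first row and a red and a green cell in the second.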
At step 207, the outage prediction and remediation platform 102 may use one or more image matching techniques to identify a telemetry state image in the state machine that matches the initial telemetry state image. In some instances, the outage prediction and remediation platform 102 may identify an exact match. In other instances, the outage prediction and remediation platform 102 may identify a threshold match (e.g., at least a threshold level match). In some instances, the outage prediction and remediation platform 102 may identify a likelihood of failure and/or warning level corresponding to the matching image in the state machine, and may output an indication and/or take actions accordingly.
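The disclosure leaves the image matching technique open. One way to illustrate exact and threshold matching, assuming a similarity score defined as one minus the mean absolute difference between normalized cells (a stand-in metric chosen for this sketch, along with the hypothetical names and the 0.9 default threshold), is:

```python
def similarity(a, b):
    """Return 1.0 minus the mean absolute cell difference of two images."""
    cells_a = [v for row in a for v in row]
    cells_b = [v for row in b for v in row]
    if not cells_a or len(cells_a) != len(cells_b):
        raise ValueError("images must be non-empty and have the same number of cells")
    diff = sum(abs(x - y) for x, y in zip(cells_a, cells_b)) / len(cells_a)
    return 1.0 - diff


def best_match(image, stored, threshold=0.9):
    """Return (state_id, score) for the closest stored image, or None if below threshold."""
    best_id, best_score = None, -1.0
    for state_id, candidate in stored.items():
        score = similarity(image, candidate)
        if score > best_score:
            best_id, best_score = state_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score


# Hypothetical stored state images, keyed by state ID.
stored = {
    "calm": [[0.1, 0.1], [0.2, 0.1]],
    "hot": [[0.9, 0.8], [0.9, 1.0]],
}
observed = [[0.85, 0.8], [0.9, 0.95]]
match = best_match(observed, stored)
```

An exact match corresponds to a score of 1.0, while a threshold match is any score at or above the chosen threshold. An observed image that resembles no stored image closely enough returns `None`.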
At step 208, the telemetry information source 103 may send additional telemetry data to the outage prediction and remediation platform 102. For example, the telemetry information source 103 may send telemetry data similar to the telemetry data sent at step 203, but which may correspond to a later time. In some instances, the telemetry information source 103 may send the additional telemetry data to the outage prediction and remediation platform 102 while the first wireless data connection is established.
At step 209, the outage prediction and remediation platform 102 may receive the additional telemetry data sent at step 208. For example, the outage prediction and remediation platform 102 may receive the additional telemetry data from the telemetry information source 103 via the communication interface 113 and while the first wireless data connection is established.
At step 210, the outage prediction and remediation platform 102 may normalize the additional telemetry data. For example, the outage prediction and remediation platform 102 may perform actions similar to those described above at step 205 with regard to the initial telemetry data.
Referring to
At step 212, the outage prediction and remediation platform 102 may identify a matching image for the additional telemetry state image using the state machine. For example, the outage prediction and remediation platform 102 may perform actions similar to those described above at step 207 with regard to identifying a matching image for the initial telemetry state image. In some instances, in identifying the matching image for the additional telemetry state image, the outage prediction and remediation platform 102 may identify a matching pattern, corresponding to the initial telemetry state image, the additional telemetry state image, and the transition between them. For example, in referring to diagram 1200 of
At step 213, the outage prediction and remediation platform 102 may identify, using the state machine, a likelihood of failure and/or warning. For example, the outage prediction and remediation platform 102 may identify a likelihood of failure and/or warning that corresponds to a progression from the initial telemetry state image to the additional telemetry state image. For example, the state machine may have been pre-configured (e.g., at step 201) with the likelihood of failure and/or warning corresponding to these patterns and the corresponding transition. In some instances the outage prediction and remediation platform 102 may identify a numeric score representing a likelihood of failure. Additionally or alternatively, the outage prediction and remediation platform 102 may identify a warning level, indicating a severity and/or imminence of failure.
In identifying the likelihood of failure, the outage prediction and remediation platform 102 may identify a likelihood of failure corresponding to the additional telemetry state image, when taking into account the progression from the initial telemetry state image to the additional telemetry state image. For example, the likelihood of failure of the additional telemetry state image may vary depending on the progression of images leading up to it.
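To illustrate how the likelihood of failure for the latest image may depend on the progression leading up to it, the sketch below stores labelled patterns as sequences of matched state IDs and looks up the longest labelled pattern that ends the observed history. The state names, the scores on a zero to one hundred scale, and the longest-suffix lookup rule are all assumptions made for illustration.

```python
def likelihood_for(history, patterns):
    """Return the score of the longest labelled pattern that ends the history, or None."""
    # Try the full history first, then progressively shorter suffixes.
    for start in range(len(history)):
        key = tuple(history[start:])
        if key in patterns:
            return patterns[key]
    return None


# Hypothetical labelled patterns: sequence of matched states -> likelihood of failure (0-100).
patterns = {
    ("normal", "elevated"): 20,
    ("degraded", "elevated"): 70,
    ("normal", "degraded", "elevated"): 85,
}
```

Here the same final state, "elevated", scores 20 when it follows "normal" and 70 when it follows "degraded", and a longer matching progression yields a further refined score of 85.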
At step 214, the outage prediction and remediation platform 102 may compare the likelihood of failure to one or more failure thresholds. In some instances, the failure thresholds may represent numeric values (e.g., against which numeric representations of the likelihood of failure may be compared), warning thresholds (e.g., a particular warning label in a series of warning labels, increasing in severity, against which such likelihood of failure warning labels may be compared), and/or otherwise. In some instances, if the outage prediction and remediation platform 102 identifies that the likelihood of failure meets or exceeds the threshold, the outage prediction and remediation platform 102 may proceed to step 215. Otherwise, if the outage prediction and remediation platform 102 identifies that the likelihood of failure does not meet or exceed the threshold, the outage prediction and remediation platform 102 may proceed to step 219.
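The comparison at step 214 can be sketched for both numeric scores and ordered warning labels. The ordering of the labels follows the example list given earlier in the description. The function names, and the rule that both arguments must be of the same kind, are assumptions for this sketch.

```python
# Warning labels in increasing order of severity, per the example labels in the description.
WARNING_LEVELS = ["normal", "watch", "warning", "medium alert", "red alert"]


def meets_threshold(likelihood, threshold):
    """True when the likelihood meets or exceeds the threshold.

    Both arguments must be numeric scores, or both must be warning labels.
    """
    if isinstance(likelihood, str) and isinstance(threshold, str):
        return WARNING_LEVELS.index(likelihood) >= WARNING_LEVELS.index(threshold)
    if isinstance(likelihood, str) or isinstance(threshold, str):
        raise TypeError("cannot compare a warning label with a numeric score")
    return likelihood >= threshold


def next_step(likelihood, threshold):
    """Return the step number to proceed to after the comparison at step 214."""
    return 215 if meets_threshold(likelihood, threshold) else 219
```

A likelihood equal to the threshold is treated as meeting it, so `next_step(70, 70)` proceeds to step 215, while `next_step("watch", "warning")` proceeds to step 219.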
Referring to
At step 216, the outage prediction and remediation platform 102 may send one or more preemptive resolution commands to the user device 104. For example, the outage prediction and remediation platform 102 may, in some instances, identify, using information stored in the state machine and corresponding to the telemetry state machine images identified as matching the initial telemetry state machine image, additional telemetry state machine images, and the corresponding transitions, one or more actions used to resolve the failure (which, in the example of the telemetry state machine images of the state machine may have actually occurred, but may, in the example of the initial/additional telemetry state machine images be predicted to occur). Accordingly, the outage prediction and remediation platform 102 may effectively identify, based on previously performed corrective actions for a given failure, actions that may be performed to preemptively avoid the failure (which may, e.g., be predicted to occur).
In some instances, the outage prediction and remediation platform 102 may identify a confidence level corresponding to the likelihood of failure. In some instances, this may be based on a matching level identified by the outage prediction and remediation platform 102 corresponding to the initial/additional telemetry state machine images and the telemetry state machine images stored in the state machine. Additionally or alternatively, this may be based on a confidence that the identified remediation action will preemptively avoid the predicted failure.
In some instances, the outage prediction and remediation platform 102 may identify that the confidence level fails to meet or exceed a first confidence threshold. In these instances, the outage prediction and remediation platform 102 may send a graphical user interface similar to graphical user interface 400, which is shown in
In some instances, the outage prediction and remediation platform 102 may identify that the confidence level meets or exceeds the first confidence threshold, but fails to meet or exceed a second confidence threshold (which may be higher than the first confidence threshold). In these instances, the outage prediction and remediation platform 102 may send a graphical user interface similar to graphical user interface 500, which is shown in
In some instances, the outage prediction and remediation platform 102 may identify that the confidence level meets or exceeds the second confidence threshold. In these instances, the outage prediction and remediation platform 102 may send a graphical user interface similar to graphical user interface 600, which is shown in
At step 217, the user device 104 may receive the preemptive resolution commands sent at step 216. For example, the user device 104 may receive the preemptive resolution commands while the second wireless data connection is established.
At step 218, based on or in response to the one or more preemptive resolution commands, the user device 104 may display a pre-emptive resolution interface (e.g., similar to graphical user interface 400 of
At step 219, the outage prediction and remediation platform 102 may update the state machine based on the initial telemetry state image, the additional telemetry state image, the corresponding transition, an identified likelihood of failure, an identified remediating action, and/or other information. In doing so, the outage prediction and remediation platform 102 may continue to refine the state machine using a dynamic feedback loop, which may, e.g., increase the accuracy and effectiveness of the state machine in predicting and remediating potential system failures.
For example, the outage prediction and remediation platform 102 may use the initial telemetry state image, the additional telemetry state image, the corresponding transition, an identified likelihood of failure, an identified remediating action, and/or other information to reinforce, modify, and/or otherwise update the state machine, thus causing the state machine to continuously improve (e.g., in terms of predicting and remediating system failures).
In some instances, the outage prediction and remediation platform 102 may continuously refine any and/or all of the state machine. In some instances, the outage prediction and remediation platform 102 may maintain an accuracy threshold for the state machine, and may pause refinement (through the dynamic feedback loop) of the state machine if the corresponding accuracy is identified as greater than the accuracy threshold. Similarly, if the accuracy is identified as equal to or less than the accuracy threshold, the outage prediction and remediation platform 102 may resume refinement of the state machine through the dynamic feedback loop.
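A minimal sketch of this gating, reading the passage as pausing refinement while accuracy is above the threshold and resuming once accuracy is at or below it, is shown below. The class name, the way accuracy is measured (the fraction of predictions that matched the actual outcome), and the choice to keep refinement active before any outcomes are recorded are assumptions.

```python
class FeedbackLoop:
    """Tracks prediction outcomes and gates refinement on measured accuracy."""

    def __init__(self, accuracy_threshold):
        self.accuracy_threshold = accuracy_threshold
        self.correct = 0
        self.total = 0

    def record(self, predicted_failure, actual_failure):
        # Count a prediction as correct when it agrees with what actually happened.
        self.total += 1
        if predicted_failure == actual_failure:
            self.correct += 1

    @property
    def accuracy(self):
        # With no recorded outcomes, report 0.0 so that refinement stays active.
        return self.correct / self.total if self.total else 0.0

    @property
    def refinement_active(self):
        # Paused only while accuracy is strictly greater than the threshold.
        return self.accuracy <= self.accuracy_threshold
```

With a threshold of 0.8, four correct predictions out of five leave refinement active, a fifth correct prediction out of six pauses it, and a subsequent miss resumes it.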
Although only initial and one instance of additional telemetry data are described herein, this is for illustrative purposes only, and any number of additional rounds of telemetry data may be received and compared against the state machine using similar techniques to those described above. For example, as illustrated in
Furthermore, although the use of a state machine is primarily described, in some instances, alternative techniques, such as the use of a machine learning and/or artificial intelligence model may be used to produce similar results without departing from the scope of the disclosure. Furthermore, although the analysis of system telemetry data is primarily described, the methods described above may be used to analyze other types of information (e.g., application performance information, or the like) for failure prevention without departing from the scope of the disclosure.
One or more aspects of the disclosure may be embodied in computer-usable data or computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices to perform the operations described herein. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types when executed by one or more processors in a computer or other data processing device. The computer-executable instructions may be stored as computer-readable instructions on a computer-readable medium such as a hard disk, optical disk, removable storage media, solid-state memory, RAM, and the like. The functionality of the program modules may be combined or distributed as desired in various embodiments. In addition, the functionality may be embodied in whole or in part in firmware or hardware equivalents, such as integrated circuits, application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), and the like. Particular data structures may be used to more effectively implement one or more aspects of the disclosure, and such data structures are contemplated to be within the scope of computer-executable instructions and computer-usable data described herein.
Various aspects described herein may be embodied as a method, an apparatus, or as one or more computer-readable media storing computer-executable instructions. Accordingly, those aspects may take the form of an entirely hardware embodiment, an entirely software embodiment, an entirely firmware embodiment, or an embodiment combining software, hardware, and firmware aspects in any combination. In addition, various signals representing data or events as described herein may be transferred between a source and a destination in the form of light or electromagnetic waves traveling through signal-conducting media such as metal wires, optical fibers, or wireless transmission media (e.g., air or space). In general, the one or more computer-readable media may be and/or include one or more non-transitory computer-readable media.
As described herein, the various methods and acts may be operative across one or more computing servers and one or more networks. The functionality may be distributed in any manner, or may be located in a single computing device (e.g., a server, a client computer, and the like). For example, in alternative embodiments, one or more of the computing platforms discussed above may be combined into a single computing platform, and the various functions of each computing platform may be performed by the single computing platform. In such arrangements, any and/or all of the above-discussed communications between computing platforms may correspond to data being accessed, moved, modified, updated, and/or otherwise used by the single computing platform. Additionally or alternatively, one or more of the computing platforms discussed above may be implemented in one or more virtual machines that are provided by one or more physical computing devices. In such arrangements, the various functions of each computing platform may be performed by the one or more virtual machines, and any and/or all of the above-discussed communications between computing platforms may correspond to data being accessed, moved, modified, updated, and/or otherwise used by the one or more virtual machines.
Aspects of the disclosure have been described in terms of illustrative embodiments thereof. Numerous other embodiments, modifications, and variations within the scope and spirit of the appended claims will occur to persons of ordinary skill in the art from a review of this disclosure. For example, one or more of the steps depicted in the illustrative figures may be performed in other than the recited order, and one or more depicted steps may be optional in accordance with aspects of the disclosure.
This application is related to U.S. Application Ser. No. ______, filed May 17, 2023, and entitled “System and Method for Multi Image Matching for Outage Prediction, Prevention, and Mitigation for Technology Infrastructure Using Hybrid Deep Learning,” which is incorporated herein by reference in its entirety.