The field relates generally to information processing systems, and more particularly to device and component management in such information processing systems.
When devices or components thereof malfunction, a support case may be opened and shared with technical support personnel. The support case may be assigned a priority level based on the detected malfunction. In some cases, if the priority level does not indicate a serious issue, there may be some delay before support personnel attend to the problem. As a result of the delay, the severity of the case may escalate to a point where the affected devices or components are rendered inoperable and cannot be recovered before the issue is addressed.
Embodiments provide a state prediction and failure prevention platform in an information processing system.
For example, in one embodiment, a method comprises receiving data corresponding to operation of a plurality of elements, wherein the plurality of elements comprise at least one of a plurality of devices and a plurality of device components. The data corresponding to the operation of the plurality of elements comprises one or more operational states for respective ones of the plurality of elements. Using one or more machine learning algorithms, a future operational state of one or more elements of the plurality of elements is predicted. The prediction is based, at least in part, on the data corresponding to the operation of the plurality of elements. Using the one or more machine learning algorithms, one or more actions to prevent the one or more elements from transitioning to the future operational state are identified.
Further illustrative embodiments are provided in the form of a non-transitory computer-readable storage medium having embodied therein executable program code that when executed by a processor causes the processor to perform the above steps. Still further illustrative embodiments comprise an apparatus with a processor and a memory configured to perform the above steps.
These and other features and advantages of embodiments described herein will become more apparent from the accompanying drawings and the following detailed description.
Illustrative embodiments will be described herein with reference to exemplary information processing systems and associated computers, servers, storage devices and other processing devices. It is to be appreciated, however, that embodiments are not restricted to use with the particular illustrative system and device configurations shown. Accordingly, the term “information processing system” as used herein is intended to be broadly construed, so as to encompass, for example, processing systems comprising cloud computing and storage systems, as well as other types of processing systems comprising various combinations of physical and virtual processing resources. An information processing system may therefore comprise, for example, at least one data center or other type of cloud-based system that includes one or more clouds hosting tenants that access cloud resources. Such systems are considered examples of what are more generally referred to herein as cloud-based computing environments. Some cloud infrastructures are within the exclusive control and management of a given enterprise, and therefore are considered “private clouds.” The term “enterprise” as used herein is intended to be broadly construed, and may comprise, for example, one or more businesses, one or more corporations or any other one or more entities, groups, or organizations. An “entity” as illustratively used herein may be a person or system. On the other hand, cloud infrastructures that are used by multiple enterprises, and not necessarily controlled or managed by any of the multiple enterprises but rather respectively controlled and managed by third-party cloud providers, are typically considered “public clouds.” Enterprises can choose to host their applications or services on private clouds, public clouds, and/or a combination of private and public clouds (hybrid clouds) with a vast array of computing resources attached to or otherwise a part of the infrastructure. Numerous other types of enterprise computing and storage systems are also encompassed by the term “information processing system” as that term is broadly used herein.
As used herein, “real-time” refers to output within strict time constraints. Real-time output can be understood to be instantaneous or on the order of milliseconds or microseconds. Real-time output can occur when the connections with a network are continuous and a user device receives messages without any significant time delay. Of course, it should be understood that depending on the particular temporal nature of the system in which an embodiment is implemented, other appropriate timescales that provide at least contemporaneous performance and output can be achieved.
In a datacenter or other computing environment, systems management applications 105 monitor user devices 102 uninterruptedly and can provide remedial measures to address device and/or component issues identified in alerts received from the user devices 102. When an alert from a user device 102 is generated, the operational details of a faulty component may be collected by the systems management application 105 at the time of occurrence of the issue or failure of the component. A support case with this operational data and appropriate priority levels may be shared with technical support personnel via, for example, the one or more technical support devices 107.
The user devices 102, technical support devices 107 and devices on which the systems management applications 105 are executed comprise, for example, desktop, laptop or tablet computers, servers, host devices, storage devices, mobile telephones, Internet of Things (IoT) devices or other types of processing devices capable of communicating with the state prediction and failure prevention platform 110 over the network 104. Such devices are examples of what are more generally referred to herein as “processing devices.” Some of these processing devices are also generally referred to herein as “computers.” The user devices 102, technical support devices 107 and devices on which the systems management applications 105 are executed may also or alternately comprise virtualized computing resources, such as virtual machines (VMs), containers, etc. The user devices 102, technical support devices 107 and devices on which the systems management applications 105 are executed in some embodiments comprise respective computers associated with a particular company, organization or other enterprise.
The terms “user,” “customer,” “client,” “personnel” or “administrator” herein are intended to be broadly construed so as to encompass numerous arrangements of human, hardware, software or firmware entities, as well as combinations of such entities. State prediction and failure prevention services may be provided for users utilizing one or more machine learning models, although it is to be appreciated that other types of infrastructure arrangements could be used. At least a portion of the available services and functionalities provided by the state prediction and failure prevention platform 110 in some embodiments may be provided under Function-as-a-Service (“FaaS”), Containers-as-a-Service (“CaaS”) and/or Platform-as-a-Service (“PaaS”) models, including cloud-based FaaS, CaaS and PaaS environments.
Although not explicitly shown in
In some embodiments, the user devices 102, technical support devices 107 and/or devices on which the systems management applications 105 are executed are assumed to be associated with repair and/or support technicians, system administrators, information technology (IT) managers, software developers, release management personnel or other authorized personnel configured to access and utilize the state prediction and failure prevention platform 110.
The state prediction and failure prevention platform 110 in the present embodiment is assumed to be accessible to the user devices 102, technical support devices 107 and devices on which the systems management applications 105 are executed and vice versa over the network 104. The network 104 is assumed to comprise a portion of a global computer network such as the Internet, although other types of networks can be part of the network 104, including a wide area network (WAN), a local area network (LAN), a satellite network, a telephone or cable network, a cellular network, a wireless network such as a WiFi or WiMAX network, or various portions or combinations of these and other types of networks. The network 104 in some embodiments therefore comprises combinations of multiple different types of networks each comprising processing devices configured to communicate using Internet Protocol (IP) or other related communication protocols.
As a more particular example, some embodiments may utilize one or more high-speed local networks in which associated processing devices communicate with one another utilizing Peripheral Component Interconnect express (PCIe) cards of those devices, and networking protocols such as InfiniBand, Gigabit Ethernet or Fibre Channel. Numerous alternative networking arrangements are possible in a given embodiment, as will be appreciated by those skilled in the art.
Referring to
According to one or more embodiments, operational data (e.g., hardware component and firmware information) of the user devices 102 and/or components 103 can be recorded automatically by the systems management applications 105 and/or the data collection engine 120. For example, the operational data can be recorded at pre-defined time intervals set by, for example, the systems management applications 105. The systems management applications 105 may comprise, for example, one or more data collection applications such as, but not necessarily limited to, SupportAssist Enterprise available from Dell Technologies. In illustrative embodiments, the systems management applications 105 and/or the data collection engine 120 collect operational data from the user devices 102 and/or the components 103 by tracking service requests, through scheduled collections at designated times and/or through event-based collections. In some embodiments, the data collection engine 120 receives pushed data or collects data from the user devices 102, components 103 and/or the systems management applications 105.
When service requests for repair or to address issues corresponding to given ones of the user devices 102 or components 103 are initiated, the systems management applications 105 and/or data collection engine 120 process the service requests and collect operational data associated with the subject user device 102 and/or components 103 identified in the service request. Scheduled collections occur at pre-defined times or intervals specified by, for example, a user via one or more user devices 102, or are automatically scheduled by the systems management applications 105 and/or data collection engine 120. Event-based collections are triggered by one or more events such as, but not necessarily limited to, component or device failure, a detected degradation of performance of a component or device, installation of new software or firmware, the occurrence of certain operations, etc. In some embodiments, an integrated Dell® remote access controller (iDRAC) causes the data collection engine 120 and/or systems management applications 105 to collect operational data from one or more user devices 102 and/or components 103 and to export the collected data to a location (e.g., database or cache) on the state prediction and failure prevention platform 110 or to a shared network location (e.g., a centralized database). In some embodiments, the operational data is stored in a portion of the database 150.
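By way of a non-limiting example, the following Python sketch illustrates how the three collection modes described above (service-request-driven, scheduled and event-based) might be structured; the class, method and field names are hypothetical and do not correspond to any particular systems management application.

```python
import time
from datetime import datetime, timezone

# Hypothetical sketch of a collector supporting the collection modes
# described above; class, method and field names are illustrative only.
class OperationalDataCollector:
    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.records = []

    def collect(self, device_id, component_id, trigger):
        # A real collector would query device logs and telemetry here;
        # this sketch only records the collection event itself.
        self.records.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "device": device_id,
            "component": component_id,
            "trigger": trigger,  # "service_request", "scheduled" or "event"
        })

    def on_alert(self, device_id, component_id):
        # Event-based collection, e.g., triggered by a component failure
        # or a detected degradation of performance.
        self.collect(device_id, component_id, trigger="event")

    def run_scheduled(self, device_id, component_id, cycles=3):
        # Scheduled collection at a pre-defined time interval.
        for _ in range(cycles):
            self.collect(device_id, component_id, trigger="scheduled")
            time.sleep(self.interval)
```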
In some situations, the operational data is supplied to support teams via one or more technical support devices 107. The operational data helps support personnel resolve issues by providing the support teams with collected logs describing the reason for the failure of a particular component 103 or of a user device 102 as a whole.
In an operational example, a component (e.g., a hard disk drive (HDD)) of a server reports a performance issue. This is “State A” of the device, and operational data related to the HDD is collected. With these details, a systems management application 105 opens a case with a priority level of, for example, “medium” and transmits the case to technical support personnel at a time (t1) (e.g., via a technical support device 107). With current approaches, depending on the assigned priority level of the case, it takes time for the support team to provide a resolution for the performance issue. In the meantime, the HDD may stop working (“State B”) and another alert would be generated. The related details for State B are captured as operational data at time (t2). By the time the operational data is again collected at time (t3), the severity of the case has increased, and there may be instances where the server fails to perform vital functions (e.g., fails to load the operating system (OS) and is rendered non-responsive) (“State C”). State C represents a critical failure, and a “critical” priority case is transmitted to a technical support team. At this time (t3), the availability of the data in the server is at risk, and the data may not be recoverable.
The illustrative embodiments attempt to address the above problems by providing techniques which use machine learning to mitigate the risk of data unavailability and commence mitigation steps well in advance of issue escalation. With conventional approaches, based on the severity (e.g., priority) of a support ticket, there is a turnaround time (y1) required for a technical support team to analyze an issue for an impacted device, perform mitigation steps, and verify and close the issue. Within that time y1, the impacted device may transition to another, more critical state. However, the technical support team may be unaware of the most recent state and may perform mitigation steps based on the ticket raised at time t1. Such a scenario can cause the impacted device to reach an unrecoverable state before the issue is resolved or, in some cases, before the issue is even addressed.
The illustrative embodiments attempt to avoid this scenario by providing techniques which use machine learning to predict: (i) the next state of a user device 102 or component 103; (ii) the time it will take for the user device 102 or component 103 to reach the next state; (iii) mitigation steps to prevent transition to the next state; and (iv) how much time is needed and when to perform the predicted mitigation steps.
The illustrative embodiments advantageously use historical data about previous device and component issues to train a machine learning model to predict future device or component states and mitigation steps to prevent the progression to the future states well in advance of escalation of the device or component issues. As an additional advantage, the embodiments predict the time it will take to complete the predicted mitigation steps, such that a user can be timely informed and the device where the issue is raised does not reach a critical and/or unworkable state. As explained in more detail herein, in some embodiments, support tickets are used as a source for previously implemented mitigation steps. When a ticket is resolved by a technical support agent, the resolution steps are referenced in the ticket. A decision tree is generated in accordance with historical mitigation steps.
Referring to the operational flow 200 in
The operational data, including logs for various device and component operations, is collected by the systems management applications 105 and/or the data collection engine 120, for example, at periodic intervals, in connection with support tickets and/or when requested by a user. The collected data provides details of the states of the user devices 102 and components 103 over different time periods. The data sets, when collated, provide the present operational states, the operational states that the user devices 102 and/or components 103 have transitioned to (e.g., before and during resolution of a support ticket), and the durations of the transitions to those other states.
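By way of example, the following Python sketch shows one possible way to collate such log entries into observed state transitions and the durations spent in each state; the log field names are illustrative assumptions rather than a prescribed schema.

```python
from collections import defaultdict
from datetime import datetime

# Hedged sketch: collate per-device log entries (timestamp, state) into
# observed state transitions and dwell durations. Field names are assumed.
def collate_transitions(log_entries):
    """log_entries: list of dicts with 'device', 'timestamp' (ISO 8601), 'state'."""
    by_device = defaultdict(list)
    for entry in log_entries:
        by_device[entry["device"]].append(entry)

    transitions = []
    for device, entries in by_device.items():
        entries.sort(key=lambda e: e["timestamp"])
        for prev, curr in zip(entries, entries[1:]):
            if prev["state"] != curr["state"]:
                duration = (datetime.fromisoformat(curr["timestamp"])
                            - datetime.fromisoformat(prev["timestamp"]))
                transitions.append({
                    "device": device,
                    "from": prev["state"],
                    "to": curr["state"],
                    "duration": duration,
                })
    return transitions
```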
For example,
Referring to block 232 of
In an operational example using Markov analysis, when an alert is received from a faulty one of the components 103 or user devices 102, the future state is predicted without depending on the details of how the component 103 or user device 102 reached its present state. For example, referring to the Markov chain diagram 600 in
The state of a Markov chain is the value of $X_t$ at time $t$, and $S$ represents the state space of the chain (e.g., the possible values of $X_t$). The Markov chain analysis can be represented by the following formula (1):

$$P(X_{t+1}=S \mid X_t=S_t, X_{t-1}=S_{t-1}, \ldots, X_0=S_0) = P(X_{t+1}=S \mid X_t=S_t) \quad (1)$$

for all $t = 1, 2, 3, \ldots$ and for all device states $S_0, S_1, S_2, \ldots, S_t$.
To determine the probabilities of the next component or device states, a transition matrix in accordance with the Markov chain is generated. Referring to
Referring to formula (2) below, the matrix P lists all of the possible device states in the state space S. The matrix is a square matrix because its rows and columns are both indexed by the same state space S of size N:

$$p_{ij} = P(X_{t+1}=j \mid X_t=i) \quad (2)$$

where the entry $(i, j)$ gives the probability of transitioning from state $i$ to state $j$, that is, the conditional probability of next state $j$ given current state $i$.
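By way of example, the following Python sketch estimates a transition matrix of the form given in formula (2) from observed sequences of device states; the state names and sequences are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of formula (2): estimate a transition matrix P from
# observed state sequences, then read off next-state probabilities.
# State names below are illustrative examples only.
STATES = ["performance_issue", "component_failure", "device_failure"]

def transition_matrix(sequences, states=STATES):
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for seq in sequences:
        for now, nxt in zip(seq, seq[1:]):
            counts[idx[now], idx[nxt]] += 1
    # Normalize each row so that row i gives P(X_{t+1} = j | X_t = i).
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

# Probabilities of each next state given only the current state
# (the Markov property), for two illustrative observed sequences.
P = transition_matrix([
    ["performance_issue", "component_failure", "device_failure"],
    ["performance_issue", "performance_issue", "component_failure"],
])
print(dict(zip(STATES, P[STATES.index("performance_issue")])))
```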
Referring to block 233 of
In more detail, based on probability ratios derived using the Markov chain, the future device states are predicted by the second level state prediction layer 133. Using, for example, Mondrian conformal prediction, the second level state prediction layer 133 splits the dataset containing the probabilities of the transition of the device and component states into three datasets, namely training, calibration and test datasets. A random forest with multiple trees (e.g., ≥100 trees) is used for calculating a conformity measure of the predicted state probabilities.
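By way of example, the following Python sketch outlines a Mondrian (class-conditional) inductive conformal classifier that uses a random forest as the conformity measure; the use of the scikit-learn library and the 70/30 training/calibration split are assumptions for illustration, not a description of the platform's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hedged sketch of Mondrian (class-conditional) inductive conformal
# prediction with a random forest conformity measure; sklearn is assumed.
def conformal_p_values(X, y, x_new, n_trees=100, seed=0):
    # Split into proper training and calibration sets; the "test" set
    # here is the single new instance x_new.
    X_tr, X_cal, y_tr, y_cal = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)

    # Random forest with multiple (e.g., >= 100) trees.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X_tr, y_tr)

    cal_probs = rf.predict_proba(X_cal)
    new_probs = rf.predict_proba(np.asarray([x_new]))[0]

    p_values = {}
    for j, label in enumerate(rf.classes_):
        # Mondrian: compare only against calibration examples that
        # carry the same candidate label (class-conditional validity).
        mask = (np.asarray(y_cal) == label)
        cal_scores = 1.0 - cal_probs[mask, j]  # nonconformity scores
        new_score = 1.0 - new_probs[j]
        p_values[label] = ((cal_scores >= new_score).sum() + 1) / (mask.sum() + 1)
    return p_values  # higher p-value => the candidate next state conforms better
```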
Referring to block 234 of
To determine the mitigation steps for an issue, a decision tree classifier is used. The decision tree classifier is trained with the data corresponding to the operation of the user devices 102 and components 103 using a supervised learning technique. The tree-structured classifier evaluates problems through a graphical representation to obtain multiple possible solutions based on a given condition.
There can be multiple mitigation steps involved in the resolution of an issue. Referring to the decision tree 800 in
The table 900 in
The table 900 depicts sample data collected from issues raised for various user devices 102 and their corresponding components 103 for different types of issues or failures over a period of time, and lists the mitigation steps taken by the technical support agents to resolve the issues or failures.
In the columns, taken in order from left to right, “model” corresponds to a model of the user device 102 where the issue was seen, “component” corresponds to a component 103 of the user device 102 where the issue was seen, “issue reported date” corresponds to the date when the issue was reported to technical support agents, “operational data-based issue seen date” corresponds to the date when the issue was first seen after analyzing the operational data, “issue” corresponds to the actual issue reported, “mitigation steps” corresponds to the steps followed by technical support personnel to mitigate the issue (e.g., prevent the occurrence of the next state), “issue closure date” corresponds to the date when the issue was closed by technical support, “operational data-based issue closure date” corresponds to the date when the issue ceased to be reported after collating the operational data, “previous state” corresponds to a previous state of the device when the current issue was reported, as determined after collating the data, and “next state” corresponds to the next possible state of the user device 102 or component 103 with respect to the reported (e.g., current) issue.
The target variable for the decision tree is mitigation steps. To use the above data as training data and build the decision tree, the Gini index is computed for each of the features in accordance with the following formula (3).
$$\text{Gini Index} = 1 - \sum_j P_j^2 \quad (3)$$
$P_j$ is the proportion of class $j$ at a given node. The Gini index determines the purity or impurity of a feature at the time of creating a decision tree. In more detail, the Gini index is a measurement of the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set. In other words, it is used to determine the optimal splitting of a decision tree. To determine the root node for the decision tree, the mitigation layer 134 computes the weighted average of the Gini index for each of the features (e.g., the columns in the table 900). Once the Gini indexes of all of the features are determined, the feature with the lowest Gini index is assigned as the root node. Once the root node is determined, the process is repeated for each leaf node to construct the decision tree with a mapping based on the Gini indexes.
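By way of example, the following Python sketch computes the weighted average Gini index of formula (3) for each candidate feature and selects the feature with the lowest value as the root node; the data layout (a list of row dictionaries keyed by column name) is an assumption for illustration.

```python
from collections import Counter

# Sketch of formula (3) applied per feature, with "mitigation steps" as
# the target variable; the row/column layout is assumed for illustration.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature, target="mitigation_steps"):
    # Weighted average of the Gini index over the subsets induced by
    # each distinct value of the candidate feature.
    n = len(rows)
    total = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        total += (len(subset) / n) * gini(subset)
    return total

def pick_root(rows, features):
    # The feature with the lowest weighted Gini index becomes the root node.
    return min(features, key=lambda f: weighted_gini(rows, f))
```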
For example, in computing the Gini index for the “model” feature, the mitigation steps and the number of occurrences of each mitigation step are considered. Referring to the table 1000 in
Therefore, the weighted average Gini index for the “model” feature is 0.4166667. Similarly, the mitigation layer 134 computes the weighted average Gini index for the remaining features. The feature with the lowest Gini index is assigned as the root node of the decision tree. Assuming after the computations that the “model” feature results in the lowest Gini index, “model” is assigned as the root node, and a decision tree like the decision tree 1200 in
The decision tree 1200 includes mappings and classifications for the previous and next states associated with respective current states. The mitigation layer 134 uses the generated decision tree (e.g., decision tree 1200) to identify the mitigation steps for a particular issue based on the current, next and previous states associated with the issue.
Additionally, the time to start and perform the mitigation steps is determined based on the generated decision tree and the operational data. For example, the operational data-based issue seen date and the operational data-based issue closure date are used to find an estimated time that was consumed to address an issue for a particular user device 102 or component 103 with the given mitigation steps.
In more detail, the estimated duration to resolve an issue is t = (operational data-based issue closure date) − (operational data-based issue seen date). Based on this estimated duration t required to complete the mitigation steps while in the current state, referring to block 240 in
Referring to the table 1300 in
Referring to the timeline diagram 1402 in
$$\text{Threshold Time} = (t+y) - t - (y-2) = t+y-t-y+2 = 2$$
From the above calculation, it is determined that two days is the optimal time, or time left, for the user to start the mitigation steps (e.g., the Threshold Time). If the mitigation steps are not performed within those two days, the next state may be reached, resulting in failure or further degradation of the user device 102.
In the timeline diagrams 1401 and 1402, the time from t to t+y represents the time to transition from the current state to the next state. Based on the computations described above, the time represented by t to t+2 is the Threshold Time, identifying when the mitigation steps should be started by the user. By using the machine learning algorithms to predict the mitigation steps in real-time, or at the very least well in advance of progressing to the next state, the embodiments permit a user to be informed about the predicted mitigation steps before the Threshold Time runs out and the user device 102 and/or component 103 transitions to the next state.
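By way of example, the following Python sketch reproduces the threshold time arithmetic described above; the concrete dates and the predicted transition time y are illustrative assumptions.

```python
from datetime import date

# Illustrative dates; field names follow the table of collected issue data.
issue_seen = date(2024, 1, 10)    # operational data-based issue seen date
issue_closed = date(2024, 1, 14)  # operational data-based issue closure date

# Estimated duration required to complete the mitigation steps,
# derived from the operational data as described above.
mitigation_days = (issue_closed - issue_seen).days  # 4 days in this example

y = 6  # predicted number of days until transition to the next state

# Threshold Time: time left to start the mitigation steps so that they
# complete before the next state is reached, i.e. (t+y) - t - mitigation_days.
threshold_days = y - mitigation_days
print(threshold_days)  # 2 -> the user has two days to start the mitigation steps
```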
Using the one or more machine learning algorithms, the mitigation layer 134 identifies one or more actions (e.g., mitigation steps) to prevent the user devices 102 and/or components 103 from transitioning to a future operational state, and the time period in which to perform the one or more actions. A report generation layer 141 of the output engine 140 generates instructions comprising the one or more actions and identified time period, and causes transmission of the instructions to one or more user devices 102 and/or technical support devices 107 over network 104. In one or more embodiments, upon receipt of the instructions, the user devices 102 automatically implement one or more of the mitigation steps that can be performed by automated means such as, for example, initiating a dispatch, initiating a data migration, upgrading software and/or shutting down resource intensive services.
According to one or more embodiments, the database 150 and other data repositories or databases referred to herein can be configured according to a relational database management system (RDBMS) (e.g., PostgreSQL). In some embodiments, the database 150 and other data repositories or databases referred to herein are implemented using one or more storage systems or devices associated with the state prediction and failure prevention platform 110. In some embodiments, one or more of the storage systems utilized to implement the database 150 and other data repositories or databases referred to herein comprise a scale-out all-flash content addressable storage array or other type of storage array.
The term “storage system” as used herein is therefore intended to be broadly construed, and should not be viewed as being limited to content addressable storage systems or flash-based storage systems. A given storage system as the term is broadly used herein can comprise, for example, network-attached storage (NAS), storage area networks (SANs), direct-attached storage (DAS) and distributed DAS, as well as combinations of these and other storage types, including software-defined storage.
Other particular types of storage products that can be used in implementing storage systems in illustrative embodiments include all-flash and hybrid flash storage arrays, software-defined storage products, cloud storage products, object-based storage products, and scale-out NAS clusters. Combinations of multiple ones of these and other storage products can also be used in implementing a given storage system in an illustrative embodiment.
Although shown as elements of the state prediction and failure prevention platform 110, the data collection engine 120, state prediction and mitigation engine 130, output engine 140 and/or database 150 in other embodiments can be implemented at least in part externally to the state prediction and failure prevention platform 110, for example, as stand-alone servers, sets of servers or other types of systems coupled to the network 104. For example, the data collection engine 120, state prediction and mitigation engine 130, output engine 140 and/or database 150 may be provided as cloud services accessible by the state prediction and failure prevention platform 110.
The data collection engine 120, state prediction and mitigation engine 130, output engine 140 and/or database 150 in the
At least portions of the state prediction and failure prevention platform 110 and the elements thereof may be implemented at least in part in the form of software that is stored in memory and executed by a processor. The state prediction and failure prevention platform 110 and the elements thereof comprise further hardware and software required for running the state prediction and failure prevention platform 110, including, but not necessarily limited to, on-premises or cloud-based centralized hardware, graphics processing unit (GPU) hardware, virtualization infrastructure software and hardware, Docker containers, networking software and hardware, and cloud infrastructure software and hardware.
Although the data collection engine 120, state prediction and mitigation engine 130, output engine 140, database 150 and other elements of the state prediction and failure prevention platform 110 in the present embodiment are shown as part of the state prediction and failure prevention platform 110, at least a portion of the data collection engine 120, state prediction and mitigation engine 130, output engine 140, database 150 and other elements of the state prediction and failure prevention platform 110 in other embodiments may be implemented on one or more other processing platforms that are accessible to the state prediction and failure prevention platform 110 over one or more networks. Such elements can each be implemented at least in part within another system element or at least in part utilizing one or more stand-alone elements coupled to the network 104.
It is assumed that the state prediction and failure prevention platform 110 in the
The term “processing platform” as used herein is intended to be broadly construed so as to encompass, by way of illustration and without limitation, multiple sets of processing devices and one or more associated storage systems that are configured to communicate over one or more networks.
As a more particular example, the data collection engine 120, state prediction and mitigation engine 130, output engine 140, database 150 and other elements of the state prediction and failure prevention platform 110, and the elements thereof can each be implemented in the form of one or more LXCs running on one or more VMs. Other arrangements of one or more processing devices of a processing platform can be used to implement the data collection engine 120, state prediction and mitigation engine 130, output engine 140 and database 150, as well as other elements of the state prediction and failure prevention platform 110. Other portions of the system 100 can similarly be implemented using one or more processing devices of at least one processing platform.
Distributed implementations of the system 100 are possible, in which certain elements of the system reside in one data center in a first geographic location while other elements of the system reside in one or more other data centers in one or more other geographic locations that are potentially remote from the first geographic location. Thus, it is possible in some implementations of the system 100 for different portions of the state prediction and failure prevention platform 110 to reside in different data centers. Numerous other distributed implementations of the state prediction and failure prevention platform 110 are possible.
Accordingly, one or each of the data collection engine 120, state prediction and mitigation engine 130, output engine 140, database 150 and other elements of the state prediction and failure prevention platform 110 can each be implemented in a distributed manner so as to comprise a plurality of distributed elements implemented on respective ones of a plurality of compute nodes of the state prediction and failure prevention platform 110.
It is to be appreciated that these and other features of illustrative embodiments are presented by way of example only, and should not be construed as limiting in any way. Accordingly, different numbers, types and arrangements of system elements such as the data collection engine 120, state prediction and mitigation engine 130, output engine 140, database 150 and other elements of the state prediction and failure prevention platform 110, and the portions thereof can be used in other embodiments.
It should be understood that the particular sets of modules and other elements implemented in the system 100 as illustrated in
For example, as indicated previously, in some illustrative embodiments, functionality for the state prediction and failure prevention platform can be offered to cloud infrastructure customers or other users as part of FaaS, CaaS and/or PaaS offerings.
The operation of the information processing system 100 will now be described in further detail with reference to the flow diagram of
In step 1502, data corresponding to operation of a plurality of elements is received. The plurality of elements comprise at least one of a plurality of devices and a plurality of device components. The data corresponding to the operation of the plurality of elements comprises one or more operational states for respective ones of the plurality of elements. In step 1504, using one or more machine learning algorithms, a future operational state of one or more elements of the plurality of elements is predicted based, at least in part, on the data corresponding to the operation of the plurality of elements. In step 1506, using the one or more machine learning algorithms, one or more actions to prevent the one or more elements from transitioning to the future operational state are identified. Using the one or more machine learning algorithms, a time period in which to perform the one or more actions is identified. The one or more actions and the time period in which to perform the one or more actions are transmitted to at least one user device.
In illustrative embodiments, the data corresponding to the operation of the plurality of elements comprises a plurality of log entries. The data in the plurality of log entries is collated based, at least in part, on one or more of changes in the one or more operational states, durations of the one or more operational states before changing to a different operational state and whether the changes in the one or more operational states resulted in failure of one or more of the plurality of elements. The data in the plurality of log entries may also be collated based, at least in part, on one or more of device type, device component type, alerts indicating one or more issues with the plurality of elements and severity of the alerts.
In illustrative embodiments, predicting the future operational state of the one or more elements comprises using a stochastic model to predict respective probabilities of one or more future operational states of the one or more elements based on a most recent known operational state of the one or more elements. A matrix in accordance with the stochastic model is generated, wherein the matrix comprises one or more rows corresponding to the most recent known operational state of the one or more elements and one or more columns corresponding to the one or more future operational states of the one or more elements.
Predicting the future operational state of the one or more elements further comprises using a conformal prediction model to predict the future operational state of the one or more elements based on the respective probabilities of the one or more future operational states. A random forest model is used to compute a conformity value of the respective probabilities of the one or more future operational states.
Identifying the one or more actions comprises using the data corresponding to the operation of the plurality of elements as training data to generate a decision tree. Generating the decision tree comprises computing respective Gini indexes for respective ones of a plurality of features in the data corresponding to the operation of the plurality of elements. A root node of the decision tree is identified based, at least in part, on the respective Gini indexes. For at least one feature of the plurality of features, a computed Gini index is based, at least in part, on a plurality of actions that were performed to prevent at least a portion of the plurality of elements from transitioning to a plurality of future operational states, and a number of times respective ones of the plurality of actions were performed.
The one or more machine learning algorithms are trained with at least a portion of the data corresponding to the operation of the plurality of elements and comprise, for example, a random forest machine learning algorithm.
It is to be appreciated that the
The particular processing operations and other system functionality described in conjunction with the flow diagram of
Functionality such as that described in conjunction with the flow diagram of
Illustrative embodiments of systems with a state prediction and failure prevention platform as disclosed herein can provide a number of significant advantages relative to conventional arrangements. For example, the state prediction and failure prevention platform effectively uses machine learning techniques to predict future states of devices and components, which may lead to system failure. As an additional advantage, the embodiments provide techniques for using a stochastic model to predict, with minimal turnaround time, probabilities of future states. Based on the probabilities, conformal prediction techniques are leveraged to ascertain a next state a device or component would transition to from a current state if an issue is not resolved. At this stage a more accurate set of future device states is obtained. As a result, the embodiments enable more efficient use of compute resources, improve performance and reduce bottlenecks. For example, even though logs may be voluminous, the machine learning techniques implemented by the embodiments enable immediate (e.g., real-time) identification of issues in response to receipt of operational data.
The embodiments advantageously use machine learning algorithms to evaluate the operational data to predict device and component states. Unlike conventional techniques, the embodiments provide a framework for proactively predicting and alerting users of upcoming component and device states by analyzing current and previous states in operational logs. As an additional advantage, unlike current approaches, once an accurate set of future device or component states is obtained, the embodiments provide technical solutions to use machine learning to generate mitigation steps to prevent the transition to the future states. Additionally, an optimal time to start and complete performance of the mitigation steps is determined. A user is notified with the current and predicted future device or component states, the mitigation steps and the optimal time to start and perform the mitigation steps so that timely action can be taken to prevent unwanted or catastrophic failure or degradation of devices and components.
It is to be appreciated that the particular advantages described above and elsewhere herein are associated with particular illustrative embodiments and need not be present in other embodiments. Also, the particular types of information processing system features and functionality as illustrated in the drawings and described above are exemplary only, and numerous other arrangements may be used in other embodiments.
As noted above, at least portions of the information processing system 100 may be implemented using one or more processing platforms. A given such processing platform comprises at least one processing device comprising a processor coupled to a memory. The processor and memory in some embodiments comprise respective processor and memory elements of a virtual machine or container provided using one or more underlying physical machines. The term “processing device” as used herein is intended to be broadly construed so as to encompass a wide variety of different arrangements of physical processors, memories and other device components as well as virtual instances of such components. For example, a “processing device” in some embodiments can comprise or be executed across one or more virtual processors. Processing devices can therefore be physical or virtual and can be executed across one or more physical or virtual processors. It should also be noted that a given virtual device can be mapped to a portion of a physical one.
Some illustrative embodiments of a processing platform that may be used to implement at least a portion of an information processing system comprise cloud infrastructure including virtual machines and/or container sets implemented using a virtualization infrastructure that runs on a physical infrastructure. The cloud infrastructure further comprises sets of applications running on respective ones of the virtual machines and/or container sets.
These and other types of cloud infrastructure can be used to provide what is also referred to herein as a multi-tenant environment. One or more system elements such as the state prediction and failure prevention platform 110 or portions thereof are illustratively implemented for use by tenants of such a multi-tenant environment.
As mentioned previously, cloud infrastructure as disclosed herein can include cloud-based systems. Virtual machines provided in such systems can be used to implement at least portions of one or more of a computer system and a state prediction and failure prevention platform in illustrative embodiments. These and other cloud-based systems in illustrative embodiments can include object stores.
Illustrative embodiments of processing platforms will now be described in greater detail with reference to
The cloud infrastructure 1600 further comprises sets of applications 1610-1, 1610-2, . . . 1610-L running on respective ones of the VMs/container sets 1602-1, 1602-2, . . . 1602-L under the control of the virtualization infrastructure 1604. The VMs/container sets 1602 may comprise respective VMs, respective sets of one or more containers, or respective sets of one or more containers running in VMs.
In some implementations of the
In other implementations of the
As is apparent from the above, one or more of the processing modules or other components of system 100 may each run on a computer, server, storage device or other processing platform element. A given such element may be viewed as an example of what is more generally referred to herein as a “processing device.” The cloud infrastructure 1600 shown in
The processing platform 1700 in this embodiment comprises a portion of system 100 and includes a plurality of processing devices, denoted 1702-1, 1702-2, 1702-3, . . . 1702-K, which communicate with one another over a network 1704.
The network 1704 may comprise any type of network, including by way of example a global computer network such as the Internet, a WAN, a LAN, a satellite network, a telephone or cable network, a cellular network, a wireless network such as a WiFi or WiMAX network, or various portions or combinations of these and other types of networks.
The processing device 1702-1 in the processing platform 1700 comprises a processor 1710 coupled to a memory 1712. The processor 1710 may comprise a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a central processing unit (CPU), a graphical processing unit (GPU), a tensor processing unit (TPU), a video processing unit (VPU) or other type of processing circuitry, as well as portions or combinations of such circuitry elements.
The memory 1712 may comprise random access memory (RAM), read-only memory (ROM), flash memory or other types of memory, in any combination. The memory 1712 and other memories disclosed herein should be viewed as illustrative examples of what are more generally referred to as “processor-readable storage media” storing executable program code of one or more software programs.
Articles of manufacture comprising such processor-readable storage media are considered illustrative embodiments. A given such article of manufacture may comprise, for example, a storage array, a storage disk or an integrated circuit containing RAM, ROM, flash memory or other electronic memory, or any of a wide variety of other types of computer program products. The term “article of manufacture” as used herein should be understood to exclude transitory, propagating signals. Numerous other types of computer program products comprising processor-readable storage media can be used.
Also included in the processing device 1702-1 is network interface circuitry 1714, which is used to interface the processing device with the network 1704 and other system components, and may comprise conventional transceivers.
The other processing devices 1702 of the processing platform 1700 are assumed to be configured in a manner similar to that shown for processing device 1702-1 in the figure.
Again, the particular processing platform 1700 shown in the figure is presented by way of example only, and system 100 may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, servers, storage devices or other processing devices.
For example, other processing platforms used to implement illustrative embodiments can comprise converged infrastructure.
It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. At least a subset of these elements may be collectively implemented on a common processing platform, or each such element may be implemented on a separate processing platform.
As indicated previously, components of an information processing system as disclosed herein can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device. For example, at least portions of the functionality of one or more elements of the state prediction and failure prevention platform 110 as disclosed herein are illustratively implemented in the form of software running on one or more processing devices.
It should again be emphasized that the above-described embodiments are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the disclosed techniques are applicable to a wide variety of other types of information processing systems and state prediction and failure prevention platforms. Also, the particular configurations of system and device elements and associated processing operations illustratively shown in the drawings can be varied in other embodiments. Moreover, the various assumptions made above in the course of describing the illustrative embodiments should also be viewed as exemplary rather than as requirements or limitations of the disclosure. Numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.