AUTOMATIC ERROR PREDICTION IN DATA CENTERS

Information

Patent Application
20230297453


Publication Number
20230297453
Date Filed
February 28, 2022
Date Published
September 21, 2023

Inventors
Original Assignees
- NVIDIA Corporation

CPC
- G06F11/004 - Error avoidance
- G06N20/20
- G06F2201/86 - Event-based monitoring
International Classifications
- G06F11/00
- G06N20/20

Abstract

Apparatuses, systems, and techniques to predict a probability of an error or anomaly in processing units, such as those of a data center. In at least one embodiment, the probability of an error occurring in a processing unit is identified using multiple trained machine learning models, in which each trained machine learning model outputs, for example, the probability of an error occurring within a different predetermined time period.
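
The multi-model arrangement summarized in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the application: the `HorizonModel` class, the logistic stand-in used in place of a real trained model (the claims mention recurrent neural networks as one option), the example weights, and the 0.5 decision threshold. The sketch only shows the shape of the idea: one model per look-ahead window, each returning a probability, followed by a decision based on all of the outputs.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class HorizonModel:
    """Hypothetical stand-in for one trained model tied to one time window."""
    horizon_hours: float
    weights: Sequence[float]
    bias: float

    def predict(self, features: Sequence[float]) -> float:
        # Logistic function over a weighted sum; a placeholder for a real model.
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))


def predict_all(models: Sequence[HorizonModel],
                features: Sequence[float]) -> Dict[float, float]:
    """Return one error probability per look-ahead window, keyed by hours."""
    return {m.horizon_hours: m.predict(features) for m in models}


def earliest_flagged_horizon(predictions: Dict[float, float],
                             threshold: float = 0.5) -> Optional[float]:
    """Return the shortest window whose probability meets the threshold.

    None means no window is flagged, so no preventative action is suggested.
    The threshold value is an assumption, not taken from the application.
    """
    for horizon, probability in sorted(predictions.items()):
        if probability >= threshold:
            return horizon
    return None


# Two illustrative models: a 1-hour window and a 24-hour window.
models = [
    HorizonModel(horizon_hours=1.0, weights=(0.8, 0.5), bias=-3.0),
    HorizonModel(horizon_hours=24.0, weights=(0.8, 0.5), bias=-1.0),
]
features = (2.0, 1.0)  # e.g. normalized temperature and power draw (assumed)
predictions = predict_all(models, features)
print(predictions, earliest_flagged_horizon(predictions))
```

Returning the shortest flagged window is one simple way to decide both whether and how urgently to act; a real system could weigh the windows differently.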

Description

Claims

1. A method comprising: receiving telemetry data for a device, wherein the telemetry data is indicative of at least one aspect of an operation of the device; processing an input based on the telemetry data using a plurality of trained machine learning models to generate a plurality of error predictions, wherein the plurality of trained machine learning models comprises: a first trained machine learning model that outputs a first error prediction of the plurality of error predictions, the first error prediction comprising a first probability of an error occurring within a first future time period; and a second trained machine learning model that outputs a second error prediction of the plurality of error predictions, the second error prediction comprising a second probability of an error occurring within a second future time period; and determining whether to perform a preventative action on the device based on the plurality of error predictions.
2. The method of claim 1, wherein the first error prediction identifies a type of potential error that will occur within the first future time period, and wherein the second error prediction identifies a type of potential error that will occur within the second future time period.
3. The method of claim 1, wherein the plurality of trained machine learning models further comprises: a third trained machine learning model that outputs a third error prediction of the plurality of error predictions, the third error prediction comprising a third probability of an error occurring within a third future time period; and a fourth trained machine learning model that outputs a fourth error prediction of the plurality of error predictions, the fourth error prediction comprising a fourth probability of an error occurring within a fourth future time period.
4. The method of claim 1, wherein each of the plurality of trained machine learning models comprises a recurrent neural network.
5. The method of claim 1, wherein the device comprises a graphical processing unit.
6. The method of claim 1, further comprising: determining when to perform the preventative action based on the plurality of error predictions.
7. The method of claim 1, further comprising: periodically retraining the plurality of trained machine learning models based on telemetry data for a plurality of devices that share a common device type that was generated after the plurality of trained machine learning models were last trained.
8. The method of claim 1, wherein the telemetry data comprises a first parameter and a second set of parameters, the method further comprising: determining a first value of the first parameter from the telemetry data; estimating a second value for the first parameter based on inputting the values of the second set of parameters into a function that relates the first parameter to the second set of parameters; determining a difference between the first value and the second value; and determining whether an anomaly is detected based on the difference between the first value and the second value.
9. The method of claim 1, wherein performing the preventative action comprises providing a notification that the device is predicted to experience at least one of an error, a fault, or failure within the first future time period or the second future time period.
10. A non-transitory computer-readable medium comprising instructions that, responsive to execution by a processing device, cause the processing device to perform operations comprising: receiving historical telemetry data for a plurality of devices that share a common device type; training a plurality of machine learning models to generate error predictions for devices having the device type based on the historical telemetry data, wherein training the plurality of machine learning models comprises: training a first machine learning model to output a first error prediction comprising a first probability of an error occurring within a first time period; and training a second machine learning model to output a second error prediction comprising a second probability of an error occurring within a second time period.
11. The non-transitory computer-readable medium of claim 10, wherein the plurality of devices comprise a plurality of graphical processing units of a data center.
12. The non-transitory computer-readable medium of claim 10, further causing the processing device to perform operations comprising: processing an input based on telemetry data for a device of the plurality of devices using the trained plurality of machine learning models to output the first error prediction and the second error prediction; and determining whether to perform a preventative action on the device based on the first error prediction and the second error prediction.
13. The non-transitory computer-readable medium of claim 10, wherein the first error prediction identifies a type of potential error that is estimated to occur within the first time period, and wherein the second error prediction identifies a type of potential error that is estimated to occur within the second time period.
14. The non-transitory computer-readable medium of claim 10, wherein training the plurality of machine learning models further comprises: training a third machine learning model to output a third error prediction comprising a third probability of an error occurring within a third time period; and training a fourth machine learning model to output a fourth error prediction comprising a fourth probability of an error occurring within a fourth time period.
15. The non-transitory computer-readable medium of claim 10, further causing the processing device to perform operations comprising: periodically retraining the plurality of machine learning models based on telemetry data for the plurality of devices that was generated after the plurality of machine learning models were last trained.
16. A system comprising: a memory device; and a processing device coupled to the memory device, wherein the processing device is to perform operations comprising: for each device of a plurality of devices of a data center, perform the following: receive telemetry data indicative of at least one aspect of an operation of the device, wherein the telemetry data comprises a first parameter and a second set of parameters; determine a first value of the first parameter from the telemetry data; estimate a second value for the first parameter based on inputting the values of the second set of parameters into a function that relates the first parameter to the second set of parameters; determine a difference between the first value and the second value; and determine whether an anomaly is detected based at least in part on the difference between the first value and the second value.
17. The system of claim 16, wherein the processing device is further to: generate the function using historical telemetry data; and periodically update the function using additional telemetry data received after the function was generated.
18. The system of claim 17, wherein the processing device is further to: estimate values for one or more additional parameters of the telemetry data from other parameters of the telemetry data using one or more additional functions; determine differences between the estimated values and measured values for the one or more additional parameters; and determine an anomaly score of the device based on a combination of the difference and the one or more additional differences; determine whether the anomaly is detected based on the anomaly score.
19. The system of claim 18, wherein the processing device is further to: determine a level of the anomaly of the device based on a location of the anomaly score of the device on a Gaussian distribution.
20. The system of claim 17, wherein the processing device is further to: process an input based on the telemetry data using a plurality of trained machine learning models to generate a plurality of error predictions, wherein the plurality of trained machine learning models comprises: a first trained machine learning model that outputs a first error prediction of the plurality of error predictions, the first error prediction comprising a first probability of an error occurring within a first future time period; and a second trained machine learning model that outputs a second error prediction of the plurality of error predictions, the second error prediction comprising a second probability of an error occurring within a second future time period; and determine whether to perform a preventative action on the device based on the plurality of error predictions.
21. The system of claim 20, wherein the processing device is further to: determine when to perform the preventative action based on the plurality of error predictions.
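
The anomaly-detection flow recited in claims 8 and 16 to 19 (estimate one parameter from others, take the difference from the measured value, combine differences into a score, and place the score on a Gaussian distribution) can be sketched as follows. This is an illustrative reading under stated assumptions, not the application's implementation. The assumptions are: a single-predictor least-squares line as the relating function, a root-mean-square combination of scaled differences, the 2-sigma and 3-sigma cutoffs, the level names, and the example power and temperature values.

```python
import math
from statistics import NormalDist, mean, pstdev
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept on historical telemetry."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def difference(measured: float, slope: float, intercept: float,
               x: float) -> float:
    """Measured value minus the value estimated from the other parameter."""
    return measured - (slope * x + intercept)


def anomaly_score(differences: Sequence[float],
                  scales: Sequence[float]) -> float:
    """Combine differences into one score (assumed: RMS of scaled values)."""
    return math.sqrt(mean((d / s) ** 2 for d, s in zip(differences, scales)))


def anomaly_level(score: float, history: Sequence[float]) -> str:
    """Locate a score on a Gaussian fitted to historical scores.

    The cutoffs and level names are assumptions for illustration.
    """
    dist = NormalDist(mean(history), pstdev(history))
    z = (score - dist.mean) / dist.stdev
    if z >= 3.0:
        return "severe"
    if z >= 2.0:
        return "moderate"
    return "normal"


# Hypothetical history: power draw (W) against temperature (deg C).
slope, intercept = fit_line([100.0, 150.0, 200.0, 250.0],
                            [50.0, 60.0, 70.0, 80.0])
# A device at 200 W is expected near 70 deg C but reports 95 deg C.
diff = difference(95.0, slope, intercept, 200.0)
score = anomaly_score([diff], [5.0])  # 5.0 is an assumed typical spread
print(diff, score, anomaly_level(score, [0.8, 1.0, 1.2, 1.0]))
```

Scaling each difference by a typical spread before combining keeps parameters with different units comparable; other combinations, such as a weighted sum, would fit the same claim language.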
